
Writing Website Scrapers in PHP

February 26th, 2008, by admin

This article discusses how to write a website scraper in PHP for website data extraction. The concepts taught can be applied in Java, C#, or basically any language with powerful string-processing capabilities. This article teaches the basics of website scraping; a follow-up tutorial will cover finding search rankings on the Yahoo.com search engine.

Steps involved to write a scraping program

  1. Visit the URL
  2. Understand the pattern
  3. Validate the structure of pattern on different URLs
  4. Write the program
  5. Test the program using various input parameters

Let's walk through each of these steps, one at a time.

1. Visit the URL

For this tutorial, we will extract Yahoo’s “Today’s Top Searches” section towards the end of their home page (http://www.yahoo.com/).

2. Understand the pattern

Before you begin writing a web-scraping program, it's important to understand the pattern of the data you wish to extract. View the page source to find it. The markup we need to parse is shown below:

 
<div id="popsearchbd" class="bd">
<ol start=1><li><a href="r/dy/*-http://search.yahoo.com/search?p=Heidi+Klum&cs=bz&fr=fp-buzzmod">Heidi Klum</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Sarah+Larson&cs=bz&fr=fp-buzzmod">Sarah Larson</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Oscar+Videos&cs=bz&fr=fp-buzzmod">Oscar Videos</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Brad+Renfro&cs=bz&fr=fp-buzzmod">Brad Renfro</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Gary+Busey&cs=bz&fr=fp-buzzmod">Gary Busey</a></li></ol><ol start=6><li><a href="r/dy/*-http://search.yahoo.com/search?p=Barack+Obama&cs=bz&fr=fp-buzzmod">Barack Obama</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Razzie+Awards&cs=bz&fr=fp-buzzmod">Razzie Awards</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Raisin+in+the+Sun&cs=bz&fr=fp-buzzmod">Raisin in the Sun</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Stay+Home+Moms&cs=bz&fr=fp-buzzmod">Stay Home Moms</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Net+Neutrality&cs=bz&fr=fp-buzzmod">Net Neutrality</a></li></ol></div>
</div>

The pattern is that each search phrase is enclosed within a <li><a></a></li> tag. Therefore, we should parse everything between <a> and </a> of this text piece to get each phrase.
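To make the idea concrete, here is a minimal, self-contained sketch of that extraction step, run against a hard-coded snippet of the markup above (the full program later in this article does the same thing against the live page):

```php
<?php
// Extract every phrase sitting between fp-buzzmod"> and </a>.
// The $html snippet below is a trimmed copy of the markup shown above.
$html = '<li><a href="...fr=fp-buzzmod">Heidi Klum</a></li>'
      . '<li><a href="...fr=fp-buzzmod">Barack Obama</a></li>';

$marker  = 'fp-buzzmod">';
$phrases = array();
$pos = 0;
while (($pos = strpos($html, $marker, $pos)) !== false) {
    $pos += strlen($marker);              // jump past the marker
    $end  = strpos($html, '</a>', $pos);  // phrase ends at the closing tag
    $phrases[] = substr($html, $pos, $end - $pos);
}

foreach ($phrases as $phrase) {
    echo $phrase . "\n";
}
```

Running this prints "Heidi Klum" and "Barack Obama", one per line, which is exactly what the full scraper does with the live markup.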





3. Validate the structure of pattern on different URLs

If you are writing a script to fetch paginated data, remember to validate the structure on 3-4 pages before you start developing code. The reason is that the presentation of the first page could differ from that of subsequent pages.
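One simple way to do this is to write the "does this page match my pattern?" test as a small helper and run it against several pages before committing to the full scraper. This is only a sketch; the inline sample strings stand in for pages you would actually fetch or save:

```php
<?php
// Returns true if the page still contains the block we plan to scrape.
function looksLikeTopSearches($html) {
    return strpos($html, '<div id="popsearchbd"') !== false;
}

// Spot-check a couple of sample pages (inline strings for brevity;
// in practice you would run this against 3-4 fetched or saved pages).
$page1 = '<html><div id="popsearchbd" class="bd">...</div></html>';
$page2 = '<html><div id="somethingelse">...</div></html>';

echo looksLikeTopSearches($page1) ? "page1: pattern found\n" : "page1: PATTERN MISSING\n";
echo looksLikeTopSearches($page2) ? "page2: pattern found\n" : "page2: PATTERN MISSING\n";
```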

4. Write the program

You could use any programming language with good string handling, such as Java, C#, PHP, or Perl, for this processing. I have used PHP for this example.

 
$fp = fsockopen("www.yahoo.com", 80, $errno, $errstr, 30);
if (!$fp) {
    echo "$errstr ($errno)<br />\n";
    exit;
}

// Build and send a raw HTTP GET request for the home page.
$out  = "GET / HTTP/1.1\r\n";
$out .= "Host: www.yahoo.com\r\n";
$out .= "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12\r\n";
$out .= "Connection: Close\r\n\r\n";

fwrite($fp, $out);

// Read the full response into $str.
$str = "";
while (!feof($fp)) {
    $str .= fgets($fp, 1024);
}
fclose($fp);

// Locate the "Today's Top Searches" block.
$pos = strpos($str, "<div id=\"popsearchbd\"");

if ($pos === false) {
    echo "No information available";
} else {
    $pos += strlen("<div id=\"popsearchbd\"");

    // Each search phrase sits between fp-buzzmod"> and the closing </a>.
    while (1) {
        $pos = strpos($str, "fp-buzzmod\">", $pos);
        if ($pos === false) {
            break;
        }

        $pos += strlen("fp-buzzmod\">");
        $temppos = $pos;
        $pos = strpos($str, "</a>", $pos);

        $data = substr($str, $temppos, $pos - $temppos);
        echo $data . "\n";
    }
}

5. Test the program using various input parameters

You should test your program with all the parameters the web page can take. I have seen both layout and data change depending on the parameters passed.

Notes on processing forms and cookies

Some pages use form data and cookies to render their content. In such cases, remember to check the Request and Response headers and identify what is necessary to get the results you want. If the page requires a cookie value, include that cookie information in your Request headers. See the note below on the tool I use to inspect Request and Response headers.
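For example, a cookie is just one more line in the raw request headers built earlier. The cookie name and value below are made-up placeholders; substitute whatever your header inspection reveals:

```php
<?php
// Same raw-request style as the scraper above, plus a Cookie header.
// session_id=abc123 is a placeholder - use the real cookie you
// observed in the Request headers.
$out  = "GET /results HTTP/1.1\r\n";
$out .= "Host: www.example.com\r\n";
$out .= "Cookie: session_id=abc123; lang=en\r\n";
$out .= "Connection: Close\r\n\r\n";

// $out would then be sent with fwrite($fp, $out) exactly as before.
echo $out;
```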






Tool to inspect Request and Response Headers

I use Live HTTP Headers (a plug-in for Firefox) to check Request and Response headers. Visit http://livehttpheaders.mozdev.org for more details. To install the plug-in, visit http://livehttpheaders.mozdev.org/installation.html and click the 'download it' link for the latest release. Please read the release notes before installing a particular version.

Please feel free to use the comments section below to share other tools that you use to monitor Request and Response headers.

Future Maintenance of the program

From a maintenance perspective, you should monitor the page frequently and re-validate the HTML structure. Websites are not static: design changes can alter the HTML code and break your pattern. I recommend scheduling this check at least once a month.
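A cheap way to catch such breakage early is a sanity check after each run: if the scraper suddenly finds far fewer items than usual, the layout has probably changed. This is only a sketch; layoutLooksOk() and the threshold are made up for illustration:

```php
<?php
// Warn when a run returns suspiciously few phrases, which usually
// means the HTML structure changed and the pattern needs re-validation.
function layoutLooksOk(array $phrases, $minExpected = 5) {
    return count($phrases) >= $minExpected;
}

$phrases = array('Heidi Klum', 'Barack Obama');   // pretend scrape result
if (!layoutLooksOk($phrases)) {
    error_log('Scraper warning: layout may have changed - re-validate the HTML.');
}
```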

That is all for this tutorial. In the next one, I will guide you step by step through creating a program that checks your yahoo.com search-engine ranking. Use the subscribe form below to be notified when that tutorial goes live.

Would you like to check out some of my PHP5 Tutorials as well?


  1. February 26th, 2008 at 23:50 | #1

I have written a load() function to get the data from a URL in PHP. It's very easy to use.

To do pattern matching, use regular expressions – they will make the job much easier.

  2. February 27th, 2008 at 00:55 | #2

    Very nice article.

  3. March 12th, 2008 at 20:36 | #3

    I enjoyed that article and I believe it answers a lot of questions that many beginners will have.

  4. April 8th, 2008 at 16:18 | #4

    would have a zip file somewhere where we could try that? thanks …

  5. April 9th, 2008 at 22:24 | #5

    Thanks for the article. I would also recommend using regular expression to the matching, specifically preg_match.

    Keep up the posts!

  6. admin
    May 10th, 2008 at 22:09 | #6

    This is an amazing article

  7. May 10th, 2008 at 22:10 | #7

    Amazing article

  8. Jamil sadi
    August 26th, 2008 at 11:08 | #8

I am a little confused about how this scraper will run. I mean, will we send the request through a browser?
What if we want our scraper to run 24 hours a day, without overloading the target site, things like that.

    1. How/Where to run the script. and its going to save the data into the db (mysql).
    2. scraper running 24×7. not overloading target site.

  9. Warrlocke Fernanadez
    November 14th, 2008 at 12:27 | #9

    Helped me much!!!!!!!!!!!

  10. admin
    November 15th, 2008 at 19:41 | #10

    Hi Jamil,

    This article was only to explain to you what it means to scrape and how you can extract data you want.

    You will probably have a target site in mind and develop your string processing logic accordingly.

    As regards your two questions about where to run and 24×7 – this depends on your requirement.

    If you have a site that updates values every 3 hours – you will probably need to run the script every 3 hours.

But if you want to extract all data from a site – then you need a resting period between hits, as the webmaster might otherwise notice your intentions.

    Regards,
    Suniil

  11. blaaze
    November 20th, 2008 at 01:55 | #11

    hi everyone.
    can anyone help me here
    actually im looking for a script that need to satisfy these things

    1.it must be in php language
    2.it must work like a robot, when initiated it must run automatically on server side only
    3.it must extract the text of any website that is mentioned in a separate text file
    4.that text file contains the url’s of web sites that is to be parsed(each url in single line)
    5.after parsing each url, the parsed data is to be sended to the pre-embedded email address in the code of that script

    ur help is much appreciated!!
    thanks in advance
    abhilash

  12. Neema Tiwari
    January 6th, 2009 at 18:35 | #12

    hello…..

    I am not getting how this will work????

    Can u please tell me..

    Thanks
    Neema Tiwari

  13. April 6th, 2009 at 15:52 | #13

    good idea but can html be mixed with it to make it a bit easier?

  14. June 2nd, 2009 at 22:11 | #14

    A line by line explanation of the code would be useful.

  15. admin
    June 6th, 2009 at 17:07 | #15

    Hi Jon,

    That is a good idea.

    Till then, let me know what area of the code you need more explanation for?

    Regards,
    Suniil
    http://www.twitter.com/sunilbhatia79 – Follow me

  16. July 5th, 2009 at 07:53 | #16

Finding the common pattern and writing appropriate scripts are not easy. I have spent about one year finding an approach to generate scripts automatically. Coding is no longer required with the help of a friendly GUI. Now the software can be downloaded for free.

    Download: http://www.gooseeker.com/en/node/download/front

  17. July 8th, 2009 at 08:35 | #17

i would prefer regex match to your code.. as in long run, the html would keep changing and it would be easier to change a single expression rather than editing the whole stuff.. anyways good job.

  18. August 27th, 2009 at 10:54 | #18

    this tutorial is very good, i will definitely try this.

  19. September 30th, 2009 at 10:51 | #19

    Very Very interesting Article for Logic ,Client management and lots more ..

    Thanks

  20. October 27th, 2009 at 00:24 | #20

    A very neat scraper. Any reason you didn’t just use regex and wget? Is there much performance gain?

  21. admin
    October 27th, 2009 at 14:58 | #21

    Hi En,

    Thank you for your comment…

    I understand that regex can be used… however, I only wanted to explain the concept of web html parsing…

    Once these concepts are mastered – it can easily be applied using regex.

    Regards,
    Suniil

  22. January 12th, 2010 at 04:14 | #22

    great article!

  23. jaredthensomenumbers
    January 27th, 2010 at 23:37 | #23

    regex is a really old way of pattern matching for web parsing, it breaks far too easily.

    you may want to read up about XPath and similar ways of parsing by using the DOM:

    http://php.net/manual/en/class.domxpath.php
    http://code.google.com/p/phpquery/

  24. February 1st, 2012 at 09:43 | #24

    Really impressed by ur way of presenting it. Thanks a lot !!
