Writing Website Scrapers in PHP

This article explains how to write a website scraper in PHP to extract data from web pages. The concepts apply equally to Java, C#, or any other language with strong string-processing capabilities. It covers the basics of website scraping; a follow-up tutorial will cover finding web ranking from the Yahoo.com search engine.

Steps involved in writing a scraping program

  1. Visit the URL
  2. Understand the pattern
  3. Validate the structure of pattern on different URLs
  4. Write the program
  5. Test the program using various input parameters

Let's visit each of these steps one at a time.

1. Visit the URL

For this tutorial, we will extract Yahoo’s “Today’s Top Searches” section towards the end of their home page (http://www.yahoo.com/).

2. Understand the pattern

Before you begin to write a web scraping program, it's important to understand the pattern of the data that you wish to extract. View the page source to find that pattern.

The string of text that we should parse is given below:

 
<div id="popsearchbd" class="bd">
<ol start=1><li><a href="r/dy/*-http://search.yahoo.com/search?p=Heidi+Klum&cs=bz&fr=fp-buzzmod">Heidi Klum</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Sarah+Larson&cs=bz&fr=fp-buzzmod">Sarah Larson</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Oscar+Videos&cs=bz&fr=fp-buzzmod">Oscar Videos</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Brad+Renfro&cs=bz&fr=fp-buzzmod">Brad Renfro</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Gary+Busey&cs=bz&fr=fp-buzzmod">Gary Busey</a></li></ol><ol start=6><li><a href="r/dy/*-http://search.yahoo.com/search?p=Barack+Obama&cs=bz&fr=fp-buzzmod">Barack Obama</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Razzie+Awards&cs=bz&fr=fp-buzzmod">Razzie Awards</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Raisin+in+the+Sun&cs=bz&fr=fp-buzzmod">Raisin in the Sun</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Stay+Home+Moms&cs=bz&fr=fp-buzzmod">Stay Home Moms</a></li><li><a href="r/dy/*-http://search.yahoo.com/search?p=Net+Neutrality&cs=bz&fr=fp-buzzmod">Net Neutrality</a></li></ol></div>
</div>

The pattern is that each search phrase is enclosed within an <li> tag, as the text of a link whose href ends with fp-buzzmod. Therefore, we should parse everything between the fp-buzzmod"> marker and the closing </a> of each link to get the text.

3. Validate the structure of pattern on different URLs

If you are writing a script to fetch paginated data, validate the structure on 3 - 4 pages before you start writing code, because the markup of the first page can differ from that of subsequent pages.
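One way to make this check repeatable is to look for the same markers in every page you have collected. The sketch below is only an illustration: findMissingMarkers() is a hypothetical helper, and the two sample pages are stand-ins for HTML you would fetch yourself.

```php
<?php
// Hypothetical helper: returns the markers that do NOT appear in $html.
function findMissingMarkers(string $html, array $markers): array
{
    $missing = [];
    foreach ($markers as $marker) {
        // strpos() returns false when the marker is absent.
        if (strpos($html, $marker) === false) {
            $missing[] = $marker;
        }
    }
    return $missing;
}

// Markers the scraper in this tutorial relies on.
$markers = ['<div id="popsearchbd"', 'fp-buzzmod">', '</a>'];

// Stand-in pages (assumption); in practice these would be 3 - 4 real pages.
$pages = [
    'page1' => '<div id="popsearchbd" class="bd"><a href="x?fr=fp-buzzmod">Heidi Klum</a></div>',
    'page2' => '<div id="searches"><a href="x?fr=other">Heidi Klum</a></div>',
];

$report = [];
foreach ($pages as $name => $html) {
    $report[$name] = findMissingMarkers($html, $markers);
    $status = $report[$name] ? 'missing ' . implode(', ', $report[$name]) : 'ok';
    echo $name . ': ' . $status . "\n";
}
```

A page that reports missing markers needs a closer look before the scraper is pointed at it.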

4. Write the program

You could use any programming language, such as Java, C#, PHP, or Perl, for this processing. I have used PHP for this example.

     
    <?php
    // Open a plain HTTP connection to the Yahoo home page.
    $str = "";
    $fp = fsockopen("www.yahoo.com", 80, $errno, $errstr, 30);
    if (!$fp) {
        echo "$errstr ($errno)<br />\n";
    } else {
        $out = "GET / HTTP/1.1\r\n";
        $out .= "Host: www.yahoo.com\r\n";
        $out .= "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12\r\n";
        $out .= "Connection: Close\r\n\r\n";

        // Send the request and read the whole response into $str.
        fwrite($fp, $out);
        while (!feof($fp)) {
            $str .= fgets($fp, 1024);
        }
        fclose($fp);
    }

    // Locate the container that holds the top searches.
    $marker = "<div id=\"popsearchbd\"";
    $pos = strpos($str, $marker);

    // strpos() returns false when nothing is found, so compare with ===
    // before doing any arithmetic on the position.
    if ($pos === false) {
        echo "No information available";
    } else {
        $pos = $pos + strlen($marker);

        while (1) {
            // Each search phrase starts right after the end of its link's href.
            $pos = strpos($str, "fp-buzzmod\">", $pos);
            if ($pos === false) {
                break;
            }

            $pos = $pos + strlen("fp-buzzmod\">");
            $temppos = $pos;

            // The phrase ends at the closing </a>; stop if it is missing.
            $pos = strpos($str, "</a>", $pos);
            if ($pos === false) {
                break;
            }

            $datalength = $pos - $temppos;
            $data = substr($str, $temppos, $datalength);
            echo $data;
            echo "\n";
        }
    }
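The same extraction can also be sketched with a regular expression instead of walking the string with strpos(). This is a different technique from the program above, not the method used in it: extractPhrases() is a hypothetical name, the sample string is a shortened copy of the markup from step 2, and the pattern matches every fp-buzzmod link in the input rather than only those after the popsearchbd container.

```php
<?php
// Hypothetical helper: returns the link text of every "fp-buzzmod" anchor.
function extractPhrases(string $html): array
{
    // Capture everything between the end of the href and the closing </a>.
    $pattern = '/fp-buzzmod">(.*?)<\/a>/s';
    if (preg_match_all($pattern, $html, $matches) === false) {
        return [];
    }
    return $matches[1];
}

// Shortened copy of the markup shown in step 2.
$sample = '<div id="popsearchbd" class="bd"><ol start=1>'
    . '<li><a href="r/dy/*-http://search.yahoo.com/search?p=Heidi+Klum&cs=bz&fr=fp-buzzmod">Heidi Klum</a></li>'
    . '<li><a href="r/dy/*-http://search.yahoo.com/search?p=Sarah+Larson&cs=bz&fr=fp-buzzmod">Sarah Larson</a></li>'
    . '</ol></div>';

$phrases = extractPhrases($sample);
foreach ($phrases as $phrase) {
    echo $phrase . "\n";
}
```

The regular expression is shorter, but it is just as tied to the page's markup as the strpos() version, so the validation and maintenance steps still apply.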

5. Test the program using various input parameters

You should test your program with all the parameters that the web page can take. In my experience, both the layout and the data can change depending on the parameters that are passed.
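A simple way to organise such a test is to generate one URL per parameter combination and run the scraper against each. The sketch below only builds the URLs. buildTestUrls() is a hypothetical helper, the parameter names p and fr are taken from the links in the sample markup, and the values are made up for illustration.

```php
<?php
// Hypothetical helper: builds one URL per parameter set.
function buildTestUrls(string $base, array $parameterSets): array
{
    $urls = [];
    foreach ($parameterSets as $parameters) {
        // http_build_query() URL-encodes the names and values.
        $urls[] = $base . '?' . http_build_query($parameters, '', '&');
    }
    return $urls;
}

// Illustrative parameter sets (assumption): a normal phrase,
// a phrase with an extra parameter, and an empty phrase.
$urls = buildTestUrls('http://search.yahoo.com/search', [
    ['p' => 'Heidi Klum'],
    ['p' => 'Net Neutrality', 'fr' => 'fp-buzzmod'],
    ['p' => ''],
]);

foreach ($urls as $url) {
    echo $url . "\n";
}
```

Including edge cases such as an empty value helps reveal the alternative layouts that a page can return.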

Notes on processing forms and cookies

Some pages use form data and cookies to render data. In such cases, check the request and response headers and identify what is necessary to get the results that you want. If the page requires a cookie value, send that cookie in your request headers. The tool I use to inspect request and response headers is described in the next section.
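As a rough sketch of the cookie case, the code below pulls the name=value pairs out of Set-Cookie response headers and adds them to a request as a Cookie header. Both helper functions are hypothetical, the response headers and host name are made up for illustration, and cookie attributes such as path, domain, and expiry are ignored.

```php
<?php
// Hypothetical helper: collects name=value pairs from Set-Cookie headers.
function cookiesFromResponse(string $rawHeaders): array
{
    $cookies = [];
    foreach (explode("\r\n", $rawHeaders) as $line) {
        // Header names are case-insensitive, so use stripos().
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        $value = trim(substr($line, strlen('Set-Cookie:')));
        // Keep only "name=value"; attributes follow the first ";".
        $pair = explode(';', $value, 2)[0];
        $parts = explode('=', $pair, 2);
        if (count($parts) === 2) {
            $cookies[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $cookies;
}

// Hypothetical helper: formats cookies as a single request header line.
function cookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs) . "\r\n";
}

// Made-up response headers for illustration.
$response = "HTTP/1.1 200 OK\r\n"
    . "Set-Cookie: B=abc123; path=/; domain=.example.com\r\n"
    . "Set-Cookie: lang=en; path=/\r\n"
    . "Content-Type: text/html\r\n";

$cookies = cookiesFromResponse($response);

// Build the follow-up request with the cookies attached.
$out = "GET / HTTP/1.1\r\n";
$out .= "Host: www.example.com\r\n";
$out .= cookieHeader($cookies);
$out .= "Connection: Close\r\n\r\n";
```

The resulting $out string can be sent with fwrite() in the same way as the request in step 4.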






Tool to inspect Request and Response Headers

I use Live HTTP Headers (a plug-in for Firefox) to check request and response headers. Visit http://livehttpheaders.mozdev.org for more details. To install the plug-in, visit http://livehttpheaders.mozdev.org/installation.html and click the ‘download it’ link for the latest release. Please read the release notes before installing a particular version.

Please feel free to use the comments section to share other tools that you use to monitor request and response headers.

Future Maintenance of the program

From a maintenance perspective, you should monitor the page regularly and re-validate the HTML structure, because websites change over time and a design change can alter the HTML that your scraper depends on. I recommend scheduling this check at least once a month.
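One possible way to automate part of this check is to store a fingerprint of the page's tag structure and compare it on each run. The sketch below is a coarse illustration under stated assumptions: structureSignature() is a hypothetical helper, the three HTML strings are made-up samples, and the fingerprint also changes when the number of elements changes, so a mismatch is only a prompt to look at the page by hand.

```php
<?php
// Hypothetical helper: fingerprints the sequence of opening tags in a page.
function structureSignature(string $html): string
{
    // Collect tag names in document order, ignoring text and attributes.
    preg_match_all('/<([a-zA-Z][a-zA-Z0-9]*)/', $html, $matches);
    return md5(strtolower(implode(',', $matches[1])));
}

// Made-up samples: same layout with new data, and a redesigned layout.
$lastMonth = '<div id="popsearchbd"><ol><li><a href="a">Heidi Klum</a></li></ol></div>';
$sameLayout = '<div id="popsearchbd"><ol><li><a href="b">Barack Obama</a></li></ol></div>';
$redesigned = '<div id="popsearchbd"><ul><li><span>Barack Obama</span></li></ul></div>';

// In practice the saved signature would be read from a file or database.
$saved = structureSignature($lastMonth);

echo structureSignature($sameLayout) === $saved ? "unchanged\n" : "structure changed\n";
echo structureSignature($redesigned) === $saved ? "unchanged\n" : "structure changed\n";
```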

That is all for this tutorial. In the next tutorial, I will guide you step by step through creating a program that checks yahoo.com search engine ranking.


    8 Responses to “Writing Website Scrapers in PHP”

    1. I have written a load() function to get the data from a URL in PHP. It's very easy to use.

      To do pattern matching, use regular expressions - they will make the job much easier.

    2. Very nice article.

    3. I enjoyed that article and I believe it answers a lot of questions that many beginners will have.

    4. would have a zip file somewhere where we could try that? thanks …

    5. Thanks for the article. I would also recommend using regular expressions for the matching, specifically preg_match.

      Keep up the posts!

    6. […] Bhatia has an article on writing website scrapers in php. His tutorial goes through the basics, and is written with newbies in mind. An excellent stepping […]

    7. This is an amazing article

    8. Amazing article
