Screen Scraping Your Way Into RSS

December 17, 2004

RSS is one of the hottest technologies at the moment, and even big web publishers (such as the New York Times) are getting into it. However, there are still a lot of websites that do not have RSS feeds. If you want to check those websites in your favourite aggregator, you need to create your own RSS feed for them.

This can be done automatically with PHP, using a method called screen scraping. Screen scraping is usually frowned upon, as it's mostly used to steal content from other websites. I personally believe that in this case, where the goal is to automatically generate an RSS feed, screen scraping is not a bad thing.

Now, on to the code!

Getting the content

For this article, we'll use PHPit (http://www.phpit.net) as an example, despite the fact that PHPit already has RSS feeds.

The first step in screen scraping is getting the complete page. In PHP this can be done very easily with implode("", file("[the url here]")), if your web host allows it. If you can't use file(), you'll have to use a different method of getting the page, e.g. the cURL library.

Once the page is in a string, preg_match_all() pulls out the individual content items:

    <?php
    // Get page
    $url = "http://www.phpit.net/";
    $data = implode("", file($url));

    // Get content items.
    // [^`]*? lazily matches any characters, newlines included, as long as
    // the page contains no backtick; \s* tolerates whitespace around the body.
    preg_match_all("/<div class=\"contentitem\">\s*([^`]*?)\s*<\/div>/", $data, $matches);

The next step is to retrieve the information for each item, but first let's make a beginning on our feed by setting the appropriate header (text/xml) and printing the channel information.

    // Begin feed
    header("Content-Type: text/xml; charset=ISO-8859-1");
    echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
    ?>
    <rss version="2.0"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:content="http://purl.org/rss/1.0/modules/content/"
      xmlns:admin="http://webns.net/mvcb/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <channel>
            <title>PHPit Latest Content</title>
            <description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
            <link>http://www.phpit.net</link>
            <language>en-us</language>

Now it's time to loop through the items and print their RSS XML. For each item we extract the information we need (title, URL, text and author) with more regular expressions and preg_match(). After that, the RSS for the item is printed.

    <?php
    // Loop through each content item
    foreach ($matches[0] as $match) {
        // First, get title
        preg_match("/\">([^`]*?)<\/a><\/h3>/", $match, $temp);
        $title = $temp[1];
        $title = strip_tags($title);
        $title = trim($title);

        // Second, get url
        preg_match("/<a href=\"([^`]*?)\">/", $match, $temp);
        $url = $temp[1];
        $url = trim($url);

        // Third, get text
        preg_match("/<p>([^`]*?)<span class=\"byline\">/", $match, $temp);
        $text = $temp[1];
        $text = trim($text);

        // Fourth, and finally, get author
        preg_match("/<span class=\"byline\">By ([^`]*?)<\/span>/", $match, $temp);
        $author = $temp[1];
        $author = trim($author);

        // Echo RSS XML
        echo "\t\t<item>\n";
        echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
        echo "\t\t\t<link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
        echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
        echo "\t\t\t<content:encoded><![CDATA[\n";
        echo $text . "\n";
        echo "]]></content:encoded>\n";
        echo "\t\t\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
        echo "\t\t</item>\n";
    }
    ?>

And finally, the RSS file is closed off.

        </channel>
    </rss>

That's all. If you put all the code together, like in the demo script, then you'll have a working RSS feed.

Conclusion

In this tutorial I have shown you how to create an RSS feed for a website that does not have one of its own yet. Though the regular expressions are different for each website, the principle is exactly the same.

One thing I should mention is that you shouldn't immediately screen scrape a website's content. E-mail them first about an RSS feed. Who knows, they might set one up themselves, and that would be even better.

Download sample script at http://www.phpit.net/viewsource.php?url=/demo/screenscrape%20rss/example.php

*Previously published at http://www.aspit.net and
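
The extraction step from this article can be tried without fetching a live page. What follows is a minimal sketch under stated assumptions, not the author's demo script. The helper names fetch_page() and extract_items() are hypothetical, the sample markup is an assumption modelled on the regular expressions used in the article, and the \s* around the item body is an addition to tolerate whitespace. fetch_page() is one possible cURL-based replacement for file() and assumes the cURL extension is installed; it is defined here but not called.

```php
<?php
// Hypothetical helper (not from the article): fetch a page with cURL,
// for hosts where file() cannot open URLs.
function fetch_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $data = curl_exec($ch);                         // false on failure
    curl_close($ch);
    return $data;
}

// Hypothetical helper: apply the article's regular expressions to a page
// and return one associative array per content item.
function extract_items($html)
{
    $patterns = array(
        'title'  => "/\">([^`]*?)<\/a><\/h3>/",
        'url'    => "/<a href=\"([^`]*?)\">/",
        'text'   => "/<p>([^`]*?)<span class=\"byline\">/",
        'author' => "/<span class=\"byline\">By ([^`]*?)<\/span>/",
    );

    $items = array();
    preg_match_all("/<div class=\"contentitem\">\s*([^`]*?)\s*<\/div>/", $html, $matches);

    foreach ($matches[0] as $match) {
        $item = array();
        foreach ($patterns as $field => $pattern) {
            // Leave the field empty when the pattern does not match,
            // instead of reading an undefined array index.
            $item[$field] = preg_match($pattern, $match, $temp)
                ? trim(strip_tags($temp[1]))
                : '';
        }
        $items[] = $item;
    }
    return $items;
}

// Assumed sample markup; a real site will need its own patterns.
$sample = <<<'HTML'
<div class="contentitem">
<h3><a href="/article/1/">First article</a></h3>
<p>Summary of the first article. <span class="byline">By Alice</span></p>
</div>
<div class="contentitem">
<h3><a href="/article/2/">Second article</a></h3>
<p>Summary of the second article. <span class="byline">By Bob</span></p>
</div>
HTML;

$items = extract_items($sample);
foreach ($items as $item) {
    echo $item['title'] . " (" . $item['url'] . ") by " . $item['author'] . "\n";
}
```

Adapting this to another website should mostly mean changing the patterns; the loop and the feed output stay the same. For a live page, the string passed to extract_items() could come from fetch_page() or from implode("", file($url)).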
