The vast majority of the Web is intended for human readers. The goal has been to create an online experience for human beings, and the result is an open, ever-growing body of information. That is all great, but it presents some problems. There is just too much out there. We aren't sure which information to trust. We can get lost in the Web and waste a lot of time. So we need software tools to help us, but the information itself is not structured in a way that software can easily deal with. Enter the machine readable Web.

The most basic way for software to deal with information on the Web is simply to read the HTML of the pages and "analyze" it. This is what search engines do. They have software agents called spiders that walk the Web and index the pages, then use various techniques to give us the "best" pages for the search queries we enter. This is helpful and essential, but you still have to visit the pages (many pages) and try to find what you want. And you need to know when to go back for updated information. You may know that a page has the information you want and that it will be updated regularly, but you don't want to return again and again just to pull that one bit of information off that page.

There are tools called "screen scrapers," or Web page extractors, that can read pages and extract just the information you want, but the pages are unstructured and changing. The rules you describe for extracting the information may be complex and may stop working when the page changes. And content providers often don't want you to use their pages that way. They want you to look at the whole page, so that you will see the other messages they have there (like marketing messages), not just the bit you want. They try to put up a "no droids allowed" sign; in this case, "no robots, we want human eyeballs only."
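To make the fragility of screen scraping concrete, here is a minimal sketch using only Python's standard library. The page snippet and the "price" class name are invented for illustration; the point is that the extraction rule depends entirely on markup details the content provider is free to change at any time.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the class name "price" is an assumption.
PAGE = '<html><body><div class="price">$19.99</div></body></html>'

class PriceScraper(HTMLParser):
    """Pulls the text of the first <div class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # This rule breaks silently if the site renames the class
        # or wraps the price in a different tag.
        if tag == "div" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and self.price is None:
            self.price = data
            self.in_price = False

scraper = PriceScraper()
scraper.feed(PAGE)
print(scraper.price)  # → $19.99
```

A real scraper accumulates many such rules, one per site, and each one is hostage to the next page redesign.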
Some content providers realize that you can't always come to their site, and that if they give you a useful summary of what is on it, you might come more often to see the details (and the other stuff you really don't want to see, but live with to get the content you want). A very useful way of doing this is RSS. RSS (Really Simple Syndication) provides the summary in an XML file that a software agent can easily process. RSS news readers, or information aggregators, fetch the summary for you, and then you can decide whether to click through to see the details.

According to a Pew Internet report (see http://www.pewinternet.org/PPF/r/144/report_display.asp), 5% of Internet users are using RSS. Most of these people are classic early adopters, but it seems RSS is moving quickly toward wider adoption. Even this relatively simple standard was not easy to arrive at: there was a lot of conflict between the "keep it simple" crowd and the "more features" crowd (see http://itpapers.zdnet.com/whitepaper.aspx?scname=GSM&docid=97767).

So a machine readable Web is starting to become a reality with RSS and Web services, and it may progress even further with something like machine-to-machine communication or the Semantic Web. Early-adopter consumers are starting to embrace the idea via RSS. The key will be for content providers to adopt a richer set of machine readable formats, as they have started to do with RSS, while keeping them as simple as possible so that a wide variety of software developers can provide tools for the end users. This may be the key to making the Web even more useful.

Ron Tower is the President of Sugarloaf Software and is the developer of Personal Watchkeeper, an information aggregator supporting a variety of ways to summarize the Web. http://www.sugarloafsw.com
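To illustrate why an RSS summary is so much easier for a software agent to process than a full Web page, here is a minimal sketch using Python's standard library. The feed content, titles, and links are all invented for illustration; because RSS gives every item the same predictable structure, no site-specific extraction rules are needed.

```python
import xml.etree.ElementTree as ET

# A made-up minimal RSS 2.0 feed; the site, titles, and links are
# illustrative only.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Site</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(FEED)

# Every <item> has the same shape, so one generic rule handles any feed.
items = [(item.findtext("title"), item.findtext("link"))
         for item in root.iter("item")]

for title, link in items:
    print(f"{title}: {link}")
```

This is the whole job of an aggregator's fetch step: the structure is guaranteed by the format, not reverse-engineered from each site's HTML.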
The Machine Readable Web