
Link Reform To The Rescue


Okay, I can't wait any longer. I've had this in the works for a couple of months, and now that Andy has started a similar thread I'm forced to release my findings.

There's a W3C specification stating there can be only one robots.txt file per site, and it must reside in the root directory or it will be ignored. But wait, there's a META robots tag that can be implemented on individual pages; doesn't that solve the problem? In short, no. Even if a web site owner changes the tag and tells a spider not to index a particular page, the directive can't be applied to a specific domain, since each alias serves up exactly the same content: every aliased domain uses the same files as every other domain pointed at that virtual directory. The syntax is <meta name="robots" content="..."> and the allowed terms in content are index, noindex, follow, nofollow, all, and none. This can be changed programmatically (a sketch of what that might look like appears after the syntax proposals below), but again: most site owners don't have access to the necessary tools or resources to accomplish such a task.

Update: I've been told by an undisclosed Google source that less than 2% of Google's index consists of duplicate content. Even that seemingly small number equates to 160,000,000 pages of duplicate content in Google's 8,000,000,000-page index. I think that figure alone is enough to raise a few eyebrows. Once we have accurate data we can show, without a shadow of a doubt, that this change is a vital and necessary step in the evolution of the internet.

One Possible Solution

I propose an additional file that resides in the root directory of a web site. The file name is not important (linkreform.txt anyone?) but its functionality is critical. This file should address the single biggest issue facing search relevance: duplicate content, and the effect duplicate content is having on the search industry as a whole.

Possible Syntax A
=====================================
# linkreform.txt for http://www.mainurl.com/
#
# $Id: linkreform.txt,v 1.01a 2004/11/15 01:33:07 jdowdell
#
# Identify Main URLs That Should Be Crawled
#
# Main url preferred crawler starting point and to be used
# in search results
Parent-Domain: www.mainurl.com

# First Alias - non www version of url
Alias-Domain: mainurl.com

# Second Alias - .net version of url
Alias-Domain: www.mainurl.net
Alias-Domain: mainurl.net

# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl.com
Alias-Domain: aliasurl.com

# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl.net
Alias-Domain: aliasurl.net

# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl-a.com
Alias-Domain: aliasurl-a.com

# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl-a.net
Alias-Domain: aliasurl-a.net
=====================================

Here's another proposed format that simply states the main URL the spiders should crawl, allowing owners to anonymously point several domains to the same place without giving their competitors any information about their aliases.

Possible Syntax B
=====================================
# linkreform.txt for http://www.mainurl.com/
#
# $Id: linkreform.txt,v 1.01b 2004/11/15 01:38:11 jdowdell
#
# Identify Main URLs That Should Be Crawled
#
# Main url preferred crawler starting point and to be used
# in search results
Parent-Domain: www.mainurl.com
=====================================
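To make the proposal concrete, here's a minimal sketch of how a crawler might parse a linkreform.txt file and fold any listed alias back to its parent domain. The Parent-Domain and Alias-Domain field names come from Syntax A above; everything else (function names, sample data) is my own illustration, not part of any spec.

=====================================
# Sketch of a crawler-side parser for the proposed linkreform.txt
# (Syntax A above). Comments and blank lines are skipped; field
# names are matched case-insensitively.

def parse_linkreform(text):
    """Return (parent_domain, set_of_alias_domains) from file text."""
    parent, aliases = None, set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip().lower()
        if field == "parent-domain":
            parent = value
        elif field == "alias-domain":
            aliases.add(value)
    return parent, aliases

def canonical_host(host, parent, aliases):
    """Fold any listed alias back to the parent domain."""
    return parent if host.lower() in aliases else host

sample = """\
Parent-Domain: www.mainurl.com
Alias-Domain: mainurl.com
Alias-Domain: www.mainurl.net
"""
parent, aliases = parse_linkreform(sample)
print(canonical_host("mainurl.com", parent, aliases))      # www.mainurl.com
print(canonical_host("www.mainurl.com", parent, aliases))  # unchanged
=====================================

With a mapping like this, a search engine could crawl and index only the parent domain no matter which alias it discovered first.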
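And as promised earlier, here's a rough sketch of what varying the META robots tag programmatically could look like: a page handler that emits noindex on every host except the canonical one. The host names and function are hypothetical; this is exactly the kind of custom code most site owners can't easily deploy, which is why a file-based standard is needed.

=====================================
# Hypothetical per-domain META robots handler: same files, different
# directive depending on which host name the request came in on.

CANONICAL_HOST = "www.mainurl.com"

def robots_meta_tag(request_host):
    """Return the META robots tag to emit for the requesting domain."""
    if request_host.lower() == CANONICAL_HOST:
        # Canonical domain: allow indexing and link-following.
        return '<meta name="robots" content="index, follow">'
    # Any alias: serve the same content but ask spiders to skip it.
    return '<meta name="robots" content="noindex, nofollow">'

print(robots_meta_tag("www.mainurl.com"))  # index, follow
print(robots_meta_tag("aliasurl.net"))     # noindex, nofollow
=====================================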
The easiest implementation would be for the W3C to amend the robots.txt specification and allow the following line to be added to the file (a sketch of how a crawler might honor it follows the list below).

Possible Syntax C
=====================================
# robots.txt for http://www.mainurl.com/
#
# $Id: robots.txt,v 1.01b 2004/11/15 01:41:23 jdowdell
#
# Main url preferred crawler starting point and to be used
# in search results
User-agent: *
URL-to-crawl: www.mainurl.com
=====================================

By implementing this new standard we could...

  1. Reduce the bandwidth consumed by all major search crawlers.
  2. Reduce the resources needed to power major crawlers.
  3. Reduce the cost of hosting a web site and the demand on individual web site resources.
  4. Reduce the number of pages appearing in a search engine index that carry the same content on different domains.
  5. By doing no. 4, search engines could focus more finely tuned efforts on thwarting the practice of publishing duplicate content in an effort to rank higher.
  6. Increase end-user satisfaction rates by decreasing the amount of noise associated with typical search results.
  7. Facilitate more accurate results across all major search engines by removing non-spammers' duplicate-content pages from their indexes.
  8. Possible Side Effects: Financial and Sociological (both good and bad)
    1. Reduction in the amount of PPC revenue generated by search engines, since there would be more relevant results in the natural section.
    2. Conversely, it may increase PPC revenue, since results would be more accurate.
    3. Society isn't ready for the "less is more" approach just yet, since most internet users don't know the difference between natural results and paid listings.
    4. Search engines save money on overhead by using fewer resources for crawling, and web hosting providers save money on bandwidth since fewer requests would be made.
    5. Could completely backfire and engines that support it could lose face with visitors and advertisers.
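
As for Syntax C, here's the promised sketch of how a crawler might honor the proposed URL-to-crawl line: before crawling a host, fetch its robots.txt and, if the line names a different host, crawl that one instead. Only the URL-to-crawl directive comes from the proposal above; the surrounding fetch logic and function name are purely illustrative.

=====================================
# Sketch of a crawler honoring the proposed URL-to-crawl directive.
import urllib.request

def preferred_host(host):
    """Fetch http://host/robots.txt and return the URL-to-crawl value,
    falling back to the requested host if the line is absent."""
    try:
        url = "http://%s/robots.txt" % host
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return host  # robots.txt unreachable; crawl the host as given
    for line in body.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "url-to-crawl" and value.strip():
            return value.strip().lower()
    return host

# A spider asked to crawl an alias gets redirected to the parent, e.g.
# preferred_host("aliasurl.com") would return "www.mainurl.com" if that
# alias's robots.txt contained "URL-to-crawl: www.mainurl.com".
=====================================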
Tim Bray has previously made some recommendations for changes as well; he points to the W3C TAG issue at http://www.w3.org/2001/tag/issues.html#siteData-36. His points are good ones and boil down to this: simpler is better, and robots.txt is simple. That is my goal, and toward that goal I will work.

Suggestions, Thoughts, Comments

If you would like to provide feedback on these ideas, please submit a comment on this post for the time being. If there is enough support we'll create a site dedicated to fixing this issue.

=================================================

Update: I had previously stated that Tim Bray had "pledged his support" and I misstated that. My apologies to Tim. Tim said that if he has any good ideas on how to promote this idea he will pass them on. I need to get some sleep.

Jason Dowdell is a technology entrepreneur and operates the Marketing Shift blog.
