Link Reform To The Rescue

Okay, I can't wait any longer. I've had this in the works for a couple of months, and now that Andy has started a similar thread I'm forced to release my findings.
There's a W3C specification that states there can be only one robots.txt per site, and that it must reside in the root directory or it will be ignored.
But wait, there's a META Robots tag that can be implemented on individual pages; doesn't that solve the problem? In short, no.
Even if a web site owner can change the tag and tell a spider not to index a particular page, the tag can't be applied to a specific domain, since each alias serves up the exact same content: every aliased domain uses the same files as every other site hosted in that virtual directory. This can be changed programmatically, but again, most site owners don't have access to the tools or resources needed to accomplish such a task.
The syntax is <meta name="robots" content="...">, and the allowed terms in content are index, noindex, follow, and nofollow; nothing in that list lets you single out a domain.

Update: I've been told by an undisclosed Google source that less than 2% of Google's index consists of duplicate content. Even that seemingly small number equates to 160,000,000 pages of duplicate content in Google's 8,000,000,000-page index. I think that data alone is enough to raise a few eyebrows.
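To make the alias problem concrete, here's a quick illustration (a hypothetical sketch; www.mainurl.com and www.aliasurl.com are placeholders for any pair of aliased domains). Because both hostnames serve the same files, fetching the same path from each returns byte-for-byte identical pages, META Robots tag and all:
=====================================
import hashlib
import urllib.request

# Hypothetical aliased hosts serving the same virtual directory;
# substitute any pair of domains that point at the same files.
HOSTS = ["www.mainurl.com", "www.aliasurl.com"]
PATH = "/index.html"

digests = {}
for host in HOSTS:
    with urllib.request.urlopen(f"http://{host}{PATH}") as response:
        digests[host] = hashlib.md5(response.read()).hexdigest()

# Identical digests mean identical pages: the META Robots tag is the
# same on every alias, so it can't tell a spider which domain is "real".
print(digests)
=====================================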
Once we have accurate data we can show, beyond a shadow of a doubt, that this change is a vital and necessary step in the evolution of the internet.
One Possible Solution
I propose that an additional file be created that resides in the root directory of a web site. The file name is not important (linkreform.txt, anyone?) but its functionality is critical. This file should address the single biggest issue facing search relevance: duplicate content, and the effect it is having on the search industry as a whole.
Possible Syntax A
=====================================
# linkreform.txt for http://www.mainurl.com/
#
# $Id: linkreform.txt,v 1.01a 2004/11/15 01:33:07 jdowdell
#
# Identify Main URLs That Should Be Crawled
#
# Main URL: preferred crawler starting point and the URL
# to be used in search results
Parent-Domain: www.mainurl.com
# First Alias - non-www version of the URL
Alias-Domain: mainurl.com
# Second Alias - .net version of the URL
Alias-Domain: www.mainurl.net
Alias-Domain: mainurl.net
# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl.com
Alias-Domain: aliasurl.com
# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl.net
Alias-Domain: aliasurl.net
# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl-a.com
Alias-Domain: aliasurl-a.com
# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl-a.net
Alias-Domain: aliasurl-a.net
=====================================
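None of this would take much crawler-side work. Here's a rough sketch, in Python, of how a spider might consume a Syntax A file (parse_linkreform is my own invented helper, not part of any spec): it reads the Parent-Domain and Alias-Domain lines and folds every alias into the parent before indexing.
=====================================
def parse_linkreform(text):
    """Parse a Syntax A linkreform.txt into (parent, aliases)."""
    parent, aliases = None, []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "parent-domain":
            parent = value
        elif field == "alias-domain":
            aliases.append(value)
    return parent, aliases

# A spider could then map every alias to the parent before indexing:
with open("linkreform.txt") as f:
    parent, aliases = parse_linkreform(f.read())
canonical = {alias: parent for alias in aliases}
=====================================
One pass over one tiny file, and the engine knows every alias belongs to www.mainurl.com.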
Here's another proposed format that simply states the main URL the spiders should crawl; it lets site owners anonymously own several domains and point them at the same place without giving competitors any information about their aliases.
Possible Syntax B
=====================================
# linkreform.txt for http://www.mainurl.com/
#
# $Id: linkreform.txt,v 1.01b 2004/11/15 01:38:11 jdowdell
#
# Identify Main URLs That Should Be Crawled
#
# Main URL: preferred crawler starting point and the URL
# to be used in search results
Parent-Domain: www.mainurl.com
=====================================
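The crawler-side check for Syntax B is even simpler. A sketch (again hypothetical; canonical_host is an invented name): whatever host the spider happens to be crawling, it credits the content to the declared Parent-Domain, and no alias list is ever exposed.
=====================================
def canonical_host(current_host, linkreform_text):
    """Return the host a spider should credit this site's content to.

    Syntax B declares only Parent-Domain, so any host whose
    linkreform.txt names a different parent is treated as an alias.
    """
    for line in linkreform_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.lower().startswith("parent-domain:"):
            return line.split(":", 1)[1].strip()
    return current_host  # no declaration: index the host as-is

# Crawling www.aliasurl.net, whose file names www.mainurl.com as parent:
print(canonical_host("www.aliasurl.net", "Parent-Domain: www.mainurl.com"))
# -> www.mainurl.com
=====================================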
The easiest implementation would be for the W3C to amend the robots.txt specification and allow the following line to be added to the file.
Possible Syntax C
=====================================
# robots.txt for http://www.mainurl.com/
#
# $Id: robots.txt,v 1.01b 2004/11/15 01:41:23 jdowdell
#
# Main URL: preferred crawler starting point and the URL
# to be used in search results
User-agent: *
URL-to-crawl: www.mainurl.com
=====================================
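Honoring the new field would be a small patch to any existing robots.txt parser. A sketch (hypothetical; URL-to-crawl is the proposed directive, not part of today's spec):
=====================================
def url_to_crawl(robots_txt, user_agent="*"):
    """Extract the proposed URL-to-crawl field for a user-agent record."""
    active = False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            active = value in ("*", user_agent)
        elif field == "url-to-crawl" and active:
            return value
    return None

print(url_to_crawl("User-agent: *\nURL-to-crawl: www.mainurl.com"))
# -> www.mainurl.com
=====================================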
By implementing this new standard we could...