
Search Engine Spiders Lost Without Guidance - Post This Sign!


Robots.txt Signpost Warns Trespassers From Private Property

The robots.txt file implements a voluntary exclusion standard recognized by web crawlers/robots, telling them which files and directories you want them to stay OUT of on your site. Not all crawlers/bots follow the exclusion standard; some will continue crawling your site anyway. I like to call them "Bad Bots" or trespassers. We block them by IP exclusion, which is another story entirely.

This is a very simple overview of robots.txt basics for webmasters. For a complete and thorough lesson, visit http://www.robotstxt.org/

To see the proper format for a somewhat standard robots.txt file, look directly below. That file must sit at the root of the domain, because that is where the crawlers expect it to be, not in some secondary directory.

Below is the proper format for a robots.txt file ----->

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

--------> End of robots.txt file

This tiny text file is saved as a plain text document and ALWAYS with the name "robots.txt" in the root of your domain.

A quick review of the information in the robots.txt file above follows. The "User-agent: msnbot" entry addresses MSN's crawler, Slurp is Yahoo's and Teoma is AskJeeves'. The others listed are "Bad" bots that crawl very fast and to nobody's benefit but their own, so we ask them to stay out entirely. The asterisk (*) is a wildcard meaning "all" crawlers/spiders/bots, including Googlebot, should stay out of the listed files and directories. The bots given the instruction "Disallow: /" are asked to stay out entirely, and those given "Crawl-delay: 10" are the ones that crawled our site too quickly, bogging it down and overusing server resources. Google crawls more slowly than the others and doesn't require that instruction, so it is not specifically listed in the above robots.txt file. A Crawl-delay instruction is generally only needed on very large sites with hundreds or thousands of pages.

The bots we gave the "Crawl-delay: 10" instruction were requesting as many as 7 pages every second, so we asked them to slow down. The number is in seconds, and you can change it to suit your server capacity, based on each bot's crawling rate. Ten seconds between page requests is far more leisurely and stops them from asking for more pages than your server can dish up.

(You can discover how fast robots and spiders are crawling by looking at your raw server logs, which show pages requested at precise times to within a hundredth of a second; they are available from your web host, or ask your web or IT person. If you have server access, your server logs can be found in the root directory, and you can usually download compressed server log files by calendar day right off your server. You'll need a utility that can expand compressed files to open and read those plain-text raw server logs.)

To see the contents of any robots.txt file, just type /robots.txt after any domain name. If a site has that file up, you will see it displayed as a text file in your web browser. For example, Amazon.com's file is at http://www.amazon.com/robots.txt
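You can also check these rules programmatically. Below is a minimal sketch, assuming Python 3.6 or newer, that feeds an abbreviated copy of the sample file above to the standard library's urllib.robotparser module and asks which fetches a compliant crawler would allow; against a live site you would call set_url() and read() instead of parse().

from urllib.robotparser import RobotFileParser

# An abbreviated copy of the sample robots.txt from above.
SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: BecomeBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(SAMPLE.splitlines())

# The * wildcard rules apply to every compliant bot, Googlebot included.
print(rp.can_fetch("Googlebot", "/cgi-bin/search"))  # False - disallowed
print(rp.can_fetch("Googlebot", "/index.html"))      # True - not disallowed

# A bot told "Disallow: /" is asked to stay out entirely.
print(rp.can_fetch("BecomeBot", "/index.html"))      # False

# crawl_delay() reports the per-bot delay in seconds.
print(rp.crawl_delay("msnbot"))                      # 10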
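And before adding a Crawl-delay rule of your own, you can put a number on a bot's actual request rate with a few lines of script run against those raw logs. The sketch below is only illustrative: it assumes the common Apache/NCSA "combined" log format and a file named access.log, so adjust the pattern and filename to whatever your host actually provides.

import re
from collections import Counter

# Pull the timestamp and user-agent out of each "combined" format line.
LINE = re.compile(r'\[(?P<time>[^\]]+)\] "[A-Z]+ [^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"')

hits = Counter()  # (user-agent, timestamp to the second) -> request count
with open("access.log") as log:  # for gzipped logs, use gzip.open("access.log.gz", "rt")
    for raw in log:
        match = LINE.search(raw)
        if match:
            hits[(match.group("agent"), match.group("time"))] += 1

# The busiest single seconds; a bot that lands here repeatedly is a
# candidate for Crawl-delay, or for Disallow if it is a bad bot.
for (agent, when), count in hits.most_common(5):
    print(count, "requests at", when, "from", agent)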
