Search Bots Behaving Badly

The Roots of Bot Misbehavior

Back in December 1998 a newsgroup thread turned into a cautionary tale when a user named Steve posted a scathing review of GoogleBot. He called it “the worst written piece of crapware” that crawls sites, noting that it repeatedly fetches robots.txt but then ignores whatever rules it contains. He added that it also ignores meta tags and headers, gets lost in infinite link trees, and can drain bandwidth for days on end. Steve’s post quickly spread through the web, landing on SEO forums and search‑engine roundtables as an example of what can happen when a crawler isn’t polite.

At the time Google was still in its infancy, and bots behaved less like disciplined robots and more like impatient visitors. A bot that keeps following every link, even through complex dynamic applications, can easily fall into a loop. If a page contains a link that points back to itself, the crawler keeps fetching the same URL over and over, each time adding more requests to the server’s queue. For sites with heavy traffic, this kind of endless recursion can push bandwidth limits and slow down legitimate users.

Fast forward a few years, and a similar pattern emerged with Microsoft’s crawler, msnbot. An account from a web administrator named lundens revealed that the bot had already generated 175,000 hits and consumed 3 GB of data in a single month. The site owner felt the strain was becoming unsustainable. Dan Thies, writing for the SEO RoundTable, echoed the sentiment and advised that if a bot keeps hammering a site, the safest move is to block the known MSNbot IP ranges rather than rely on robots.txt alone. In both cases, the crawlers were not following the polite guidelines set out by the web community, and their actions caused real operational headaches.

These stories underline a fact that still holds today: while search engines claim to use crawlers for indexing, the way they interact with your site matters. If a crawler ignores the rules you set or falls into a link loop, the resulting traffic can be costly. It’s not enough to assume that all bots are well‑behaved; you must verify that they respect your site’s constraints. And if they don’t, you need a plan to stop the unwanted traffic before it erodes your resources.

Turning the Tide: Practical Ways to Keep Bots in Check

The first line of defense is a correctly configured robots.txt file. Place it at the root of your domain and list each user‑agent you want to control. If you only want Google’s crawler to avoid certain directories, add a line such as User-agent: Googlebot followed by Disallow: /private/. Remember that robots.txt is advisory: well‑intentioned crawlers follow it, but misbehaving bots may ignore it entirely.
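As a concrete sketch, a robots.txt along these lines keeps Googlebot out of a private directory while leaving the rest of the site open; the directory name is a placeholder, not a path from the incidents above:

```
# robots.txt served from https://example.com/robots.txt
# Keep Googlebot out of /private/ but allow it everywhere else.
User-agent: Googlebot
Disallow: /private/

# All other crawlers may fetch everything.
User-agent: *
Disallow:
```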

For sites that run dynamic content or heavy AJAX, meta tags and HTTP headers can provide an extra layer of instruction. The meta name="robots" content="noindex, nofollow" tag tells compliant crawlers not to index a page or follow its links. Similarly, sending an X-Robots-Tag: noindex, nofollow header in the response allows you to control indexing on a per‑resource basis without editing HTML. These mechanisms give you finer control than robots.txt alone and are especially useful for content that should never appear in search results.
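For illustration, the two forms look like this; the page and the PDF response shown are generic examples, not anything tied to the bots discussed above. In HTML:

```
<!-- In the <head> of a page compliant crawlers should neither index nor follow links from -->
<meta name="robots" content="noindex, nofollow">
```

And as a response header, which works for non-HTML resources such as PDFs or images:

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, nofollow
```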

When a bot repeatedly fails to respect your directives, the next step is to block its IP address or range. Google’s IP ranges are published on the Google Search Central help site, and Microsoft’s on Bing Webmaster Tools. By adding rules to your .htaccess or web server configuration, you can return a 403 Forbidden status for those addresses. If you notice a bot that is constantly requesting a particular pattern of URLs, you can narrow the block to that path, preventing a single crawler from monopolizing your bandwidth.
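On Apache 2.4, for example, an .htaccess fragment like the one below returns 403 Forbidden for a given address range. The range shown (203.0.113.0/24) is a documentation placeholder; substitute the ranges the search engines actually publish:

```
# .htaccess (Apache 2.4+): deny a misbehaving crawler's address range
<RequireAll>
    Require all granted
    Require not ip 203.0.113.0/24
</RequireAll>
```

Nginx can achieve the same effect with deny directives in the relevant server or location block.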

Monitoring is key. Turn on access logging for your server and parse the logs to spot unusual patterns: a single user‑agent that requests thousands of pages in a short span, or repeated requests to the same resource. Tools like Google Search Console and Bing Webmaster Tools provide crawl statistics that let you see how many requests each bot is sending and whether they are hitting errors. If you spot a bot generating excessive traffic or hitting server errors, you can investigate its behavior in more detail.
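A short script is often enough to surface the worst offenders. The sketch below assumes a combined-format access log at a hypothetical path and simply counts requests per user‑agent so unusually chatty bots stand out:

```python
import re
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"  # assumed path; adjust for your server
# In the combined log format the user-agent is the last quoted field on each line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1

# Print the ten busiest user-agents.
for agent, hits in counts.most_common(10):
    print(f"{hits:8d}  {agent}")
```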

Finally, consider rate limiting at the application level. Many frameworks let you throttle requests based on user‑agent, IP, or request frequency. For example, you could set a rule that any bot is allowed to fetch a maximum of 10 pages per minute. If the bot exceeds that threshold, the server returns a 429 Too Many Requests status. Well‑behaved crawlers back off in response to that signal; those that keep hammering the site can then be blocked outright.
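As one possible sketch, a small Flask hook can enforce such a cap per client address. The 10-requests-per-minute limit and the crude "bot" substring check on the user‑agent are assumptions for illustration, not rules from the original discussion:

```python
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW_SECONDS = 60               # look at the last minute of traffic
MAX_BOT_REQUESTS = 10             # illustrative cap per client per window
recent_hits = defaultdict(deque)  # client address -> timestamps of recent requests


@app.before_request
def throttle_bots():
    user_agent = request.headers.get("User-Agent", "").lower()
    if "bot" not in user_agent:   # crude heuristic: only throttle self-declared bots
        return
    now = time.monotonic()
    hits = recent_hits[request.remote_addr]
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()            # drop timestamps outside the window
    if len(hits) >= MAX_BOT_REQUESTS:
        abort(429)                # 429 Too Many Requests
    hits.append(now)


@app.route("/")
def index():
    return "Hello, crawler."
```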

By combining a solid robots.txt, meta tags, IP blocking, log analysis, and rate limiting, you create a multi‑layered defense that keeps unruly crawlers from draining your resources. The key is to act quickly when you detect abnormal traffic and to keep your directives clear and up to date. Search bots are meant to help you reach audiences; they shouldn’t become unintentional traffic bombs that hurt your site’s performance or cost you money.
