Controlling Search Engine Spiders

Managing Robots with robots.txt

When you build a website, you often have pages that are still in draft, duplicate pages, or sections that just don’t fit the overall theme of the site. Search engines crawl everything by default, which can waste crawl budget and expose unfinished content. The quickest way to keep bots out of the places you don’t want them is the robots.txt file. It lives in the root folder of your domain (for example, https://www.yoursite.com/robots.txt) and is fetched by any well‑behaved crawler before it starts requesting your pages. Below is a practical guide to writing and fine‑tuning a robots.txt file that will keep unwanted bots at bay and give you more control over how search engines view your site.

At its core, a robots.txt file is a plain‑text document that contains rules in a very simple syntax. The format looks like this:

User-agent: *
Disallow:

The first line declares the crawler the rule applies to. The asterisk (*) is a wildcard that matches every user agent. The second line tells the crawler what URLs it cannot visit. An empty Disallow means “allow everything.” That file, as it stands, is effectively a no‑op – it tells every crawler that all pages are fine to index.

To restrict a specific part of your site, such as the FAQ section, add a path after Disallow. The trailing slash is important; it tells the crawler that you’re referring to a directory, not a single file. For instance, to block everything under /faq/ you’d write:

User-agent: *
Disallow: /faq/

This rule is short, readable, and covers all sub‑pages inside the FAQ directory. You can stack multiple Disallow lines to block several directories in the same block. Just list each path on a new line:

User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /info/about/

Sometimes you only want to hide a single file, such as a splash page that is not yet ready for public view. In that case, give the file’s full path from the site root. The path still starts with a leading slash, but there is no trailing slash, because you’re naming a single file rather than a directory:

User-agent: *
Disallow: /about.html
Disallow: /faq/faqs.html

While the previous examples apply to all crawlers, you might want to target a specific search engine bot. Every major search engine uses a distinct user‑agent string that you can match against. For example, Google’s crawler is known as Googlebot. To block Googlebot from accessing the FAQ while leaving other bots unaffected, write:

User-agent: Googlebot
Disallow: /faq/

You can combine a general rule with a more restrictive rule for a particular bot by giving each its own block; by convention, the specific bot’s block comes first. A crawler obeys only the block whose User-agent line most specifically matches it, so Googlebot follows its own rules and ignores the general ones. Here’s an example where Googlebot is blocked from the entire site, while all other bots are blocked only from the FAQ directory:

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /faq/

Notice the blank line between the two blocks. The robots.txt convention uses a blank line to separate records, and some parsers will merge adjacent groups without it, applying both sets of rules to both user agents. Keep the separator in place so every crawler reads the file as intended.
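If you want to sanity‑check how a crawler will interpret these groups before publishing them, you can test the rules offline. Here is a minimal sketch using Python’s standard‑library urllib.robotparser; the bot name SomeOtherBot is just an illustrative stand‑in for any non‑Google crawler:

from urllib import robotparser

# The combined rules from the example above, one line per list entry.
rules = [
    "User-agent: Googlebot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /faq/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot matches its own group, so the whole site is off limits to it.
print(rp.can_fetch("Googlebot", "/index.html"))       # False

# Any other bot falls back to the * group: FAQ blocked, the rest allowed.
print(rp.can_fetch("SomeOtherBot", "/index.html"))    # True
print(rp.can_fetch("SomeOtherBot", "/faq/faqs.html")) # False

Running this prints False, True, False, confirming that each crawler obeys only the group that matches it.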

Comments are handy for future reference. Anything after a hash (#) is ignored by crawlers and can be used to describe the purpose of each rule. A typical commented file might look like this:

# Disallow all bots from the FAQ section
User-agent: *
Disallow: /faq/

Since robots.txt is just a text file, you can edit it with any plain‑text editor that preserves UTF‑8 encoding. On Windows, Notepad works fine. On macOS or Linux, the default editors or third‑party options like Sublime Text or Visual Studio Code are excellent choices. Avoid rich‑text editors that might embed hidden formatting characters, as those can confuse the crawler.
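After you upload the file, it’s worth confirming that crawlers actually see what you expect. The sketch below, again using Python’s urllib.robotparser and the placeholder domain from earlier (not a real site), downloads the live file and tests a path:

from urllib import robotparser

# Point the parser at the live file; the domain is the placeholder used above.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.yoursite.com/robots.txt")
rp.read()  # fetches and parses the file over HTTP

# Ask whether a generic crawler may request the FAQ section.
if rp.can_fetch("*", "https://www.yoursite.com/faq/"):
    print("FAQ is crawlable")
else:
    print("FAQ is blocked")

If the rules from the earlier examples are in place, this prints "FAQ is blocked".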
