Googlebot Gets Candid

Googlebot is like a dream which knows us all , , and soul. Here in this interview, Maile Ohye as the website and Jeremy Lilley as the Googlebot from

User-Agent: Mediapartners-Google

Or for image search:

User-Agent: Googlebot-Image/1.0

Wireless fetches often have carrier-specific user agents, whereas Google Reader RSS fetches include extra info such as number of subscribers.

I usually avoid cookies (so no "Cookie:" header) since I don't want the content affected too much by session-specific info. And, if a server uses a session id in a dynamic URL rather than a cookie, I can usually figure this out, so that I don't end up crawling your same page a million times with a million different session ids.

Website: I'm very complex. I have many file types. Your headers say "Accept: */*". Do you index all URLs or are certain file extensions automatically filtered?

Googlebot: That depends on what I'm looking for.

If I'm indexing for regular web search, and I see links to MP3s and videos, I probably won't download those. Similarly, if I see a JPG, I will treat it differently than an HTML or PDF link. For instance, JPG is much less likely to change frequently than HTML, so I will check the JPG for changes less often to save bandwidth. Meanwhile, if I'm looking for links as Google Scholar, I'm going to be far more interested in the PDF article than the JPG file. Downloading doodles (like JPGs) and videos of skateboarding dogs is distracting for a scholar—do you agree?

Website: Yes, they can be distracting. I'm in awe of your dedication. I love doodles (JPGs) and find them hard to resist.

Googlebot: Me, too; I'm not always so scholarly. When I crawl for image search, I'm very interested in JPGs. And for news, I'm mostly looking at HTML and nearby images.

There are also plenty of extensions (exe, dll, zip, dmg…), that tend to be big and less useful for a search engine.

Website: If you saw my URL, http://www.example.com/page1.LOL111, would you (whimper whimper) reject it just because it contains an unknown file extension?

Googlebot: Website, let me give a bit more background. After actually downloading a file, I use the Content-Type header to check whether it really is HTML, an image, text, or something else. If it's a special data type like a PDF file, Word document, or Excel spreadsheet, I'll make sure it's in the valid format and extract the text content. Maybe it has a virus; you never know. If the document or data type is really garbled, there's usually not much to do besides discard the content.

So, if I'm crawling http://www.example.com/page1.LOL111 with an unknown file extension, it's likely that I would start to download it. If I can't figure out the content type from the header, or it's a format that we don't index (e.g. mp3), then it'll be put aside. Otherwise, we proceed indexing the file.

Website: My apologies for scrutinizing your style, Googlebot, but I noticed your Accept-Encoding headers say:

Can you explain these headers to me?

Googlebot: Sure. All major search engines and web browsers support gzip compression for content to save bandwidth. Other entries that you might see here include "x-gzip" (the same as "gzip"), "deflate" (which we also support), and "identity" (none).

Website: Can you talk more about file compression and "Accept-Encoding: gzip,deflate"? Many of my URLs consist of big Flash files and stunning images, not just HTML. Would it help you to crawl faster if I compressed my larger files?

Googlebot: There's not a simple answer to this question. First of all, many file formats, such as swf (Flash), jpg, png, gif, and pdf are already compressed (there are also specialized Flash optimizers).

Website: Perhaps I've been compressing my Flash files and I didn't even know? I'm obviously very efficient.

Googlebot: Both Apache and IIS have options to enable gzip and deflate compression, though there's a CPU cost involved for the bandwidth saved. Typically, it's only enabled for easily compressible text HTML/CSS/PHP content. And it only gets used if the user's browser or I (a search engine crawler) allow it. Personally, I prefer "gzip" over "deflate". Gzip is a slightly more robust encoding — there is consistently a checksum and a full header, giving me less guess-work than with deflate. Otherwise they're very similar compression algorithms.