Of the major disciplines in SEO, architecture is perhaps the most important of all. This is because if you don’t take Google’s crawling behavior into account and Google can’t properly spider your website and discover your content, then it won’t matter whether it’s the right content, whether it’s optimized, or whether you have obtained any links to it. It may as well not exist if Google can’t discover it.
You’re probably familiar with Google’s spider program, Googlebot, but there are many other ways that Google visits your website. This posting will attempt to pull them all together, and raise some open questions about Google crawling behavior for the community, because some aspects of Google’s numerous spiders are still relatively mysterious.
Google Crawling Behavior Problem: Whitelisting User-Agents
A client of mine had some issues recently with Google Instant Preview; they had been whitelisting browsers based on user-agent detection, and I was surprised to see that at some point, Google’s bot that grabs the preview of their web pages must have run afoul of the whitelist.
Their SERP preview images clearly showed a message in the image “Browser not supported”; this was the message displayed by their website, which had to have been triggered by Google’s bot. Even worse, Google Instant Preview was highlighting this large message as the key text on the page. This lends some credence, in my opinion, to Joshua Giardino’s theory that Google is using Chrome as a rendering front-end to their spiders – and it could simply be that for a short time, the Instant Preview bot’s user-agent was reporting something that was not on my client’s whitelist.
What is This Google Instant Preview Crawler Anyway?
Digging into this, I found that Google Instant Preview (although according to Google it is essentially Googlebot) reports a different user-agent, and in addition to running automatically, it appears to update (or schedule an update perhaps) on demand when users request previews. I knew Google had a separate crawler for Google News and a few other things, but had never really thought much about Google crawling behavior as it relates to Instant Preview. To me, if something is reporting a separate user-agent and is behaving differently, then it merits classification as a separate spider. In fact, Google could easily just have one massive program with numerous if-then statements at the start and technically call it one crawler, but the essential functions would still be somewhat different.
This got me wondering…just how many ways does Google crawl your website?
“How do I crawl thee? Let me count the ways…”
It turns out – quite a few. Google lists around 8 or 9 in a few places itself, but I was able to identify *16* different ways Google could potentially grab content from your website, with a variety of different user agents and behaviors.
Arguably, the single-page fetch mechanisms are not technically complete “crawlers” or “spiders”, but they do register in your web server log with a user-agent or referrer string, they pull data from your website, and in my opinion they can be considered at least partially human-powered crawlers, since humans are initiating the fetch on demand.
Take “Fetch as Googlebot” for instance – even though it returns the same user-agent as Googlebot, it appears to be a different program, by Google’s own admission:
“While the tool reflects what Googlebot sees, there may be some differences. For example, Fetch as Googlebot does not follow redirects”
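Since Fetch as Googlebot stops at the first redirect rather than following it, it can be handy to check your redirect targets yourself. Here’s a minimal Python sketch of that idea (the function name and test setup are my own, not anything Google provides): it fetches a URL without following redirects and reports the status code and the Location header, which is roughly what the tool would show you.

```python
import http.client
from urllib.parse import urlparse

def fetch_no_redirect(url):
    """Fetch a URL without following redirects, similar in spirit to
    Fetch as Googlebot, which reports a redirect rather than chasing it.
    Returns (status_code, location_header_or_None)."""
    parsed = urlparse(url)
    conn_cls = (http.client.HTTPSConnection if parsed.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parsed.netloc)
    conn.request("GET", parsed.path or "/",
                 headers={"User-Agent": "redirect-checker"})
    resp = conn.getresponse()
    resp.read()  # drain the body so the connection can be reused/closed cleanly
    conn.close()
    return resp.status, resp.getheader("Location")
```

Running this against a page that 301s will show you the status and target in one hop, without the client silently following the chain the way a normal browser library would.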
One could argue that Google Translate is more of a proxy than a spider, but Google is fetching your web page – who knows what it is doing with it? Google clearly states that the translated content will not be indexed, but that doesn’t mean the retrieved web page isn’t utilized in some way – perhaps even re-indexed in Google’s main index, faster than it would have been otherwise.
Table 1 details what I found. For “SERP Change Timeframe”, items in quotes are from various Google statements and others are rough estimates from my own experience:
The links referencing background information on each of these are below in Table 2, in case you’re interested in drilling into any particular area of Google crawling behavior. It’s important to note that Video, Mobile, News, and Images all have their own sitemap formats. If you have significant amounts of content in these areas (i.e. on the order of thousands of images or hundreds of videos), it’s advisable to create separate sitemaps to disclose that content to Google’s crawlers in the most efficient way possible, rather than relying on having them discover content on their own:
Table 2 – Background Information
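To make the specialized-sitemap idea concrete, here’s a short Python sketch that builds an image sitemap using Google’s published image-sitemap extension namespace alongside the standard sitemaps.org namespace. The function name and example URLs are hypothetical; the namespaces are the real published ones, but double-check Google’s current documentation before relying on them.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

def build_image_sitemap(pages):
    """pages: dict mapping a page URL -> list of image URLs on that page.
    Returns the sitemap as an XML string."""
    ET.register_namespace("", SITEMAP_NS)       # default ns for <urlset>, <url>, <loc>
    ET.register_namespace("image", IMAGE_NS)    # image: prefix for the extension
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for img_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = img_url
    return ET.tostring(urlset, encoding="unicode")
```

The same pattern applies to the video and news sitemap extensions – you only swap the extension namespace and its child elements.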
Some Interesting Learnings I Took Away From This Research
1.) You Can Easily Provide Your Website in Other Languages
Totally incidental to this posting but *really* neat – in researching the Google Translate behavior, I discovered that Google Translate can provide you widget code to put on your blog – look at the rightmost part of the TOP-NAV menu – Coconut Headphones is now available in other languages! You can get the code snippet for this here if you’d like to do the same:
2.) Whitelisting User-Agents is Probably a *Bad* Idea
One takeaway is that, if you think about it, whitelisting is a *very* bad idea, although it’s easy to see why people are often tempted to do so. One might use whitelists with the intent of excluding old versions of IE, or perhaps a few difficult mobile clients. One could also easily envision large-company QA groups arbitrarily imposing whitelists based on their testing schedule (i.e. for CYA purposes – since it’s impossible to truly test all browsers).
But I think it’s *far* better to blacklist user-agents than to whitelist them, simply because it’s impossible to predict what Google will use for user-agent strings in the future. Google could change its user-agent strings or crawling behavior at any time, or come out with a new type of crawler, and you wouldn’t want to be left behind when that happens.
It’s important to “future-proof” your website against Google changes – algorithmic and otherwise, so I think on the whole blacklisting is a wiser approach.
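To make the difference concrete, here’s a minimal Python sketch of the blacklist approach; the UA fragments are hypothetical placeholders for whatever clients you actually need to block. The key property is that an unknown future Google crawler falls through to “supported”, whereas a whitelist check would have rejected it – exactly the failure my client hit with Instant Preview.

```python
# Hypothetical fragments of user-agent strings to block; adjust to the
# specific old browsers or problem clients you actually need to exclude.
BLOCKED_UA_FRAGMENTS = ["MSIE 6.0", "MSIE 7.0"]

def browser_supported(user_agent: str) -> bool:
    """Blacklist approach: every client is supported unless explicitly
    blocked. A new or changed Google crawler user-agent that matches no
    blocked fragment is served normally instead of getting an error page."""
    return not any(fragment in user_agent for fragment in BLOCKED_UA_FRAGMENTS)
```

Invert the logic (return True only for a list of known-good strings) and you have the whitelist version – and every crawler user-agent you didn’t anticipate gets your “Browser not supported” page, which is precisely what ends up in the SERP preview.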
Some Interesting Questions Still Unanswered
1.) Is it possible to “game” Google using this information?
…by using the on-demand services to get indexed more quickly? I have seen *many* SEO bloggers talking about how Fetch as Googlebot will speed your indexing, but I have found no evidence of anyone who actually performed an A/B test on it with two new sites or pages. What about Instant Preview, Google Cache, and Submit a URL? Does submitting a single URL to Google result in any speedier indexing, or is it only helpful for alerting Google to new websites?
Note: A few days after this article was originally published, Matt Cutts announced new features in “Fetch as Googlebot”; you’re now allowed to instruct Googlebot to more quickly index up to 50 individual pages a week, and 10 pages (plus all the pages they point to) per month. So it appears that rather than having people figure out how to “game” the system, Google has given us a direct line into the indexing functionality. Whether any of the other methods listed above have any effect remains to be seen.
See video here:
2.) How are cached pages retrieved?
…is it similar to Instant Image Previews, where in some cases it’s automatically retrieved based on a crawl, but sometimes it’s initiated on demand when users click on a Cache entry? Is it just Googlebot itself or something else? There is little to nothing on the web or on Google’s support pages about this, other than people noting or complaining about the listed referrer string in their web server logs, and Google doesn’t seem to acknowledge much about this at all, from what I could find.
You’d think that Google’s function of essentially copying the entire web (1 trillion+ documents) and storing them for display from their own website would be a little better documented, or at least, would have been investigated by the SEO community! (I’m hoping I’ve missed something on this one – straighten me out on this point if you know anything, by all means).
Clearly Google moves, sometimes, in mysterious ways. If you run across any additional ways that Google examines your website, let me know and I’ll update the tables above. Also – if you know of any quality research, or even if you’ve just done some anecdotal tests of your own on any of the issues above regarding Google crawling behavior, by all means – let us know your thoughts below!