11 responses

  1. Adam Audette
    February 21, 2012

    Good stuff, Ted. Good timing, too. We recently debugged a weird issue with Adwords instant previews for our client National Geographic and found the culprit was “Google Web Preview.” More here: http://www.rimmkaufman.com/blog/google-preview-user-agent-redirection-curl-command-user-agent/09022012/

    Until now, we were the only ones who'd done any research on this. Thank you for adding even more insight here.


  2. IncrediBILL
    February 22, 2012

    Whitelisting is a *GREAT* idea, far from bad. I've been doing it since 2006 with no backlash whatsoever. You just need to monitor activity and add to your whitelist now and then, which is no different from monitoring activity to add to your blacklist. Blacklists are the bad idea: you can never protect your content or server resources if you wait until after the damage has happened. Blacklisting is REACTIVE, whereas whitelisting is PROACTIVE and stops the problems before they start.


  3. Ted Ives
    February 22, 2012

    Ah, you’re the CrawlWall guy, I ran across your website recently.

    I wonder whether it's wise, though, since Google clearly tries to figure out whether you're presenting different content to users than to Googlebot; I would think it must have a special version of Googlebot that quietly goes in and pretends to be a normal user, to double-check things.

    Whitelisting could perhaps risk blocking that (presumably with some sort of penalty resulting). Then again, the Fantomaster guy has been around for a long time, maybe there is something to it…

    Feel free to drop a commercial-sounding pitch in a comment here, because I for one certainly don’t “get it” and could probably use some “edumication” on this front.


  4. IncrediBILL
    February 22, 2012

    You could be right; I never claimed to know everything Google does, but I've never seen anyone penalized for whitelisting *yet*.

    Maybe when tens of thousands of sites, big and small, are all whitelisting, we'll find out for sure what Google truly does when push comes to shove, because it'll become more obvious.

    However, the way I whitelist is to validate the things claiming to be spiders. Other things coming from Google (and there are proxy services involved) are treated on a case-by-case basis as browser or crawler, so I'm not treating everything exactly the same, which could mean avoiding a penalty by blocking the wrong thing.

    Besides, wouldn't you think that if Google were truly checking for bad behavior, they wouldn't use actual Google IPs? I would think Google would check from outside their official network just so it wouldn't be so obvious.
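
    [Ed. note: the "validate the things claiming to be spiders" step described above is typically done with a reverse-plus-forward DNS check, which Google itself documents as the way to verify a real Googlebot. A minimal sketch, with function names of my own choosing:]

    ```python
    import socket

    GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

    def hostname_is_google(host: str) -> bool:
        # A genuine Google crawler's reverse-DNS name ends in
        # googlebot.com or google.com.
        return host.rstrip(".").endswith(GOOGLE_SUFFIXES)

    def is_verified_googlebot(ip: str) -> bool:
        """Reverse-resolve the IP, check the domain, then forward-confirm.

        A spoofed user agent fails here: either its IP reverse-resolves to a
        non-Google hostname, or the forward lookup of that hostname does not
        round-trip back to the same IP.
        """
        try:
            host, _, _ = socket.gethostbyaddr(ip)           # reverse lookup
        except OSError:
            return False
        if not hostname_is_google(host):
            return False
        try:
            _, _, forward_ips = socket.gethostbyname_ex(host)  # forward lookup
        except OSError:
            return False
        return ip in forward_ips                            # must round-trip
    ```

    [This is a sketch, not a drop-in: in production you'd cache the DNS results per IP rather than resolving on every request.]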


  5. Yousaf
    March 9, 2012

    This is a brilliant post! Very insightful, and to be honest, if I were Google I would use Chrome as a renderer. It makes complete sense, especially if you take their latest “layout” signal into account.


  6. Ben Acheson
    March 26, 2012

    Thanks Ted, this is a fantastic post. Google often experiments and mixes things up, especially for mobile devices, etc.

    You have to think carefully about how you treat user agents and the best word to keep in mind is “test!”


  7. Elham
    February 13, 2016

    Recently, we have experienced some pretty scary crawl errors in our Webmaster Tools reports. The crawl errors have been fluctuating from 10,000 to 129,000. We feel helpless because most of the errors are attributable to pages and even directories that Google’s bot recognizes even though they do not exist on our server. We contacted our hosting company thinking that we had been hacked, but they claim otherwise. It’s like some alien spacecraft has been generating pages and directories for our website that only Google’s crawler can see, and because these pages do not exist, a crawl error is generated.

    Another crawl error we have been having is with a page that we removed from our website. It was a dynamic page that accepted keyword queries. The removed page still receives thousands of hits from “scrapers” with keywords in Japanese and Swedish. When a hit is made to a removed page, we get a crawl error. These “scrapers” cloak their IP addresses, and our Webmaster Tools reports the sources of these hits as “unavailable,” so we cannot identify and control those hits by any means available to us.

    In both of these cases, we are helpless to correct these crawl errors, yet Google may be penalizing us for them. We wonder if anyone else has been experiencing similar situations. Thanks much. Mike

