My friend and co-worker John Liu from The World is Meh has just written a great blog entry about Google’s apparent indexing of webpages despite explicit prohibitions in the sites’ robots.txt files. Take a look at his thoughts on the issue.
Is this just a glitch in the Google matrix or the end of the search engine’s gentlemen’s agreement with webmasters? I hope it’s the former, not the latter. I did some testing with my own robots.txt file in the Google Webmaster Tools Robots.txt Analyzer and I was assured by the analyzer that the restricted URLs in the file would not be spidered when the Googlebot visited my site. For the moment, it seems Google is crawling everything and sending pages “blocked” by robots.txt to its notorious “repeat the search with the omitted results included” section of search results.
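For what it’s worth, that same check is easy to reproduce outside Google’s analyzer. Here is a minimal sketch using Python’s standard urllib.robotparser module; example.com and /private/ are placeholders for your own domain and one of the paths your robots.txt disallows:

import urllib.robotparser

# Fetch and parse the live robots.txt file (example.com and /private/ are
# hypothetical; substitute your own domain and a path your file blocks).
parser = urllib.robotparser.RobotFileParser()
parser.set_url("http://example.com/robots.txt")
parser.read()

# A crawler that respects the file should get False for any disallowed path
# and True for everything else.
print(parser.can_fetch("Googlebot", "http://example.com/private/"))
print(parser.can_fetch("Googlebot", "http://example.com/"))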
Geez… Google mistake or not, you have to do better than this. There was a reason I asked you not to spider this file!
Google does not appear to be fully respecting robots.txt right now. I’ve encountered a few cases of this today – including Google’s own Blogger.com.
Checking Blogger’s robots.txt file shows a short list of disallowed URLs:
# robots.txt for http://www.blogger.com
Disallow: /comment.g
Disallow: /email-post.g
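Just to spell out what those two rules mean to a well-behaved crawler, here is a quick sketch with Python’s urllib.robotparser. Note that the “User-agent: *” header is my assumption: the excerpt above omits it, but Disallow lines only take effect inside a User-agent group.

import urllib.robotparser

# The quoted Blogger rules, plus an assumed "User-agent: *" header so the
# Disallow lines attach to all crawlers (the excerpt above omits the header).
rules = """\
User-agent: *
Disallow: /comment.g
Disallow: /email-post.g
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Both blocked paths should be off limits to Googlebot...
print(parser.can_fetch("Googlebot", "http://www.blogger.com/comment.g"))     # False
print(parser.can_fetch("Googlebot", "http://www.blogger.com/email-post.g"))  # False

# ...while anything not listed remains crawlable.
print(parser.can_fetch("Googlebot", "http://www.blogger.com/"))              # True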
However, a Google search using the query “site:blogger.com profile find” returns the following:

[Screenshot of the Google search results]
As you can see, the first result returned is exactly the disallowed URL. Note that it is indexed, but is apparently not cached – there is no search listing snippet.
Although the page is not being cached, the fact that it is being indexed at all shows that Google is not fully respecting robots.txt. This seems to be a recent development, and hopefully it is just a bug that will soon be patched, rather than a change in Google’s behavior.