
Google+ Overrides Site Restrictions

Brandt Dainow


Many people have been complaining recently that Google has indexed pages which were explicitly forbidden.  There are two methods for telling a search engine to avoid a page - a Disallow rule in the site's robots.txt file, or a NOINDEX robots meta tag in the page's HTML.  It seems Google can ignore your guidance and list the pages anyway.
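For readers who have not used either mechanism, here is a minimal sketch in Python of how the two restrictions can be checked.  The robots.txt rules, the example.com URLs and the helper names are all hypothetical, chosen only for illustration; the sketch relies on nothing beyond the Python standard library.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /investors/
"""

# Hypothetical page markup carrying a NOINDEX robots meta tag.
PAGE_HTML = (
    '<html><head><meta name="robots" content="noindex"></head>'
    "<body></body></html>"
)


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            if "noindex" in attr.get("content", "").lower():
                self.noindex = True


def blocked_by_robots_txt(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt rules disallow this URL for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def has_noindex(html):
    """Return True if the page carries a NOINDEX robots meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```

A well-behaved crawler would stay away from any URL for which `blocked_by_robots_txt` returns True, and would keep any page for which `has_noindex` returns True out of its listings.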

Google have stated:

"When you add the +1 button to a page, Google assumes that you want that page to be publicly available and visible in Google Search results. As a result, we may fetch and show that page even if it is disallowed in robots.txt." - http://www.google.com/support/webmasters/bin/answer.py?answer=1140194


At first glance, this may seem logical: why would you put a +1 button on a page if you didn't want people to find it in a search engine?  However, the number of complaints indicates that there are cases where you want people to be able to +1 a page, but not see that page listed in Google.  The reasons why you might want this are many - you may want to show a page to some of your +1 friends, but not the general public; there may be several pages with duplicate content and you're trying to control which gets listed in Google; junior content creators may dump the +1 button in without realising the page is restricted.

I recently saw a very serious case like this.  A client had a number of pages in their private investor relations section detailing a forthcoming merger, something they desperately needed to keep quiet until the deal was complete.  Having these pages behind a login, with NOINDEX tags, and with a robots.txt file restricting access, they assumed they would be safe from public scrutiny.  A minor glitch in changing a content template meant that these pages acquired a +1 button, and Google promptly listed the pages.  Because these pages required a login, you couldn't read the details, but the mere fact you could see their titles in Google was enough to alert the markets to the coming merger, with serious consequences.
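A template glitch of this kind can be caught with a simple audit before pages go live.  The following is a rough sketch only: it assumes the +1 button appears in the page either as a `<g:plusone>` tag or as an element with the class `g-plusone`, and the class and function names are my own invention for illustration.

```python
from html.parser import HTMLParser


class PlusOneAudit(HTMLParser):
    """Notes whether a page has a NOINDEX robots meta tag and a +1 button."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.has_plus_one = False

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            if "noindex" in attr.get("content", "").lower():
                self.noindex = True
        # Assumed +1 markup: <g:plusone> or any element with class "g-plusone".
        if tag == "g:plusone" or "g-plusone" in attr.get("class", "").split():
            self.has_plus_one = True


def conflicting_page(html):
    """Return True if a page marked NOINDEX also carries a +1 button."""
    audit = PlusOneAudit()
    audit.feed(html)
    return audit.noindex and audit.has_plus_one
```

Running `conflicting_page` over the rendered HTML of every restricted page after a template change would flag any page where the two signals contradict each other.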


One of the stated USPs of Google+ was to offer better privacy than Facebook.  Putting a +1 button on a page does not mean I want Google to list it in their search engine; it means I want people to be able to list it in their +1 page.  When Google assume that everything we +1 should be open and public, they take lack of privacy much further than Facebook ever have.  At least when Facebook dig into people's private data, they restrict it to themselves and their advertising clients.  Google just give it to everyone.

This also demonstrates a complete disregard for internet standards.  The internet is composed of thousands of products from thousands of companies.  The only reason the web works is because everyone abides by a common set of core standards.  Once companies start deciding their innovations can ignore these standards, the internet starts to break down.  These standards are not a block on innovation, nor are they restrictive or fixed forever.  If companies want their innovations to have a place in the internet, or if they're unhappy with the existing standards, there are established mechanisms for changing things via bodies such as IETF, OASIS, and W3C.  If Google think +1 should be able to override robots.txt, they should submit a proposal to IETF, and let it run the course.  IETF is possibly the most democratic institution on the planet, and the fairest venue for any new technical proposal.  Any standard from IETF has universal acceptance.

The standards are how we all understand how everything works.  Google's announcement about +1 overriding robots.txt is buried in a minor FAQ.  How is anyone supposed to know?  I am certain most people don't, because there is a vast amount of discussion going on at the moment as people try to work out why their restricted content suddenly appeared in Google.  If Google had put this through the standards bodies, we would have all known in advance.

The reality is that if Google had tried to create an internet standard stating +1 buttons override robots.txt, we all know it would have failed.  No matter how important Google thinks it is, +1 is of little import compared to core internet standards.  However, I doubt that was why they didn't try.  I suspect the real reason is they simply didn't think.  To me this suggests Google don't get it - they don't understand what the web is about.

When Google decide that +1 overrides robots.txt, they say "get stuffed to all of you, we're more important than the rest of the world."

When Tim Berners-Lee invented HTML, he chose not to patent it, or sell it to any vendor.  His reason was that only open, non-proprietary systems could create the connected world we now know as the World Wide Web.  If vendors did their own thing, we'd merely end up with a bunch of isolated competing systems which don't interact.  Adherence to the standards of the internet is critical, and no-one, not even The Great Google, has any business breaking those standards.

Get it right, Google - restrictions in robots.txt are absolute, and a search engine should know better.

Brandt is an independent web analyst, researcher and academic.  As a web analyst, he specialises in building bespoke (or customised) web analytic reporting systems.  This can range from building a customised report format to creating an...
