How to not index a URL


<meta name="robot" content="noindex" /> Did you spot it? What looks like an error (and in fact you might argue it is one) is actually accepted by Google and will exclude your HTML page from the search results. A fellow SEO (@termfrequenz) discovered this oddity after wondering why a page didn't show up in the Google search results even though the popular browser plugin Seerobots reported it as indexable. As it turns out, Google accepts robot (singular) as well as robots (plural).


Don't get me wrong: I really love Seerobots. It has a great UI and is really lightweight, but it has an issue with pages that don't follow the robots meta tag specification to the letter, literally. And Seerobots is not alone: most of the crawlers out there, including all of those I tried, deviate from the way Google handles edge cases in the robots meta tag and the X-Robots-Tag HTTP header.

a growing list of edge cases

After the initial report about the "robot" discovery I thought: maybe this isn't the only edge case out there. As it turns out, there are a lot of them. First stop: the Google help pages, which I would recommend every webmaster and SEO read. The Google documentation more or less sticks to the original notes agreed on in 1996 at the Distributed Indexing/Searching Workshop by attendees from Excite, InfoSeek, and Lycos.
If a webpage sticks to the directives documented on the Google help pages and in the original notes, there are many tools to tell you whether your page is eligible for indexing. But the moment you stray off the beaten path (by accident or not), you might find a lot of curious cases.
All in all, I documented a list of 13 different noindex implementations, eight of which I tested with several well-known crawling tools and services (see Google Docs).

Google tries to help you

Since Google does not document this behaviour properly, there is no way to be sure. But it looks like Google tries to honor your wishes even if you did not stick to the specification.
a robot without s
While the meta tag is commonly referred to as the robots meta tag, the name attribute can contain different user-agent values. The specification defines the global default as robots (with an s), but Google also obeys instructions for robot (without the s). Consider the following example: <meta name="robot" content="noindex, follow" /> Google Search Console URL Inspection will declare this Excluded by 'noindex' tag.
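For tool authors, this tolerance is easy to replicate. Below is a minimal sketch in Python (the helper name is mine, not from any real library), assuming a checker only needs to decide whether a meta tag's name attribute addresses all crawlers:

```python
# Names that address every crawler: "robots" per the specification,
# plus the singular "robot", which Google accepts in practice.
ACCEPTED_GLOBAL_NAMES = {"robots", "robot"}

def addresses_all_crawlers(name_attr):
    """Return True if a meta tag's name attribute targets all crawlers."""
    return name_attr.strip().lower() in ACCEPTED_GLOBAL_NAMES
```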
value instead of content attribute
According to the specification, the correct attribute for directives is content, but Google does in fact also honor the value attribute, which is common in other meta tags. Consider the following example: <meta name="robots" value="noindex, follow" /> Google Search Console URL Inspection will declare this Excluded by 'noindex' tag, which is not documented in the specification.
value and content
If for some crazy reason you decide to set up a meta tag with both a value and a content attribute, Google will ignore the value settings and prefer content. Consider the following example: <meta name="robots" value="noindex, follow" content="index, follow" /> Google Search Console URL Inspection will declare this URL can be indexed, which is in line with the specification.
no comma in between directives
According to the specification, multiple directives may be combined in a comma-separated list. Since this isn't considered a must, Google will accept a space-separated list as well. Consider the following example: <meta name="robots" content="noindex follow" /> Google Search Console URL Inspection will declare this Excluded by 'noindex' tag.
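A parser that mimics this leniency can simply treat commas and whitespace alike as separators. A minimal sketch in Python (the function name is hypothetical):

```python
def parse_directives(content):
    """Split a robots content value into lowercase directives, accepting
    both the comma-separated form from the specification and the
    space-separated form that Google tolerates."""
    return {token.strip().lower()
            for token in content.replace(",", " ").split()
            if token.strip()}
```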
misspelling of noindex
Some tools run a simple string search on the robots instructions, which leads to the false detection of a noindex instruction. Consider the following example: <meta name="robots" content="noindexa, follow" /> Google Search Console URL Inspection will declare this URL can be indexed, which is in line with the specification.
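The difference between a naive substring search and an exact token match can be shown in a few lines of Python:

```python
content = "noindexa, follow"

# Naive substring search: falsely detects a noindex directive.
substring_hit = "noindex" in content

# Exact token match: split the value into directives first, then compare.
tokens = {t.strip().lower() for t in content.replace(",", " ").split()}
token_hit = "noindex" in tokens
```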
additional user-agent
While the meta tag is commonly referred to as the robots meta tag, the name attribute can contain different user-agent values. The specification defines the global default as robots, which can be combined with a second meta tag specifically for Googlebot. Some of the tools out there ignore those additional instructions. Consider the following example:
<meta name="robots" content="index, follow" />
<meta name="googlebot" content="noindex" />
Google Search Console URL Inspection will declare this Excluded by 'noindex' tag, which is in line with the specification.
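A checker that handles this case has to merge the generic tag with the crawler-specific one; since noindex always wins over index, it is enough to look for a noindex in any applicable tag. A sketch in Python, modeling the behaviour described above (the function name is mine):

```python
def is_indexable(meta_tags, user_agent="googlebot"):
    """Decide indexability from (name, content) meta tag pairs: generic
    robots tags and tags for the given crawler both apply, and a noindex
    from either one excludes the page."""
    for name, content in meta_tags:
        if name.strip().lower() in ("robots", "robot", user_agent.lower()):
            directives = {t.strip().lower()
                          for t in content.replace(",", " ").split()}
            if "noindex" in directives:
                return False
    return True
```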
UPPERCASE
Some developers or CMSs use uppercase spelling in their HTML. This really doesn't matter: Google, and probably all tools, simply don't care. Consider the following example: <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW" /> Google Search Console URL Inspection will declare this Excluded by 'noindex' tag, which is in line with the specification.
noindex meta tag in body
This one was a real surprise to me. Meta tags in the body of the HTML are not ignored by Google. Therefore a noindex meta tag in the body does prevent indexing of the page. Google Search Console URL Inspection will declare this Excluded by 'noindex' tag, which seems a little strange but still works.
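This means a checker must not stop parsing at the end of the <head>. A minimal sketch using Python's standard html.parser, scanning the whole document (the class name is my own):

```python
from html.parser import HTMLParser

class RobotsMetaScanner(HTMLParser):
    """Collect robots directives from meta tags anywhere in the document,
    since Google also honors a noindex meta tag placed in the <body>."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").strip().lower() in ("robots", "robot"):
            content = attrs.get("content") or ""
            self.directives |= {t.strip().lower()
                                for t in content.replace(",", " ").split()}

scanner = RobotsMetaScanner()
scanner.feed("<html><head></head><body>"
             '<meta name="robots" content="noindex" /></body></html>')
```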
unavailable_after meta tag
I was really hoping this would work with Google (it does not). The robots specification and the Google documentation allow for an expiration date in the meta tag. Consider the following example: <meta name="robots" content="unavailable_after: 1 Jan 1970 00:00:00 GMT" /> As it turns out, Google Search Console URL Inspection will declare this URL can be indexed, and it does get indexed - which is probably an error.

additional fun with http-headers

Some of you might know that not only an HTML tag but also an HTTP header can be used to instruct Google and other search engines: the X-Robots-Tag.
the X-Robots-Tag HTTP header contradicting the meta tag
An HTTP response with an X-Robots-Tag instructing crawlers not to index a page can be in direct contradiction to the meta tag on the page. In those cases crawlers are supposed to follow the stricter noindex directive. Consider the following example:
<meta name="robots" content="index, follow" />
X-Robots-Tag: noindex
Google Search Console URL Inspection will declare this Excluded by 'noindex' tag. This is correct and expected behaviour, since the strictest directive, noindex, is applied.
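Strictest-wins is straightforward to model: collect the directives from both sources and let any noindex exclude the page. A sketch in Python (the function name is mine):

```python
def page_is_indexable(meta_content, x_robots_tag=None):
    """Apply the strictest-wins rule: a noindex directive in either the
    robots meta tag or the X-Robots-Tag HTTP header excludes the page."""
    sources = [meta_content]
    if x_robots_tag is not None:
        sources.append(x_robots_tag)
    for value in sources:
        tokens = {t.strip().lower() for t in value.replace(",", " ").split()}
        if "noindex" in tokens:
            return False
    return True
```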
the X-Robots HTTP header
According to the specification, the HTTP header for deploying a noindex is called X-Robots-Tag, but it is sometimes incorrectly referred to as an X-Robots header. Consider the following example:
<meta name="robots" content="index, follow" />
X-Robots: noindex
Google Search Console URL Inspection will declare this URL can be indexed, which is in line with the specification: the unknown X-Robots header is simply ignored.
unavailable_after X-Robots-Tag
The robots specification allows for an expiration date in the X-Robots-Tag HTTP header. Consider the following example: X-Robots-Tag: unavailable_after: 1 Jan 1970 00:00:00 GMT Google Search Console URL Inspection will declare this URL can be indexed - again, as with the HTML tag, a behaviour I did not expect.
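Tools that do want to honor unavailable_after can parse the date with the standard library and compare it against the current time. A sketch, assuming the RFC 822-style date format from the example above (the function name is hypothetical):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def unavailable_after_expired(header_value, now=None):
    """Given a value like 'unavailable_after: 1 Jan 1970 00:00:00 GMT',
    return True if the date lies in the past, i.e. the page should be
    treated as no longer indexable."""
    prefix = "unavailable_after:"
    value = header_value.strip()
    if not value.lower().startswith(prefix):
        return False
    deadline = parsedate_to_datetime(value[len(prefix):].strip())
    if deadline.tzinfo is None:
        deadline = deadline.replace(tzinfo=timezone.utc)
    return deadline <= (now or datetime.now(timezone.utc))
```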

double check if you want to be sure

The new Google Search Console has a nifty feature called URL Inspection. It basically allows you to look up whether a page is already in the Google index, or at least could be. To the best of my knowledge it works exactly like the real indexer and is therefore the most reliable tool; everyone should know it.
After testing and documenting several cases I reached out to all the vendors of well-known crawling tools - all of whom were really helpful. Because, as it turns out, there is a difference between what URL Inspection reports and what the tools report. The most common explanation is that the vendor sticks to the original robots specification rather than Google's interpretation, because Google isn't the only search engine out there and Google might change its behaviour without notice. And while this is coherent in itself, it creates an interesting dilemma for webmasters who rely on Google.
So after some extensive testing and experimenting I would recommend one thing above all: do double-check with the search engine you want your pages to be listed in, and do not rely on the specifications alone. And if you use a crawling service or tool - which I also strongly recommend - you should know what its limits are.

recommendation

If you are serious about optimizing your webpage, you need a tool to help you keep track of all the issues a website can have. While my experiment showed a lot of interesting edge cases, there are far more common issues which arise day by day and which a good tool can help you with. Find the tool which suits your use case best and try to understand its strengths and limits.
This is a list of all the tools in my experiment which you might want to have a look at (in alphabetical order):