Case study: Fixing “Indexed, though blocked by robots.txt”

When we decided to work on the “Indexed, though blocked by robots.txt” issue in Google Search Console for zamnesia.com, I didn’t think we would get a clean sheet, but here we are:

From almost 3k “Indexed, though blocked by robots.txt” URLs to zero!
We actually started with 6.5k URLs, but I don’t have a screenshot to prove it 😉

In this case study, I discuss the different URL types we came across in the “Indexed, though blocked by robots.txt” report and how we tackled each one of them.

I will also explain why blocking URLs via robots.txt is not always a good idea and I will present alternatives to blocking URLs via robots.txt.

Let’s be clear: Having zero “affected pages” in the “Indexed, though blocked by robots.txt” report is not an achievement in itself, as this could be pulled off by simply not blocking any URLs at all in the robots.txt file: URLs that are not blocked by robots.txt cannot be “Indexed, though blocked by robots.txt”.

The interesting part of this case study is what we did instead of blocking URLs via robots.txt.

In addition to the above, here are some more things that you will learn in this article:

  • Blocking URLs via robots.txt is not the only solution for saving crawl resources and it does not prevent URLs from being indexed.
  • A URL that redirects to a “noindex” URL is considered a “noindex” URL by Google.
  • Redirected URLs are considered ‘not indexed’ by Google, and fetching a URL that returns a 30X status code does not require many resources on Google’s end.
  • Internal links can be masked, so that they continue to work for users, but are not followed by Google.

Still interested? Let’s dive right in!

An e-commerce site with typical e-commerce SEO challenges

Zamnesia is an international e-commerce website with thousands of URLs, and it faces many of the technical SEO issues that are typical for big online shops.

Here is a list of some of the URL types that were showing up as “Indexed, though blocked by robots.txt” in GSC, most of which you are probably familiar with if you’ve ever worked on an e-commerce site:

  • “Add to cart” URLs
  • “Add to wishlist” URLs
  • Paginated review pages
  • Filtered category URLs
  • Internal search result pages
  • Old URLs with session IDs
  • Cart and checkout URLs

All of the above URL types had been added to the robots.txt file in the past because they were not supposed to be crawled. So far so good.
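
For illustration, a setup like that typically looks something like this in a robots.txt file (the rules below are a simplified, hypothetical sketch, not the site’s actual file):

User-agent: *
Disallow: /cart
Disallow: /wishlist
Disallow: /search
Disallow: /*?iPage=
Disallow: /*?filter
Disallow: /*?SID=

Google supports the * wildcard in robots.txt rules, so patterns like these can cover entire URL types at once.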

But was blocking them via robots.txt really a good decision? Apparently not, given that thousands of those URLs were indexed even though they were not being crawled.

Why do we normally block URLs via robots.txt?

Blocking URLs via robots.txt is a popular way to save crawl resources and to make sure that crawlers can focus their crawling on important pages. But a major issue with this method is that URLs that were already indexed before they were blocked will remain indexed.

Also, URLs that are blocked via robots.txt can still be indexed without being crawled, which sometimes occurs when they have links (internal or external) pointing to them.

So robots.txt really only solves the crawl resource problem, but it has no direct impact on indexing.

Do we really need to worry about “Indexed, though blocked by robots.txt”?

Google Search Console itself is pretty clear about this. In the “Indexed, though blocked by robots.txt” report, it says: “This is a non-critical issue”.

The main reason why we would want to fix this issue is to prevent URLs that are “Indexed, though blocked by robots.txt” from showing up in Google’s search results. In most cases we can assume that we don’t want URLs that are blocked from crawling to be indexed and to show up in the SERPs.

When a URL that is “Indexed, though blocked by robots.txt” shows up in Google’s search results, it looks like this example from the Dutch version of Zamnesia on zamnesia.nl, with a minimal snippet that claims that there is no information available for the page:

This is the only URL on Zamnesia that we forgot to unblock. Lucky for us, because otherwise we wouldn’t have a screenshot for this example 😉

If you have many important SEO issues to take care of, this is definitely one that you can leave for later. If, on the other hand, like Zamnesia, your site is already very well optimised, it is a good idea to take care of issues like this one to provide an optimal search experience for your users.

Is there an alternative to blocking URLs via robots.txt?

Here is an alternative approach that ensures URLs are crawled a lot less frequently and are also not indexed:

  1. Remove or mask all internal links pointing to the URLs.
  2. Set the URLs to “noindex” or redirect them permanently (depending on the URL type).
  3. Make sure that the URLs are not blocked via the robots.txt file.

This is basically what we did with all the URL types listed above, and what helped us bring the number of “affected pages” in the “Indexed, though blocked by robots.txt” report in GSC down to zero. At the same time, we made sure that we do not waste crawl resources and that we do not index URLs that we don’t want indexed.

Read on to learn about the details of each step for the URL types listed above.

“Add to cart” URLs

On Zamnesia, there is a URL type that adds a product to the shopping cart. Here is an example of such a URL:

https://www.zamnesia.com/cart?add&id_product=6000

When you access this URL with a browser, you are 302 redirected to the shopping cart and the product is added.

In the past, this type of URL was internally linked via the “add to cart” button. The URLs were blocked via the robots.txt file and the internal link was set to “nofollow” (which is generally not a good idea for internal links), but this did not keep Google from indexing these “add to cart” URLs.

We then made the following changes:

  • Remove internal links to “add to cart” URLs from “add to cart” buttons: The “add to cart” button still adds products to the cart, but it no longer contains an <a> element that links to an “add to cart” URL (see the sketch after this list).
  • Remove the rules from the robots.txt file that block “add to cart” URLs.
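
Here is a simplified before/after sketch of this change (the markup is illustrative, not our exact production code; the important part is that no <a> element points to an “add to cart” URL anymore – in our case, the link was replaced with a form):

Before:

<a href="https://www.zamnesia.com/cart?add&id_product=6000" rel="nofollow">Add to cart</a>

After:

<form method="post" action="/cart">
  <input type="hidden" name="id_product" value="6000">
  <button type="submit">Add to cart</button>
</form>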

By removing all internal links that point to this URL type, we made sure that Google does not discover new URLs of this type. The internal links had also been a strong crawling and indexing signal for URLs of this type, and that signal was now eliminated.

Next, we needed to decide how to make sure that Google also removes the URLs from the index. We found out that the site already handled requests from user agents that do not store cookies differently from user agents that do store cookies:

  • User agent stores cookies: 302 redirect to cart page and product is added to cart.
  • User agent does not store cookies: 301 redirect to product page.

Googlebot falls into the second category of user agents, so we decided to leave this setting as it was. Pages with a redirect are normally not considered indexed by Google, and the URLs now indeed show up as “Not indexed > Page with redirect” in GSC.
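
Schematically, a user agent that does not store cookies and requests https://www.zamnesia.com/cart?add&id_product=6000 receives roughly the following response (headers trimmed; I’m assuming the Location is the matching product page for id_product=6000):

HTTP/1.1 301 Moved Permanently
Location: https://www.zamnesia.com/6000-zamnesia-seeds-runtz-feminized.html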

So we managed to remove this URL type from the index, but what about the crawl resources that we wanted to save by blocking the URLs via robots.txt?

Google now does re-crawl the URLs once in a while, as it does with all known URLs that are not blocked via robots.txt. We’re fine with this though, as requesting a URL and getting back a 301 status code does not require many resources. Also, we can expect Googlebot to go easy on URLs that are not indexed due to a permanent redirect and that do not have any internal links pointing to them.

“Add to wishlist” URLs

The next URL type has a lot in common with the “add to cart” URLs discussed above. “Add to wishlist” URLs used to be linked internally on Zamnesia and they would redirect a user to the wishlist and add a product to it. Here is what this URL type looks like:

https://www.zamnesia.com/wishlist?id_product=6000&action=add_product

We made the same changes here that we made for “add to cart” URLs:

  • Remove internal links to “add to wishlist” URLs from “add to wishlist” icons and links: The “add to wishlist” icons and links still add products to the wishlist, but they no longer contain <a> elements that link to “add to wishlist” URLs.
  • Remove the rules from the robots.txt file that block “add to wishlist” URLs.

An important difference between “add to cart” and “add to wishlist” URLs on Zamnesia is that the feature that added a product to the wishlist when the URL was requested by a browser has also been retired. Now, “add to wishlist” URLs simply 301 redirect to the main wishlist URL, which is set to “noindex” via the robots meta tag:

https://www.zamnesia.com/wishlist
<meta name="robots" content="noindex,nofollow">

This behaviour is identical for all user agents, so Googlebot now receives a 301 to a URL that is set to “noindex” when it requests an “add to wishlist” URL.
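
In other words, the chain that Googlebot sees for an “add to wishlist” URL now looks roughly like this:

https://www.zamnesia.com/wishlist?id_product=6000&action=add_product
→ 301 redirect to https://www.zamnesia.com/wishlist
→ 200 OK with <meta name="robots" content="noindex,nofollow">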

After some consideration, we decided that this was good enough for our purposes and not to make any further changes. The “add to wishlist” URLs now show up as “Not indexed > Excluded by ‘noindex’ tag” in GSC. Google treats URLs that 301 to a page that is set to “noindex” as if they were set to “noindex” themselves.

Again, we can expect Googlebot to crawl these URLs less frequently, as they are no longer indexed and they have no internal links pointing to them.

A quick side note: Interestingly, we have managed to remove the first two page types from the index without adding a single “noindex” tag.

Paginated review pages

Most products on Zamnesia receive lots of reviews from customers. Reviews are embedded into product pages, but for each product there is also a separate review page that lists all reviews, like this example:

https://www.zamnesia.com/6000-zamnesia-seeds-runtz-feminized.html/reviews

The 50 latest reviews are shown on the main page, but all additional reviews are organised in a pagination series:

https://www.zamnesia.com/6000-zamnesia-seeds-runtz-feminized.html/reviews?iPage=2
https://www.zamnesia.com/6000-zamnesia-seeds-runtz-feminized.html/reviews?iPage=3

In the past, we felt that the crawl resources that were going into crawling all of this user-generated content could be used better in other places, so we blocked all paginated pages except for the first one via the robots.txt file.

It was enough for us to have one page rank for search queries like product name + reviews, but we did not think that all reviews, especially the older ones, had to be crawled and indexed.

The problem with this approach was, of course, that a robots.txt block only prevents crawling, but not indexing, so the paginated review pages remained indexed and ended up in the dreaded “Indexed, though blocked by robots.txt” report in GSC.

Here are the changes we made to fix this:

First, we masked the “next” and “previous” links between the paginated review pages, so that they are no longer followed by Google:

<div class="pagination_button" onclick="if (!window.__cfRLUnblockHandlers) return false; javascript: document.location.href='/6000-zamnesia-seeds-runtz-feminized.html/reviews?iPage=2'">Next »</div>

If you work in technical SEO, you might find this slightly funny, as it is not uncommon to come across such a “link” on a website and to have to ask the dev team to change it into a proper link that search engines can follow. In this case, we want to achieve the exact opposite.
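
For contrast, a “proper” crawlable version of the same pagination control would simply be an <a> element like this, which is exactly what we do not want here:

<a href="/6000-zamnesia-seeds-runtz-feminized.html/reviews?iPage=2">Next »</a>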

Now that we have no more internal links pointing to paginated review pages, we can set them to “noindex” via the robots meta tag and lift the robots.txt block.

The paginated review pages now show up as “Not indexed > Excluded by ‘noindex’ tag” in Google Search Console – Mission accomplished.

Filtered category URLs

This is a URL type that probably exists on most e-commerce sites and has caused headaches for many technical SEO teams.

On Zamnesia, category filters used to link to URLs with parameters that reflected the selected filters, like this example:

https://www.zamnesia.com/33-headshop/38-bong?filter&material=metal

The links to these URLs were set to “nofollow” (again, using “nofollow” on internal links is generally not a good idea) and the URLs were blocked via the robots.txt file. You already know what this means: They were indexed all the same.

We made changes that are very similar to some of the ones described for previous URL types:

  • Mask the filter links so that they are no longer followed by search engines: A click on a filter will still take the user to the filter URL with parameters, but there is no longer an <a> element that links to the URL.
  • Set the URLs to “noindex” via the X-Robots-Tag HTTP header (a schematic response follows after this list) – We picked this method instead of the meta robots tag in the HTML because some old filter URLs had been redirected in the meantime and we wanted one easy solution for all filter URLs. A URL that redirects to another URL cannot have a meta robots tag, but it can be set to “noindex” via the HTTP header.
  • Remove the rules from the robots.txt file that block filtered category URLs.
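
Schematically, a filtered category URL now carries the directive in the HTTP response headers rather than in the HTML (headers trimmed, values illustrative):

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
X-Robots-Tag: noindex

Because the directive travels in the response header, it can also be sent with the 301 responses of the old filter URLs that no longer return any HTML.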

Again, mission accomplished: All filtered category URLs now show up as “Not indexed > Excluded by ‘noindex’ tag” in GSC.

Internal search result pages

Internal search result pages were not internally linked on Zamnesia, but some still showed up as “Indexed, though blocked by robots.txt” – Maybe because they were linked internally in the distant past, or externally from other websites.

Here is an example of an internal search result URL on Zamnesia:

https://www.zamnesia.com/search?search_query=gorilla+glue&orderby=position&orderway=desc

The required changes in this case were minimal:

  • Set internal search result pages to “noindex” via the robots meta tag.
  • Remove the rules from the robots.txt file that block internal search result pages.

Internal search result pages now show up as “Not indexed > Excluded by ‘noindex’ tag” in Google Search Console.

Old URLs with session IDs

At some stage in the past, like many e-commerce sites that have been around for a while, Zamnesia used URLs with session IDs:

https://www.zamnesia.com/blog-276p53cannabis-c16?SID=6boejop0t6qeqciknbfpng6gpm

Session IDs are an SEO nightmare, as a new one is normally generated for each session (hence the name) and they are often appended to internal links. This can result in a very high number of URLs with different session IDs for the same page.

Luckily, session ID URLs are a thing of the past on Zamnesia, but some of them are still known to Google or linked externally from other websites.

In order to remove this URL type from the “Indexed, though blocked by robots.txt” report, we decided to simply remove the blocking rules from the robots.txt file, as all session ID URLs were already 301 redirecting to their equivalents without session IDs.
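
Schematically, a session ID URL now simply 301 redirects to its equivalent without the SID parameter:

https://www.zamnesia.com/blog-276p53cannabis-c16?SID=6boejop0t6qeqciknbfpng6gpm
→ 301 redirect to https://www.zamnesia.com/blog-276p53cannabis-c16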

Just like “add to cart” URLs, they now show up as “Not indexed > Page with redirect” in GSC.

Cart and checkout URLs

Cart and checkout pages are essential page types in every online store and they normally receive internal links from all other pages across the site.

As an attentive reader of this case study, you can already guess what we did to make cart and checkout pages disappear from the “Indexed, though blocked by robots.txt” report in GSC:

  • Mask all internal links to cart and checkout pages, so that they continue to work for users, but are not followed by Google.
  • Set cart and checkout pages to “noindex”.
  • Remove the rules from the robots.txt file that block cart and checkout URLs.

Cart and checkout URLs now show up as “Not indexed > Excluded by ‘noindex’ tag” in Google Search Console.

Summary: What to do instead of blocking URLs via robots.txt

It would be repetitive to go into detail about even more URL types that showed up in the “Indexed, though blocked by robots.txt” report in GSC, as it would all boil down to the following advice:

  • Make sure that the URLs in question are not internally linked. If they have to be internally linked for users, mask the links, so that search engines don’t follow them.
  • Set the URLs to “noindex” or redirect them to targets that are either supposed to be indexed, or that are set to “noindex” (depending on the case).
  • Lift the robots.txt block for the URLs in question.

By following these steps, you will make sure that the URLs are no longer indexed. You will not prevent them from being crawled, but you can expect Googlebot to crawl them a lot less frequently than important URLs, as they are no longer indexed and have no internal links pointing to them.

That’s it! If you have any questions or thoughts to share, please leave a comment under this article.

I would like to thank the amazing team at Zamnesia for allowing me to publish this case study, the brilliant developers who worked on this project for implementing all suggested changes, and SEO legend Esteve Castells for providing valuable feedback that helped me improve this article.

30 responses

  1. Regarding “Internal search result pages”, have you checked whether these pages aren’t bringing you traffic? I read an article a few weeks ago about “searchdexing” and thought it was a good idea… even if you have to consider cannibalization before implementing it.

    1. Hi Didier,

      I believe it’s a great idea to use internal search data to create new indexable pages about topics that users searched for, but didn’t find on your website. In most scenarios, I would not recommend indexing internal search results though.

      Thank you very much for your comment!

      Eoghan

  2. Now that they have dropped from the index, can you not go back and apply the robots.txt directives?

    1. Hi Chris,

      Thanks for your question. I have considered it, but I don’t think I would want to. My goal is to optimise the site so that it does not need to block many URLs via the robots.txt file. The solutions described in this article are not perfect, but I would prefer to continue to work in this direction, rather than going back to blocking URLs from crawling. Using robots.txt often feels like you’re fighting symptoms, instead of fixing the root cause of the problem. Ideally, I would want a site that simply doesn’t generate any URLs that I don’t want to be crawled, indexed, etc.

      Best regards,

      Eoghan

  3. Super insightful case study!

    Q1: How do you noindex all internal search results pages since there could be a large number of URL combinations?

    Q2: The wishlist URLs redirect to the main wishlist page that still has the previously added products for a user, right? It doesn’t refresh their WL.

    1. Hi Junaid,

      Thanks, I’m glad you found it insightful.

      A1: You can normally add a “noindex” robots meta tag to a page template. I assume most sites would use the same template for all internal search result pages, so you don’t have to edit them one by one.
      A2: You’re correct, the wishlist URLs in this example do not work like they used to, so they do not add products to the wishlist anymore.

      Best regards,

      Eoghan

  4. Rémi Nestasio

    Hi,

    Interesting article, thanks!
    These are classic problems for large e-commerce businesses.

    You are talking about “hiding internal links”, but which method did you use or recommend?

    1. Hi Rémi,

      Thanks for your comment, I’m glad you found the article interesting. I am not able to recommend a specific method, but the selected method should meet the following requirements:
      – No HTML link (HTML “a” element) pointing to the URLs in question (e.g., in the case of the “add to cart” buttons above, the link was replaced with a form).
      – Ideally, the URL should not even be included in the HTML (pre- or post-rendering) anymore, so the “onclick” example for the paginated review pages above is not a 100% optimal solution (although I currently don’t have any evidence that Google uses this type of element for URL discovery).

      Best regards,

      Eoghan

  5. Ahmet Çelik

    So, do I remove the relevant rules from the robots.txt so that the pages are not indexed? And what does masking the links mean?

    1. Hi Ahmet,

      Yes, the idea is to remove the rules from the robots.txt that block the URLs in question, but only after making sure that there are no longer any internal links pointing to them and that they are either redirected or set to “noindex”.

      “Masking links”, in this context, means to change the links so that they continue to work for users but are no longer recognised as links by search engines and that the source code ideally does not contain the URLs any longer.

      Best regards,

      Eoghan

  6. Good read!
    Question: you recommend adding a “noindex” tag; however, if you put a “noindex” tag on a page that is blocked by robots.txt, Google won’t see the “noindex” tag, as the page is prevented from being crawled by the robots.txt. Isn’t that right?

    1. Hi Nadav! Thanks for your question. My recommendations also include removing the robots.txt block. You’re right, if the URL remained blocked via robots.txt, then the “noindex” directive would not be detected.

  7. Hi Eoghan, thanks for sharing your tips. Could you please say more about your comments on user agents, cookies and the 301 or 302 redirect behaviour? What do you mean when you say you found out that the site already handled requests from user agents that do not accept cookies differently from user agents that do accept cookies, in relation to the ‘add to cart’ pages, with Googlebot and/or others that do not accept cookies being 301 redirected to the product page? Could you show this in any way via screenshots or perhaps say a bit more about this? Thanks.

    1. Hi Emer! Thanks for your interesting questions.

      After thinking about your question about user agents accepting cookies, I decided to change the wording in that paragraph from “accept” cookies to “store” cookies, as I realised that “accept” might be confused with accepting cookies in a cookie consent banner context.

      In this case, it is not about cookie consent banners, but about whether a user agent that requests a URL stores cookies or not. A web browser, in its default settings, normally does store cookies, but Googlebot does not. You can change that behaviour in your browser’s settings, and if you use a crawling tool like Screaming Frog, you can also decide whether you want it to store cookies or not (the default setting in Screaming Frog is not to store cookies).

      Some websites might treat user agents that do not store cookies differently from ones that do, especially if the feature in question does not work without cookies. This is a rare case, but if you request the “add to cart” example URL from the article in your browser and check the status code in the “Network” tab of your browser’s developer tools, you will see a 302 status code. If you check the same URL in Screaming Frog, you will see a 301 status code. You can then change the settings of your browser to not store cookies and change the Screaming Frog setting to store them, and you will see that the status codes will be reversed (301 in your browser and 302 in Screaming Frog).

      As we know that Googlebot does not store cookies and thus gets a 301 status code on the URLs in question, which is exactly what we want in this case, we decided that there was no further need for action here.

      Please let me know if you have any further questions here.

      Best regards,

      Eoghan

  8. Hey Eoghan, could you say something about adding both the meta robots and the X-Robots-Tag noindex methods together on the same URLs, e.g. on the filter URLs. Is this to cover all search engines? Is it necessary to add both methods? Thank you.

    1. Hi Emer,

      No, it is not necessary and not recommended to use both the HTTP header method and the HTML method to add a “noindex” directive to the same URL. I would recommend picking only one of them, and the only difference that I am aware of is that you can only include “noindex” in the HTML if there is an HTML response – Which is not the case for URLs that redirect or for other file types like PDFs, images etc. – In those cases, it makes sense to use the HTTP header.

      Best regards,

      Eoghan

  9. Hello Eoghan,

    This is an interesting case study and I recognise myself in trying to bring down the reported number of entries in the GSC reports :p

    I have one question: did you notice any improvement in terms of ranking or traffic that you could associate with your hard work on the non-indexable pages you managed to remove from Google’s index using the different means you described?

    Sometimes we spend a lot of time and effort on such tasks, which don’t necessarily produce a nice return on investment. Did you notice anything in your SEO traffic after your changes?

    Thanks again for sharing your experience!
    Xavier

    1. Hi Xavier,

      Thank you very much. I’m glad you found it interesting.

      I would not expect a direct impact on SEO performance from an optimisation like this and even if there was, it would be very difficult to measure.

      Best regards,

      Eoghan

  10. Hi Eoghan,

    One of the best articles I’ve read in a long time. I recently did almost the same process for a client and ended up with 0 “Indexed, though blocked by robots.txt” but 19.8K “Excluded by ‘noindex’ tag” URLs.

    I thought that Google would drop these URLs after some time, but after some months they are still in GSC.

    Any insight on removing them?

    1. Hi Luis,

      Thank you very much. I’m glad you liked it.

      I would expect the URLs to just stay with the status “Excluded by ‘noindex’ tag”. Google never seems to completely “forget” about a URL. As long as there are no internal links pointing to ‘noindex’ URLs, I guess it’s fine.

      Best regards,

      Eoghan

  11. Excellent post. Is there any data on the impact this may have had on crawl, indexation and/or ranking of the website’s URLs?

    1. Hi Daniel,

      Thank you very much. The URLs in question were not crawled at all before, as they were blocked by robots.txt, and now they are crawled once in a while. They are no longer indexed and they do not rank (which is intended).

      There is no measurable impact on the performance of other URLs on the website, but this is not to be expected with an optimisation like this one.

      Best regards,

      Eoghan

      1. Shouldn’t this at least improve the crawling of other URLs (by reducing wasted crawl budget)?

        1. It’s not really an improvement for crawl resource waste – Blocking URLs via robots.txt is already very efficient at that, as it stops crawling entirely. This case is about an alternative approach to saving crawl resources that has fewer negative side effects. From my point of view, the biggest advantages of this approach are preventing blocked URLs from showing up in the SERPs and, more importantly, no longer having internal links point to URLs that we don’t want to be crawled or indexed (or both).

  12. This is the most comprehensive case study about this error and the crawl budget issues related to it. What method did you use for masking URLs? Can you provide some resources about this?

    1. Thank you, Salman.

      I don’t know of any available resources about masking links that I could recommend. Please see my reply to Rémi above for some general requirements.

      Best regards,

      Eoghan

  13. Nice! I came across this article because I’m experiencing the same issue, but in the opposite direction. The pages I see under “Indexed, though blocked by robots.txt” are those that I’d like to have indexed (e.g. even the homepage appears there) and shown on Google – so what happens now is that I do see them on Google, but with this notorious “No information is available for this page.” message (also without the image, reviews, …). I did check, and robots.txt definitely does not block these pages (I used Google’s checker tool in order to verify). Any idea what it could be? Some pages are displayed ok, but not all of them. It’s an e-commerce site with multilingual enabled. The homepage in other languages (e.g. French, foo.com/fr) is displayed ok on Google, but the default one (e.g. foo.com) is not. I use “hreflang” attributes in order to distinguish between the languages, if that’s even related.

    1. Hi R! This sounds like a strange case indeed. If GSC shows the pages in question under “Indexed, though blocked by robots.txt” and if they also show up in the SERPs with the note “No information is available for this page.”, then it is very likely that Google does interpret them as blocked via robots.txt for some reason – even if the GSC robots.txt testing tool suggests otherwise. If you like, I’d be happy to have a closer look. You can send me an email at eoghan at rebelytics dot com.

  14. Why is it not a good idea to nofollow internal links?

    1. Hi Sanjay! Thanks for your question.

      I believe that using “nofollow” on internal links is a bad idea in this context (preventing linked URLs from being crawled and indexed), as it is a pretty weak signal that might just be ignored by search engines.

      Generally speaking, it is also a bad idea to use “nofollow” on internal links as that’s not really what it’s meant to be used for – It was originally introduced for links between different websites.

      Whenever there’s a reason to use “nofollow” on internal links, there’s normally a better and more reliable solution.

      That being said, using “nofollow” might work in some cases, like in this great case study by Adam Gent, but I’d still prefer a solution that works reliably in all cases.
