When we decided to work on the “Indexed, though blocked by robots.txt” issue in Google Search Console for zamnesia.com, I didn’t think we would get a clean sheet, but here we are:
In this case study, I discuss the different URL types we came across in the “Indexed, though blocked by robots.txt” report and how we tackled each one of them.
I will also explain why blocking URLs via robots.txt is not always a good idea and I will present alternatives to blocking URLs via robots.txt.
Let’s be clear: having zero “affected pages” in the “Indexed, though blocked by robots.txt” report is not an achievement in itself. You could pull this off by simply not blocking any URLs in the robots.txt file, since URLs that are not blocked by robots.txt cannot be “Indexed, though blocked by robots.txt”.
The interesting part of this case study is what we did instead of blocking URLs via robots.txt.
In addition to the above, here are some more things that you will learn in this article:
- Blocking URLs via robots.txt is not the only solution for saving crawl resources and it does not prevent URLs from being indexed.
- A URL that redirects to a “noindex” URL is considered a “noindex” URL by Google.
- Redirected URLs are treated as ‘not indexed’, and fetching a URL that returns a 3xx status code does not require many resources on Google’s end.
- Internal links can be masked, so that they continue to work for users, but are not followed by Google.
Still interested? Let’s dive right in!
An e-commerce site with typical e-commerce SEO challenges
Zamnesia is an international e-commerce website with thousands of URLs and it faces lots of technical SEO issues that are typical for big online shops.
Here is a list of some of the URL types that were showing up as “Indexed, though blocked by robots.txt” in GSC, most of which you are probably familiar with if you’ve ever worked on an e-commerce site:
- “Add to cart” URLs
- “Add to wishlist” URLs
- Paginated review pages
- Filtered category URLs
- Internal search result pages
- Old URLs with session IDs
- Cart and checkout URLs
All of the above URL types had been added to the robots.txt file in the past because they were not supposed to be crawled. So far so good.
But was blocking them via robots.txt really a good decision? Apparently not, if thousands of those URLs were indexed, even if they were not being crawled.
Why do we normally block URLs via robots.txt?
Blocking URLs via robots.txt is a popular way to save crawl resources and to make sure that crawlers can focus their crawling on important pages. But a major issue with this method is that URLs that were already indexed before they were blocked will remain indexed.
Also, URLs that are blocked via robots.txt can still be indexed without being crawled, which sometimes occurs when they have links (internal or external) pointing to them.
So robots.txt really only solves the crawl resource problem, but it has no direct impact on indexing.
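To illustrate the point, here is a minimal Python sketch using the standard library’s urllib.robotparser. The Disallow rules below are assumptions made up for this example, as the actual rules from Zamnesia’s robots.txt file are not shown in this article. The sketch shows that robots.txt only answers the question “may this URL be fetched?” and has no way of expressing “do not index this URL”.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration, not Zamnesia's actual robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart
Disallow: /search
"""


def is_crawl_allowed(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow the user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = "https://www.zamnesia.com/cart?add&id_product=6000"
allowed = "https://www.zamnesia.com/33-headshop/38-bong"

# The rules only decide whether a URL may be crawled.
# Nothing in robots.txt states whether a URL may be indexed.
print(is_crawl_allowed(blocked))
print(is_crawl_allowed(allowed))
```

Note that urllib.robotparser does simple prefix matching, so this sketch deliberately avoids wildcard rules.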
Do we really need to worry about “Indexed, though blocked by robots.txt”?
Google Search Console itself is pretty clear about this. In the “Indexed, though blocked by robots.txt” report, it says: “This is a non-critical issue”.
The main reason why we would want to fix this issue is to prevent URLs that are “Indexed, though blocked by robots.txt” from showing up in Google’s search results. In most cases we can assume that we don’t want URLs that are blocked from crawling to be indexed and to show up in the SERPs.
When a URL that is “Indexed, though blocked by robots.txt” shows up in Google’s search results, it looks like this example from the Dutch version of Zamnesia on zamnesia.nl, with a minimal snippet that claims that there is no information available for the page:
If you have many important SEO issues to take care of, this is definitely one that you can leave for later. If, on the other hand, like Zamnesia, your site is already very well optimised, it is a good idea to take care of issues like this one to provide an optimal search experience for your users.
Is there an alternative to blocking URLs via robots.txt?
An alternative approach to making sure that URLs are crawled a lot less frequently and also not indexed is the following:
- Remove or mask all internal links pointing to the URLs.
- Set the URLs to “noindex” or redirect them permanently (depending on the URL type).
- Make sure that the URLs are not blocked via the robots.txt file.
This is basically what we did with all the URL types listed above, and what helped us bring the number of “affected pages” in the “Indexed, though blocked by robots.txt” report in GSC down to zero. At the same time, we made sure that we do not waste crawl resources and that we do not index URLs that we don’t want indexed.
Read on to learn about the details of each step for the URL types listed above.
“Add to cart” URLs
On Zamnesia, there is a URL type that adds a product to the shopping cart. Here is an example of such a URL:
https://www.zamnesia.com/cart?add&id_product=6000
When you access this URL with a browser, you are 302 redirected to the shopping cart and the product is added.
In the past, this type of URL was internally linked via the “add to cart” button. The URLs were blocked via the robots.txt file and the internal link was set to “nofollow” (which is generally not a good idea for internal links), but this did not keep Google from indexing these “add to cart” URLs.
We then made the following changes:
- Remove internal links to “add to cart” URLs from “add to cart” buttons: The “add to cart” button still adds products to the cart, but it no longer contains an <a> element that links to an “add to cart” URL.
- Remove the rules from the robots.txt file that block “add to cart” URLs.
By removing all internal links that point to this URL type, we made sure that Google does not discover new URLs of this type. The internal links were also a strong indexing & crawling signal for the URLs of this type, which was now eliminated.
Next, we needed to decide how to make sure that Google also removes the URLs from the index. We found out that the site already handled requests from user agents that do not store cookies differently from requests from user agents that do store cookies:
- User agent stores cookies: 302 redirect to cart page and product is added to cart.
- User agent does not store cookies: 301 redirect to product page.
Googlebot falls into the second category of user agents, so we decided to leave this setting as it was. URLs that redirect are normally not considered indexed by Google, and the “add to cart” URLs now actually show up as “Not indexed > Page with redirect” in GSC.
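The behaviour described above could be sketched roughly as follows. This is not Zamnesia’s actual implementation: the way a cookie-storing client is detected (any cookie present in the request), the cart path and the product lookup table are all assumptions made for this example.

```python
# Hypothetical lookup table; a real shop would query its product database.
PRODUCT_PATHS = {6000: "/6000-zamnesia-seeds-runtz-feminized.html"}

CART_PATH = "/cart"  # assumed path of the cart page


def add_to_cart_response(product_id: int, request_cookies: dict) -> tuple:
    """Return (status_code, redirect_target) for an "add to cart" request."""
    if request_cookies:
        # Client sends cookies back (a typical browser): the product would be
        # added to the cart here, followed by a temporary redirect to the cart.
        return 302, CART_PATH
    # Client does not store cookies (e.g. Googlebot): permanent redirect
    # to the product page, nothing is added to any cart.
    return 301, PRODUCT_PATHS[product_id]


print(add_to_cart_response(6000, {"session": "abc123"}))
print(add_to_cart_response(6000, {}))
```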
So we managed to remove this URL type from the index, but what about the crawl resources that we wanted to save by blocking the URLs via robots.txt?
Google now does re-crawl the URLs once in a while, as it does with all known URLs that are not blocked via robots.txt. We’re fine with this though, as requesting a URL and getting back a 301 status code does not require many resources. Also, we can expect Googlebot to go easy on URLs that are not indexed due to a permanent redirect and that do not have any internal links pointing to them.
“Add to wishlist” URLs
The next URL type has a lot in common with the “add to cart” URLs discussed above. “Add to wishlist” URLs used to be linked internally on Zamnesia and they would redirect a user to the wishlist and add a product to it. Here is what this URL type looks like:
https://www.zamnesia.com/wishlist?id_product=6000&action=add_product
We made the same changes here that we made for “add to cart” URLs:
- Remove internal links to “add to wishlist” URLs from “add to wishlist” icons and links: The “add to wishlist” icons and links still add products to the wishlist, but they no longer contain <a> elements that link to “add to wishlist” URLs.
- Remove the rules from the robots.txt file that block “add to wishlist” URLs.
An important difference between “add to cart” and “add to wishlist” URLs on Zamnesia is that the feature that adds a product to the wishlist when the URL is requested by a browser was also retired. Now, “add to wishlist” URLs simply 301 redirect to the main wishlist URL, which is set to “noindex” via the robots meta tag:
https://www.zamnesia.com/wishlist
<meta name="robots" content="noindex,nofollow">
This behaviour is identical for all user agents, so Googlebot now receives a 301 to a URL that is set to “noindex” when it requests an “add to wishlist” URL.
After some consideration, we decided that this was good enough for our purposes and not to make any further changes. The “add to wishlist” URLs now show up as “Not indexed > Excluded by ‘noindex’ tag” in GSC. Google treats URLs that 301 to a page that is set to “noindex” as if they were set to “noindex” themselves.
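The two outcomes we observed in GSC so far can be expressed as a small model. The sketch below is only a simplified reading of the statuses reported in this case study, not a description of how Google works internally. The Response class, the reported_status function and the “Indexable” label are names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional

REDIRECT_CODES = {301, 302, 307, 308}


@dataclass
class Response:
    status: int
    location: Optional[str] = None  # redirect target, if any
    noindex: bool = False           # "noindex" via meta tag or HTTP header


def reported_status(path: str, responses: dict, max_hops: int = 5) -> str:
    """Follow redirects and return the status we would expect to see in GSC."""
    response = responses[path]
    redirected = False
    hops = 0
    while True:
        if response.noindex:
            # A "noindex" anywhere along the chain wins.
            return "Excluded by 'noindex' tag"
        is_redirect = response.status in REDIRECT_CODES and response.location is not None
        if not is_redirect or hops == max_hops:
            break
        response = responses[response.location]
        redirected = True
        hops += 1
    return "Page with redirect" if redirected else "Indexable"


responses = {
    "/cart?add&id_product=6000": Response(301, "/6000-zamnesia-seeds-runtz-feminized.html"),
    "/6000-zamnesia-seeds-runtz-feminized.html": Response(200),
    "/wishlist?id_product=6000&action=add_product": Response(301, "/wishlist"),
    "/wishlist": Response(200, noindex=True),
}

for path in responses:
    print(path, "->", reported_status(path, responses))
```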
Again, we can expect Googlebot to crawl these URLs less frequently, as they are no longer indexed and they have no internal links pointing to them.
A quick side note: Interestingly, we have managed to remove the first two page types from the index without adding a single “noindex” tag.
Paginated review pages
Most products on Zamnesia receive lots of reviews from customers. Reviews are embedded into product pages, but for each product there is also a separate review page that lists all reviews, like this example:
https://www.zamnesia.com/6000-zamnesia-seeds-runtz-feminized.html/reviews
The 50 latest reviews are shown on the main page, but all additional reviews are organised in a pagination series:
https://www.zamnesia.com/6000-zamnesia-seeds-runtz-feminized.html/reviews?iPage=2
https://www.zamnesia.com/6000-zamnesia-seeds-runtz-feminized.html/reviews?iPage=3
In the past, we felt that the crawl resources that were going into crawling all of this user-generated content could be used better in other places, so we blocked all paginated pages except for the first one via the robots.txt file.
It was enough for us to have one page rank for search queries like product name + reviews, but we did not think that all reviews, especially the older ones, had to be crawled and indexed.
The problem with this approach was, of course, that a robots.txt block only prevents crawling, but not indexing, so the paginated review pages remained indexed and ended up in the dreaded “Indexed, though blocked by robots.txt” report in GSC.
Here are the changes we made to fix this:
First, we masked the “next” and “previous” links between the paginated review pages, so that they are no longer followed by Google:
<div class="pagination_button" onclick="javascript: document.location.href='/6000-zamnesia-seeds-runtz-feminized.html/reviews?iPage=2'">Next »</div>
If you work in technical SEO, you might find this slightly funny, as it is not uncommon to come across such a “link” on a website and to have to ask the dev team to change it into a proper link that search engines can follow. In this case, we want to achieve the exact opposite.
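A simple way to check that a masked button no longer exposes a link is to extract all <a href> targets from the HTML, as a link-following crawler would. The following Python sketch does this with the standard library’s html.parser. It assumes that only <a> elements with an href attribute count as links, which is a simplification, and the crawlable_links helper is a name made up for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def crawlable_links(html: str) -> list:
    """Return all link targets found in <a href="..."> elements."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Masked button as used on the paginated review pages.
masked = """<div class="pagination_button" onclick="document.location.href='/6000-zamnesia-seeds-runtz-feminized.html/reviews?iPage=2'">Next »</div>"""

# The same target as a regular link, for comparison.
regular = """<a href="/6000-zamnesia-seeds-runtz-feminized.html/reviews?iPage=2">Next »</a>"""

print(crawlable_links(masked))   # no <a href>, so nothing to follow
print(crawlable_links(regular))  # the review page URL is exposed as a link
```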
Now that we have no more internal links pointing to paginated review pages, we can set them to “noindex” via the robots meta tag and lift the robots.txt block.
The paginated review pages now show up as “Not indexed > Excluded by ‘noindex’ tag” in Google Search Console – Mission accomplished.
Filtered category URLs
This is a URL type that probably exists on most e-commerce sites and has caused headaches for many technical SEO teams.
On Zamnesia, category filters used to link to URLs with parameters that reflected the selected filters, like this example:
https://www.zamnesia.com/33-headshop/38-bong?filter&material=metal
The links to these URLs were set to “nofollow” (again, using “nofollow” on internal links is generally not a good idea) and the URLs were blocked via the robots.txt file. You already know what this means: They were indexed all the same.
We made changes that are very similar to some of the ones described for previous URL types:
- Mask the filter links so that they are no longer followed by search engines: A click on a filter will still take the user to the filter URL with parameters, but there is no longer an <a> element that links to the URL.
- Set the URLs to “noindex” via the X-Robots-Tag HTTP header – We picked this method instead of the meta robots tag in the HTML because some old filter URLs had been redirected in the meantime and we wanted one easy solution for all filter URLs. A URL that redirects to another URL cannot have a meta robots tag, but it can be set to “noindex” via the HTTP header.
- Remove the rules from the robots.txt file that block filtered category URLs.
Again, mission accomplished: All filtered category URLs now show up as “Not indexed > Excluded by ‘noindex’ tag” in GSC.
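Here is a rough Python sketch of the header-based approach. It assumes that filter URLs can be recognised by the “filter” query parameter seen in the example URL above, and the function names are invented for this example. The point is that the header is added regardless of the status code, so it also covers filter URLs that respond with a redirect.

```python
from urllib.parse import parse_qs, urlsplit


def is_filter_url(url: str) -> bool:
    """Assumption: filter URLs carry a (possibly valueless) "filter" parameter."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return "filter" in query


def with_robots_header(url: str, headers: dict) -> dict:
    """Return a copy of the response headers, adding "noindex" for filter URLs."""
    result = dict(headers)
    if is_filter_url(url):
        result["X-Robots-Tag"] = "noindex"
    return result


filter_url = "https://www.zamnesia.com/33-headshop/38-bong?filter&material=metal"
category_url = "https://www.zamnesia.com/33-headshop/38-bong"

# Works for a regular page ...
print(with_robots_header(filter_url, {"Content-Type": "text/html"}))
# ... and for a redirect, which has no HTML body to hold a meta tag.
print(with_robots_header(filter_url, {"Location": "/33-headshop/38-bong"}))
# Unfiltered category pages are left untouched.
print(with_robots_header(category_url, {"Content-Type": "text/html"}))
```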
Internal search result pages
Internal search result pages were not internally linked on Zamnesia, but some still showed up as “Indexed, though blocked by robots.txt”, maybe because they were linked internally in the distant past, or externally from other websites.
Here is an example of an internal search result URL on Zamnesia:
https://www.zamnesia.com/search?search_query=gorilla+glue&orderby=position&orderway=desc
The required changes in this case were minimal:
- Set internal search result pages to “noindex” via the robots meta tag.
- Remove the rules from the robots.txt file that block internal search result pages.
Internal search result pages now show up as “Not indexed > Excluded by ‘noindex’ tag” in Google Search Console.
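If you want to verify that a page actually carries the “noindex” directive in its robots meta tag, a small check like the following Python sketch can help. The has_noindex helper is a name made up for this example, and it only looks at the generic “robots” meta tag, not at crawler-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of all <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the robots meta tag contains "noindex" (or "none")."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    # "none" is shorthand for "noindex, nofollow".
    return bool(parser.directives & {"noindex", "none"})


page_with_tag = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
page_without_tag = "<html><head><title>Search</title></head></html>"

print(has_noindex(page_with_tag))
print(has_noindex(page_without_tag))
```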
Old URLs with session IDs
At some stage in the past, like many e-commerce sites that have been around for a while, Zamnesia used URLs with session IDs:
https://www.zamnesia.com/blog-276p53cannabis-c16?SID=6boejop0t6qeqciknbfpng6gpm
Session IDs are an SEO nightmare, as a new one is normally generated for each session (hence the name) and they are often appended to internal links. This can result in a very high number of URLs with different session IDs for the same page.
Luckily, session ID URLs are a thing of the past on Zamnesia, but some of them are still known to Google or linked externally from other websites.
In order to remove this URL type from the “Indexed, though blocked by robots.txt” report, we decided to simply remove the blocking rules from the robots.txt file, as all session ID URLs were already 301 redirecting to their equivalents without session IDs.
Just like “add to cart” URLs, they now show up as “Not indexed > Page with redirect” in GSC.
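For illustration, this is roughly how a redirect target without the session ID could be derived in Python. It is a sketch, not the logic that runs on Zamnesia: the parameter name “SID” is taken from the example URL above, and the strip_session_id function is a name invented for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url: str, param: str = "SID") -> str:
    """Return the URL without the session ID parameter."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    # Note: re-encoding may normalise the remaining query string,
    # e.g. a bare "filter" parameter would come back as "filter=".
    return urlunsplit(parts._replace(query=urlencode(kept)))


old_url = "https://www.zamnesia.com/blog-276p53cannabis-c16?SID=6boejop0t6qeqciknbfpng6gpm"
target = strip_session_id(old_url)

# A server would answer requests for old_url with a 301 to this target.
print(target)
```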
Cart and checkout URLs
Cart and checkout pages are essential page types in every online store and they normally receive internal links from all other pages across the site.
As an attentive reader of this case study, you can already guess what we did to make cart and checkout pages disappear from the “Indexed, though blocked by robots.txt” report in GSC:
- Mask all internal links to cart and checkout pages, so that they continue to work for users, but are not followed by Google.
- Set cart and checkout pages to “noindex”.
- Remove the rules from the robots.txt file that block cart and checkout URLs.
Cart and checkout URLs now show up as “Not indexed > Excluded by ‘noindex’ tag” in Google Search Console.
Summary: What to do instead of blocking URLs via robots.txt
It would be repetitive to go into detail about even more URL types that showed up in the “Indexed, though blocked by robots.txt” report in GSC, as it would all boil down to the following advice:
- Make sure that the URLs in question are not internally linked. If they have to be internally linked for users, mask the links, so that search engines don’t follow them.
- Set the URLs to “noindex” or redirect them to targets that are either supposed to be indexed, or that are set to “noindex” (depending on the case).
- Lift the robots.txt block for the URLs in question.
By following these steps, you will make sure that the URLs are no longer indexed. You will not prevent them from being crawled, but you can expect Googlebot to crawl them a lot less frequently than important URLs, as they are no longer indexed and have no internal links pointing to them.
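The three steps can also be turned into a small checklist. The Python sketch below assumes that you have already collected the three facts about a URL yourself (for example from a crawl and your robots.txt file). The UrlState class and the remaining_actions function are names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class UrlState:
    has_crawlable_internal_links: bool  # <a href> links that are not masked
    noindex_or_redirect: bool           # set to "noindex" or redirected
    blocked_by_robots_txt: bool         # disallowed in robots.txt


def remaining_actions(state: UrlState) -> list:
    """Return the steps from the summary that are still open for a URL."""
    actions = []
    if state.has_crawlable_internal_links:
        actions.append("Remove or mask internal links")
    if not state.noindex_or_redirect:
        actions.append('Set to "noindex" or redirect')
    if state.blocked_by_robots_txt:
        actions.append("Lift the robots.txt block")
    return actions


before = UrlState(has_crawlable_internal_links=True, noindex_or_redirect=False, blocked_by_robots_txt=True)
after = UrlState(has_crawlable_internal_links=False, noindex_or_redirect=True, blocked_by_robots_txt=False)

print(remaining_actions(before))
print(remaining_actions(after))
```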
That’s it! If you have any questions or thoughts to share, please leave a comment under this article.
I would like to thank the amazing team at Zamnesia for allowing me to publish this case study, the brilliant developers who worked on this project for implementing all suggested changes, and SEO legend Esteve Castells for providing valuable feedback that helped me improve this article.