I largely agree with Mike Ciffone's answer:
- Your faceted navigation is generating a very large number of URLs due to combinations of filters.
- Screaming Frog has to crawl so many URLs that it is running out of memory.
- While search engine bots won't run out of memory, they will have trouble crawling your site, possibly exhausting their crawl budget without finding your important pages. You need to do something to limit the amount of crawling they do.
Use robots.txt
The first line of defense is to prevent crawling with robots.txt. Exactly what rule you should use depends on what your URLs look like with navigation filter parameters. If your URLs look like:
/filter?category=shirts&color=blue
You could use:
Disallow: /filter
However, if your filter parameters are appended directly to your category page URLs, like
/shirts?color=blue
then you could use wildcards that disallow crawling of all URLs with parameters:
Disallow: /*?
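Either way, the Disallow rule needs to sit in a group under a User-agent line. As a minimal sketch of a complete robots.txt for the wildcard case, applying to all crawlers:

User-agent: *
Disallow: /*?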
Disallowing URLs in robots.txt is by far the easiest solution to implement. It is very effective at preventing search engine bots from doing too much crawling and wasting their crawl budget. It has some relatively minor drawbacks:
- Links to pages that are disallowed in robots.txt drop link juice (Pagerank). In theory this could reduce the amount of link juice available to the other pages you link to. In practice, it doesn't seem to hurt your site. I've run tests where I removed huge numbers of links to disallowed content and saw no ranking changes for the remaining content.
- Google may occasionally choose to index a disallowed page if it gets enough external links. When Google does so, it can't show a description from the page in the search results because it is blocked from crawling the page to fetch its content, so it shows a generic message instead. See Why do Google search results include pages disallowed in robots.txt?
Avoid nofollow
nofollow attributes on links are not going to be effective because they don't prevent crawling. Google won't index pages that are only linked with nofollow links, but it does crawl them. See Does a "nofollow" attribute on a link prevent URL discovery by search engines?
Using nofollow does the opposite of what you want: you don't want Googlebot to waste too much effort crawling so many URLs, but you don't mind it indexing those pages once it has crawled them.
In addition, nofollow has the same problem robots.txt has: it drops Pagerank on the floor.
At this point, I don't see any use cases for using nofollow on internal links. You should have your developers remove the nofollows that you had them add.
Avoid noindex,follow
For a long time, we assumed that it was possible to noindex a page but still pass link juice through it, and many guides suggest using <meta name="robots" content="noindex,follow"> to prevent a page from getting indexed. Google says that isn't the case and that pages that aren't indexed won't pass link juice to other pages.
In any case, Googlebot has to crawl a page to see its meta tags. If you relied on meta tags, Googlebot would have to fetch every one of your filter URLs anyway, which wouldn't reduce crawling at all.
Allow limited crawling of filters
There is SEO value in allowing a small number of filter pages to be crawled. Most of the problems come from combinations of filters. Google doesn't need to know that you can filter down to "Blue shirts for men under $20 by brandx". However, you might want Google to find pages for "blue shirts", "men's shirts", and "brandx shirts."
This can be implemented by relaxing the robots.txt rules a little bit. For example, you might disallow crawling only for URLs with more than one parameter:
Disallow: /*?*&
Alternatively, you could rewrite URLs so that the URL for "blue shirts" is /shirts/blue while the URL for the big combination remains /shirts?color=blue&brand=brandx&maxprice=20&department=mens
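With URLs rewritten that way, the blanket wildcard rule from earlier does what you want on its own. As a sketch (assuming single filters get clean paths like /shirts/blue):

User-agent: *
# /shirts/blue contains no "?" so it remains crawlable,
# while /shirts?color=blue&brand=brandx&maxprice=20&department=mens is blocked
Disallow: /*?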
Implement filters without URL parameters
Another way to avoid the problem is to drop URL parameters from your URLs and find other ways to implement the filters. If there are no filter parameters in your URLs, bots aren't going to see too many URLs to crawl.
There are a couple of approaches that can be used to do so, described below. The downside of both is that users can't bookmark their filtered results or send a link to a friend.
POST requests
Search engine bots don't submit POST data to sites, and posted data is hidden from URLs. You can use this to implement filters without creating massive numbers of URL combinations. The basic change is to switch the method of the form with the filter controls from "GET" to "POST".
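As a rough sketch of what that looks like in the markup (the action URL and field names here are placeholders, not taken from your site):

<!-- Hypothetical filter form: the action URL and field names are examples only -->
<form action="/shirts" method="post">
  <label>Color
    <select name="color">
      <option value="">Any</option>
      <option value="blue">Blue</option>
    </select>
  </label>
  <label>Max price
    <input type="number" name="maxprice" min="0">
  </label>
  <button type="submit">Apply filters</button>
</form>

Because the server reads the filter values from the POST body rather than the query string, the URL stays the same no matter which filters are applied, so there is nothing extra for bots to crawl.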
Filter with JavaScript
You can program client-side code to filter products on your site without changing the URL. The JavaScript can either load the entire category product catalog and just show portions of it, or use AJAX calls to your backend to retrieve filtered results.
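As a minimal sketch of the AJAX variant (the /api/products endpoint and the field names are hypothetical, not part of your site):

// Hypothetical endpoint and field names; adjust to match your backend.
// Fetches filtered results and renders them without changing the page URL.
async function applyFilters(filters) {
  const response = await fetch('/api/products', {
    method: 'POST', // keeps filter values out of the URL
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(filters),
  });
  const products = await response.json();
  const list = document.querySelector('#product-list');
  list.innerHTML = products
    .map((p) => `<li>${p.name}: $${p.price}</li>`)
    .join('');
}

// Example usage: applyFilters({ color: 'blue', maxprice: 20 });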