I largely agree with Mike Ciffone's answer:
- Your faceted navigation is generating a very large number of URLs due to combinations of filters.
- Screaming Frog has to crawl so many URLs that it is running out of memory.
- While search engine bots won't run out of memory, they will have trouble crawling your site, possibly exhausting their crawl budget without finding your important pages. You need to do something to limit the amount of crawling they do.
Use robots.txt
The first line of defense is to prevent crawling with robots.txt. Exactly what rule you should use depends on what your URLs look like with navigation filter parameters. If your URLs look like:
/filter?category=shirts&color=blue
You could use:
Disallow: /filter
However, if your filter parameters are appended directly to your category page URLs, like
/shirts?color=blue
then you could use wildcards that disallow crawling of all URLs with parameters:
Disallow: /*?
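Either way, the Disallow rule needs to sit in a group under a User-agent line. As a minimal sketch of a complete robots.txt for the wildcard case, applying to all crawlers:

User-agent: *
Disallow: /*?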
Disallowing URLs in robots.txt is by far the easiest solution to implement. It is very effective at preventing search engine bots from doing too much crawling and wasting their crawl budget. It has some relatively minor drawbacks:
- Links to pages that are disallowed in robots.txt drop link juice (Pagerank). In theory this could reduce the amount of link juice available to the other pages you link to. In practice, it doesn't seem to hurt your site. I've run tests where I removed huge numbers of links to disallowed content and saw no ranking changes for the remaining content.
- Google may occasionally choose to index a disallowed page if it gets enough external links. When Google does so, it can't show a description from the page in the search results because it is blocked from crawling the page to fetch its content, so it shows a generic message instead. See Why do Google search results include pages disallowed in robots.txt?
Avoid nofollow
nofollow attributes on links are not going to be effective because they don't prevent crawling. Google won't index pages that are only linked with nofollow links, but it does crawl them. See Does a "nofollow" attribute on a link prevent URL discovery by search engines?
Using nofollow does the opposite of what you want: you don't want Googlebot to waste too much effort crawling so many URLs, but you don't mind it indexing those pages once it has crawled them.
In addition, nofollow has the same problem robots.txt has: it drops Pagerank on the floor.
At this point, I don't see any use cases for using nofollow on internal links. You should have your developers remove the nofollows that you had them add.
Avoid noindex,follow
For a long time, we assumed that it was possible to noindex a page but still pass link juice through it, and many guides suggest using <meta name="robots" content="noindex,follow"> to prevent a page from getting indexed. Google says that isn't the case and that pages that aren't indexed won't pass link juice to other pages.
In any case, Googlebot has to crawl a page to see its meta tags. If you relied on meta tags, Googlebot would have to fetch every one of your filter URLs anyway, which wouldn't reduce crawling at all.
Allow limited crawling of filters
There is SEO value in allowing a small number of filter pages to be crawled. Most of the problems come from combinations of filters. Google doesn't need to know that you can filter down to "Blue shirts for men under $20 by brandx". However, you might want Google to find pages for "blue shirts", "men's shirts", and "brandx shirts."
This can be implemented by relaxing the robots.txt rules a little bit. For example, you might disallow crawling only for URLs with more than one parameter:
Disallow: /*?*&
Alternatively, you could rewrite URLs so that the URL for "blue shirts" is /shirts/blue while the URL for the big combination remains /shirts?color=blue&brand=brandx&maxprice=20&department=mens
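With URLs rewritten that way, the blanket wildcard rule from earlier does what you want on its own. As a sketch (assuming single filters get clean paths like /shirts/blue):

User-agent: *
# /shirts/blue contains no "?" so it remains crawlable,
# while /shirts?color=blue&brand=brandx&maxprice=20&department=mens is blocked
Disallow: /*?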
Implement filters without URL parameters
Another way to avoid the problem is to drop URL parameters from your URLs and find other ways to implement the filters. If there are no filter parameters in your URLs, bots aren't going to see too many URLs to crawl.
There are a couple of approaches that can be used to do so, described below. The downside of both is that users can't bookmark their filtered results or send a link to a friend.
POST requests
Search engine bots don't submit POST data to sites, and posted data is hidden from URLs. You can use this to implement filters without creating massive numbers of URL combinations. The basic change is to switch the method of the form with the filter controls from "GET" to "POST".
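As a rough sketch of what that looks like in the markup (the action URL and field names here are placeholders, not taken from your site):

<!-- Hypothetical filter form: the action URL and field names are examples only -->
<form action="/shirts" method="post">
  <label>Color
    <select name="color">
      <option value="">Any</option>
      <option value="blue">Blue</option>
    </select>
  </label>
  <label>Max price
    <input type="number" name="maxprice" min="0">
  </label>
  <button type="submit">Apply filters</button>
</form>

Because the server reads the filter values from the POST body rather than the query string, the URL stays the same no matter which filters are applied, so there is nothing extra for bots to crawl.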
Filter with JavaScript
You can program client-side code to filter products on your site without changing the URL. The JavaScript can either load the entire category product catalog and just show portions of it, or use AJAX calls to your backend to retrieve filtered results.
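As a minimal sketch of the AJAX variant (the /api/products endpoint and the field names are hypothetical, not part of your site):

// Hypothetical endpoint and field names; adjust to match your backend.
// Fetches filtered results and renders them without changing the page URL.
async function applyFilters(filters) {
  const response = await fetch('/api/products', {
    method: 'POST', // keeps filter values out of the URL
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(filters),
  });
  const products = await response.json();
  const list = document.querySelector('#product-list');
  list.innerHTML = products
    .map((p) => `<li>${p.name}: $${p.price}</li>`)
    .join('');
}

// Example usage: applyFilters({ color: 'blue', maxprice: 20 });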