A common black-hat SEO technique is to generate many, many pages designed to seem relevant to search engines but actually full of machine-generated crud. The idea is for the page to "look good to Google" and hit #1 in search results, and then guide the user to some other sales page to try to sell them something.
There is a huge arms race in search, between SEOs using these black-hat techniques and search engineers trying to stop those tricks from working and punish the sites that use them.
Google leans heavily on algorithmic methods to find these sites, analyzing them on a site-by-site basis. That means less human intervention than competitors... the algorithm doesn't overlook sites, but it's also more prone to false positives.
For instance, let's say you have an internal site search, which creates a URL like
http://www.example.com/search.cgi/search+term+here
Every time someone does a search, it creates a URL like this. These URLs wind up in your web log analysis pages, which Google can see for some reason. So there are 500,000 URLs like the one above, simply from 500,000 different searches people have done.
Every one of these pages has the internal search results from within your site, with the search term repeated many times in the various snippets and titles.
Compare this to the black-hat SEO technique: these pages are machine generated, stuffed with the main keyword, and have no original content. They look exactly like black-hat doorway pages! A human curator can see them for what they are, but an algorithm can't.
So you want to exclude that "directory" to avoid being mistaken for a black-hat spammer.
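For example, here is a minimal robots.txt sketch, assuming the internal search runs through /search.cgi/ as in the URL above (that path is just the example from this answer; adjust it to match your own site):

User-agent: *
Disallow: /search.cgi/

This only stops compliant crawlers from fetching those URLs; the comments below cover noindex as the way to also keep them out of the index.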
X-Robots-Tag in addition to the 4 listed methods. Isn't another thing not to "use too much crawl power"? I think I read somewhere (citation needed) that if you have, say, 100,000 URLs, Google will somewhat rate limit how many of those are crawled and how often. So excluding some could mean others are crawled more often. – kero Sep 07 '21 at 07:57

The X-Robots-Tag: noindex HTTP header is equivalent to the <meta name="robots" content="noindex"> tag. I edited the answer to say that either would prevent indexing. – Stephen Ostermiller Sep 07 '21 at 09:25

robots.txt for infinite URL spaces. In general, Googlebot is willing to do a lot of crawling. Googlebot is usually willing to crawl 10 or 100 URLs for every real page. Your actual crawl budget is determined by PageRank. The more links into your site, the more Googlebot is willing to crawl. – Stephen Ostermiller Sep 07 '21 at 09:25
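For reference, the X-Robots-Tag: noindex header mentioned in these comments can be sent by the web server. A minimal sketch, assuming Apache with mod_headers enabled and the search.cgi script from the example above (both are assumptions; adapt to your own server and search URL):

<Files "search.cgi">
    Header set X-Robots-Tag "noindex"
</Files>

As noted in the comments, this is equivalent to putting <meta name="robots" content="noindex"> in the HTML of each search results page.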