A common black-hat SEO technique is to generate many, many pages designed to seem relevant to search engines but actually full of machine-generated crud. The idea is for the page to "look good to Google" and hit #1 in search results, and then guide the user to some other sales page to try to sell them something.
There is a huge arms race in search, between SEOs using these black-hat techniques and search engineers trying to stop those tricks from working and punish the sites that use them.
Google leans heavily on algorithmic methods to find these sites, analyzing them on a site-by-site basis. That means less human intervention than competitors... the algorithm doesn't overlook sites, but it's also more prone to false positives.
For instance, let's say you have an internal site search, which creates a URL like
http://www.example.com/search.cgi/search+term+here
Every time someone does a search, it creates a URL like this. These URLs wind up in your web log analysis pages, which Google can see for some reason. So there are 500,000 URLs like the one above, simply from 500,000 different searches people have done.
Every one of these pages has the internal search results from within your site, with the search term repeated many times in the various snippets and titles.
Compare this to the black-hat SEO technique: these pages are machine generated, stuffed with the main keyword, and have no original content. They look exactly like black-hat doorway pages! A human curator can see them for what they are, but an algorithm can't.
So you want to exclude that "directory" to avoid being mistaken for a black-hat spammer.
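For example, here is a minimal robots.txt sketch, assuming the internal search runs through /search.cgi/ as in the URL above (that path is just the example from this answer; adjust it to match your own site):

User-agent: *
Disallow: /search.cgi/

This only stops compliant crawlers from fetching those URLs; the comments below cover noindex as the way to also keep them out of the index.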
X-Robots-Tag in addition to the 4 listed methods. Isn't another thing not to "use too much crawl power"? I think I read somewhere (citation needed) that if you have, say, 100,000 URLs, Google will somewhat rate limit how many of those are crawled and how often. So excluding some could mean others are crawled more often. – kero Sep 07 '21 at 07:57

The X-Robots-Tag: noindex HTTP header is equivalent to the <meta name="robots" content="noindex"> tag. I edited the answer to say that either would prevent indexing. – Stephen Ostermiller Sep 07 '21 at 09:25

robots.txt for infinite URL spaces. In general, Googlebot is willing to do a lot of crawling. Googlebot is usually willing to crawl 10 or 100 URLs for every real page. Your actual crawl budget is determined by PageRank. The more links into your site, the more Googlebot is willing to crawl. – Stephen Ostermiller Sep 07 '21 at 09:25
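For reference, the X-Robots-Tag: noindex header mentioned in these comments can be sent by the web server. A minimal sketch, assuming Apache with mod_headers enabled and the search.cgi script from the example above (both are assumptions; adapt to your own server and search URL):

<Files "search.cgi">
    Header set X-Robots-Tag "noindex"
</Files>

As noted in the comments, this is equivalent to putting <meta name="robots" content="noindex"> in the HTML of each search results page.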