User-agent: *
Disallow: /robots.txt
What happens if you do this? Will search engines crawl robots.txt once and then never crawl it again?
Robots.txt directives don't apply to robots.txt itself. Crawlers may fetch robots.txt even if it disallows itself.
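To see why, consider how a polite crawler is structured. Here is a minimal sketch using Python's standard urllib.robotparser (the hostname is a placeholder, and real crawlers are far more elaborate): the crawler has to fetch robots.txt before it can evaluate any of the rules inside it, so no rule in the file can block that fetch.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches robots.txt unconditionally -- no rules exist yet

# Only subsequent URLs are checked against the parsed rules:
if rp.can_fetch("MyBot", "https://example.com/some-page"):
    pass  # safe to crawl the page
```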
It is actually very common for robots.txt to disallow itself. Many websites disallow everything:
User-Agent: *
Disallow: /
That directive to disallow everything includes robots.txt itself. I run some websites like this myself. Despite the rules disallowing everything, including robots.txt, search engine bots still refresh the robots.txt file periodically.
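As an illustration (again a sketch with Python's urllib.robotparser and a placeholder hostname, not a description of how any particular search engine is implemented), the rules nominally forbid fetching robots.txt itself, but a crawler has to download the file to learn that, so the fetch happens anyway:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse the "disallow everything" file directly:
rp.parse(["User-Agent: *", "Disallow: /"])

# The rules nominally cover robots.txt itself...
print(rp.can_fetch("MyBot", "https://example.com/robots.txt"))  # False

# ...but a crawler must download robots.txt before it can know that,
# so the file keeps getting re-fetched on a schedule regardless.
```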
Google's John Mueller recently confirmed that Googlebot still crawls a disallowed robots.txt: "Disallowing Robots.txt In Robots.txt Doesn't Impact How Google Processes It." So even if you specifically call out Disallow: /robots.txt, Google (and I suspect other search engines) won't change its behavior.
Before we had noindex HTTP headers, we would list court docs in our robots.txt file. The result was that those same cases could wind up in Google by way of robots.txt itself showing up in Google's index. In other words, robots.txt became a searchable index of everything we were trying to (lightly) hide.
– mlissner, Oct 26 '22 at 00:21