Google CSE has indexed robots.txt and now if someone searches for 'txt' it returns the robots.txt file which is really not ideal (as this is a bog-standard Drupal site, the string robots.txt actually appears in the text). How can I avoid this? Is there a setting somewhere in Google or should I add /robots.txt to erm, robots.txt or...?
1 Answer
You could add this to robots.txt:
Disallow: /robots.txt
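For context, the line would simply sit alongside your existing rules. A minimal sketch (the Drupal paths here are illustrative; your actual robots.txt will differ):

```
User-agent: *
# Existing Drupal rules (illustrative)
Disallow: /admin/
Disallow: /user/
# Keep the robots.txt file itself out of the index
Disallow: /robots.txt
```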
In What if robots.txt disallows itself? Google's John Mueller says:
The only thing this would affect is if a link were pointing to the robots.txt and Google would otherwise index the content of the robots.txt file. That wouldn't be possible when it's disallowed by robots.txt.
So it seems that adding a disallow rule for robots.txt itself can help prevent robots.txt from getting indexed, without preventing Googlebot from fetching the file to see what else is disallowed.
Another way to handle it would be to add an HTTP header to robots.txt that prevents indexing. This would be a similar solution to the problem Prevent XML sitemaps from showing up in Google search results. You would want the following HTTP header served for robots.txt:
X-Robots-Tag: noindex
Under Apache you would implement it with this .htaccess code:
<FilesMatch "robots\.txt$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
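If the site runs behind Nginx instead of Apache, a roughly equivalent sketch (untested, adapt to your config) would be:

```nginx
# Serve X-Robots-Tag: noindex only for robots.txt
location = /robots.txt {
    add_header X-Robots-Tag "noindex";
}
```

Note that in Nginx, `add_header` directives in a `location` block replace any inherited from the `server` level, so repeat any headers you still need inside this block.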
`Header` is not always available; it's from `mod_headers`, which for example Ubuntu doesn't load by default. – chx Sep 06 '18 at 13:02

`sudo a2enmod mod_headers` – Stephen Ostermiller Sep 06 '18 at 13:24