I have a sub-directory that I would like to hide from web crawlers.
One way to do this is to use a robots.txt in the root directory of the server and add a rule that disallows that sub-directory. However, anyone with basic web knowledge can read the robots.txt contents and figure out which directories are disallowed.
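For context, the straightforward root-level rule I mean would look something like this (a sketch, with `/X/` standing in for the real directory name):

```
User-agent: *
Disallow: /X/
```

This works, but since the file is public, the rule itself advertises the directory.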
I thought of a way to avoid this, but I am not sure if it will work.
Let X be the name of the sub-directory that I want to exclude. One way to stop web crawlers from indexing the X directory, while making it harder for someone to identify it from the root robots.txt, is to add a robots.txt inside the X directory itself.
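Concretely, the layout I have in mind would be something like this (paths are illustrative, assuming a typical document root):

```
/var/www/html/robots.txt      <- existing root robots.txt (publicly readable)
/var/www/html/X/robots.txt    <- additional robots.txt inside the X directory
```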
My questions are the following:
- Will the web crawlers find the `robots.txt` in the sub-directory, given that a `robots.txt` already exists in the root directory as well? (See the sketch after this list.)
- If the `robots.txt` is in the `X` sub-directory, should I use relative or absolute paths? That is,

  ```
  User-agent: *
  Disallow: /X/
  ```

  or

  ```
  User-agent: *
  Disallow: /
  ```
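To make the first question concrete, this is how the root file could be checked with a standards-compliant parser (a minimal sketch using Python's built-in `urllib.robotparser`; `example.com` and `/X/` are placeholders for my real host and sub-directory):

```python
import urllib.robotparser

# Parse the ROOT robots.txt. Whether crawlers also read a robots.txt
# placed inside the /X/ sub-directory is exactly what I am asking.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# True/False: may a generic crawler fetch a page under /X/?
print(rp.can_fetch("*", "https://example.com/X/page.html"))
```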