From what I understand after reading Google's Controlling Crawling and Indexing:
- The purpose of a robots.txt file is to disallow crawling of some URLs, but those URLs can still be indexed (and appear in search results) if they are linked from crawlable pages
- To prevent a page from being indexed, I need to leave it crawlable (i.e. not blocked by robots.txt) and add a noindex meta tag in its head
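To make the distinction concrete, here is a minimal sketch of the two mechanisms (the `/private/` path is just an illustrative example):

```
# robots.txt — blocks crawling only; the URL can still be indexed
# if other pages link to it
User-agent: *
Disallow: /private/
```

```html
<!-- In the <head> of a page that should stay out of search results;
     the page must remain crawlable so the bot can see this tag -->
<meta name="robots" content="noindex">
```

Note that combining the two defeats the purpose: if robots.txt blocks the page, the crawler never fetches it and never sees the noindex tag.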
So, why would I set up a robots.txt file if, in the end, it has no impact on whether pages appear in search results?
Disallow directive in robots.txt. To prevent indexation, you'd have to use `<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">` in the `<head>` of the necessary pages. – zigojacko Mar 11 '14 at 08:30
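For completeness: the same noindex signal can also be sent as an HTTP response header (`X-Robots-Tag`), which is useful for non-HTML resources like PDFs that have no `<head>`. A sketch for an Apache server with mod_headers enabled (the `.pdf` match is just an example):

```apache
# Send noindex for all PDF files; requires mod_headers
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```

As with the meta tag, this only works if the URL is crawlable, so the bot can fetch the response and read the header.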