Questions tagged [robots.txt]

Robots.txt is a text file used by website owners to give instructions about their site to web robots: it tells robots which parts of the site are open to crawling and which are off-limits. This convention is called the Robots Exclusion Protocol.

743 questions
31 votes, 6 answers

If I don't want to set any special behavior, is it OK if I don't bother to have a robots.txt file?

If I don't want to set any special behavior, is it OK if I don't bother to have a robots.txt file? Or can the lack of one be harmful?
Dan Dumitru
22 votes, 3 answers

What is a minimum valid robots.txt file?

I don't like seeing a lot of 404 errors in the access.log of my web server. I'm getting those errors because crawlers request a robots.txt file that doesn't exist. So I want to place a simple robots.txt file that will prevent the 404…
bessarabov
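A commonly suggested minimal form is an allow-everything file; served with a 200 status it stops the 404s without restricting any crawler:

```
User-agent: *
Disallow:
```

An empty Disallow value means nothing is disallowed, so this is equivalent to having no restrictions at all.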
12 votes, 3 answers

Robots.txt: do I need to disallow a page which is not linked anywhere?

There are some pages on my website that I want users to be able to visit only if I give them the URL. If I disallow the individual pages in robots.txt, they will be visible to anybody who looks at the file. My question is: if I don't link them from…
martjno
10 votes, 2 answers

Allow a folder and disallow all sub folders in robots.txt

I would like to allow the folder /news/ and disallow all the subfolders under /news/, e.g. /news/abc/ and /news/123/. How can I do that? I think Disallow: /news/ will block everything in it, including /news/ itself. Will Disallow: /news/*/ do the…
Stickers
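One sketch that is often suggested for this case relies on the Allow directive and the $ end-of-URL anchor, both extensions honored by major crawlers such as Googlebot but not guaranteed by every robot:

```
User-agent: *
Allow: /news/$
Disallow: /news/
```

For crawlers that support both extensions, this permits exactly /news/ itself while blocking every path beneath it.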
8 votes, 3 answers

What's the proper way to handle Allow and Disallow in robots.txt?

I run a fairly large-scale Web crawler. We try very hard to operate the crawler within accepted community standards, and that includes respecting robots.txt. We get very few complaints about the crawler, but when we do the majority are about our…
Jim Mischel
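The original 1994 convention says the first matching rule wins, whereas Google and Bing document a longest-match rule (the rule with the most specific path wins, with Allow winning ties), and RFC 9309 standardizes the longest-match behavior. A file like the following therefore behaves differently across crawlers:

```
User-agent: *
Disallow: /folder/
Allow: /folder/page.html
```

Under longest-match, /folder/page.html is crawlable because the Allow rule has the longer path; a strict first-match parser would block it because the Disallow line appears first.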
8 votes, 1 answer

Does the line-ending format of robots.txt matter?

Simple question: Should I make sure to use Unix line endings for my robots.txt, or does it not matter?
James Sulak
8 votes, 2 answers

How do you disallow root in robots.txt, but allow a subdirectory?

Using robots.txt, how do you disallow the root of a site (http://www.example.com/) but allow a subdirectory (http://www.example.com/lessons/)?
David Smith
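A sketch of the usual answer, with the caveat that Allow is not part of the original specification and only works with crawlers that implement it:

```
User-agent: *
Allow: /lessons/
Disallow: /
```

Under the longest-match rule used by Google and Bing the order of the two lines does not matter; putting Allow first also keeps older first-match parsers happy.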
7 votes, 1 answer

What if robots.txt disallows itself?

User-agent: *
Disallow: /robots.txt

What happens if you do this? Will search engines crawl robots.txt once and then never crawl it again?
clickbait
5 votes, 2 answers

Is there any reason for putting up humans.txt other than acknowledgement?

Are there any valid reasons for putting up a humans.txt file? The only reason I see so far is to give credit to the team that created the site and the open-source libraries it uses.
Salvador Dali
5 votes, 1 answer

Do we need to block repeated page content for SEO relevance?

I have multiple purchase pages with the same content, like product1red.php and product2green.php. Should I block them with robots.txt?
user32057
5 votes, 2 answers

What is the correct way to write my "robots.txt" file?

I have written the following code inside my robots.txt file:

User-Agent: Googlebot
Disallow:

User-agent: Mediapartners-Google
Disallow:

Sitemap: http://example.com/sitemap.xml

Is my robots.txt correct? I only want two user agent…
ashutosh
5 votes, 1 answer

wget not respecting my robots.txt. Is there an interceptor?

I have a website where I post csv files as a free service. Recently I have noticed that wget and libwww have been scraping pretty hard and I was wondering how to circumvent that even if only a little. I have implemented a robots.txt policy. I posted…
Jane Wilkie
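Since robots.txt is purely advisory, clients that ignore it have to be refused at the web server itself. A hypothetical Apache .htaccess sketch (this assumes mod_rewrite is enabled, and the user-agent patterns are illustrative):

```
RewriteEngine On
# Return 403 Forbidden to clients identifying as wget or libwww
RewriteCond %{HTTP_USER_AGENT} (wget|libwww) [NC]
RewriteRule .* - [F,L]
```

Because the User-Agent header is trivial to change, rate limiting or authentication are sturdier options for persistent scrapers.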
4 votes, 2 answers

I don't want my site to be analyzed on WooRank or builtwith.com

I don't want my site to be analyzed on WooRank or builtwith.com. Is there any way I can do that by editing the robots.txt file or any other possible way?
Krill
4 votes, 3 answers

How to disallow robots from the first 185 pages?

I have a website where the first 185 pages are sample profiles for demonstration purposes: http://example.com/profile/1 ... http://example.com/profile/185 I want to block these pages from Google as they are somewhat similar in content to…
Question Overflow
4 votes, 2 answers

robots.txt - just a guess about wild-card

If I disallow tempPage, does that mean tempPage_1, temp_Page_2, and tempPage_x are also disallowed? I tried to Google this, but couldn't find an answer…
TPR
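A Disallow value is matched as a path prefix, so no wildcard is needed for this case. A minimal sketch:

```
User-agent: *
Disallow: /tempPage
```

This blocks every URL whose path begins with /tempPage, including /tempPage_1 and /tempPage_x; note that it would not match /temp_Page_2, because that path diverges from the prefix.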