
Is it possible to get Google to stop crawling certain pages without using robots.txt?

My understanding is that robots.txt can block Google from accessing a page, but that with noindex (meta tag or header) Google may still crawl the page.

As I have millions of pages, I want to block Google from ever accessing certain pages, to focus its indexing on my more important pages.


Background:

My site had over 11 million pages in English. Many of these contained statistical information; for comparison, consider a site listing demographic information by postal code/ZIP. Due to requests, I translated one type of page, which accounts for most of the pages, into several languages. This gives me a total of about 100 million pages. Around 250,000 of them have paragraph text on them - the rest have just statistical content.

According to Google Webmaster Tools, Google has indexed around 30 million pages. Many of the pages may be indexed in German or French, but not English. Since English is my main language, I instead want to block all the non-English pages from Google except the non-English pages that have paragraph text, which I think will leave Google free to largely index my new update of close to 30 million pages in English.

I believe the only way to do this would be to structure the site something like this:

domain.tld/en/ <- index everything in here
domain.tld/de/ <- index everything in here
domain.tld/dex/ <- block with robots.txt

So my pages with German paragraph text would be available via URLs beginning /de/, and those without paragraph text via /dex/.
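
If I went with that structure, I believe the robots.txt itself would only need a single rule for the blocked folder (a sketch, assuming the layout above):

  User-agent: *
  Disallow: /dex/

Everything under /en/ and /de/ would then remain crawlable by default.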

Kohjah Breese

2 Answers


It's true that robots.txt directives can be ignored by search engines in certain situations.

Knowing this, in addition to the noindex tag I also return a 410 status (GONE) instead of 200 (OK). That prevents the URL from being indexed even when it has links from external sites.

To sum up the needed actions:

  1. NoIndex or robots.txt disallow
  2. 410 status
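
A minimal sketch of how I combine the noindex signal and the 410 status with the PHP header functions (the HTML below just stands in for your normal page content):

  <?php
  // Send the 410 status plus a noindex header before any output;
  // browsers still display the body of a 410 response, so visitors
  // see the page as usual while crawlers get a clear "gone" signal.
  http_response_code(410);           // permanently gone, for crawlers
  header('X-Robots-Tag: noindex');   // extra noindex signal
  ?>
  <html>
    <body>
      <!-- the regular page content is rendered here as usual -->
    </body>
  </html>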

In case you need to keep some language subfolder indexed for a period of time before deindexing it, you can use the unavailable_after tag.
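
For example, as a robots meta tag (the date here is purely illustrative):

  <meta name="robots" content="unavailable_after: 25 Jun 2019 15:00:00 PST">

Google should drop the page from its results after that date.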

Emirodgar
  • You can't return a 410 to Google when you would return 200 to someone else. That's called cloaking and you get very badly penalized for that. – Alexis Wilke Aug 31 '18 at 19:56
  • 1
    How do you do this practically: load an url with original content, like with 200, but throw 410? Nice idea in theory. I mean, you should throw 410 to all, users AND search engines, correct? – Evgeniy Aug 31 '18 at 20:24
  • @AlexisWilke actually, cloaking is the opposite: you treat search engines and users differently in order to rank for something you are not showing to users. In this case, you want your page out of the SERPs. Check John Mueller's opinion about 410: https://twitter.com/JohnMu/status/768352302671495168 – Emirodgar Sep 03 '18 at 07:01
  • 1
    @Evgeniy correct, for everyone is a 410 status. If your are interested, check this test about deindexing methods (it's in spanish, but you can easily translate it into english) https://www.mjcachon.com/blog/test-desindexar-urls/ – Emirodgar Sep 03 '18 at 07:06
  • @Emirodgar yes, I understand that you use 410 as a deindexing method. I mean, how do you do this in terms of setup: showing a content page, but with an error status code? Doesn't the server normally trigger the display of an error page when an error status code is returned? – Evgeniy Sep 03 '18 at 10:04
  • The user sees a normal page with the content he is waiting for, but internally the server is sending a 410 status (it doesn't affect the visitor). Check an example here: https://i.imgur.com/pTQOdN7.png I use the PHP header function: http://php.net/manual/es/function.header.php. – Emirodgar Sep 03 '18 at 10:12
  • @Emirodgar I admit, very smart :) Did you test how noindex+404 suits temporary noindexing, like for Christmas or Easter products? How much faster is it in comparison to just noindex? – Evgeniy Sep 03 '18 at 10:24
  • Thanks :) I always used it for permanent noindexing (not temporary, sorry). There is no easy answer because each case is different, but I'd say noindex+410 is about twice as fast as noindex only. – Emirodgar Sep 03 '18 at 10:42
  • @Emirodgar do you know, if I add the 410 not with a PHP header but with an .htaccess RewriteRule and something like [G] or [R=410], would I get the same effect? Or would the error page be triggered? – Evgeniy Sep 03 '18 at 15:34
  • I have to say I did not understand your answer at first. What you are saying makes sense now. However, I would view it as not clean. Also, it does not help the OP, since Google still has to read the page and its content to know that it's a 410 Gone, and thus you still use the exact same bandwidth. I have some 410s and Google still checks those pages regularly (albeit less than a 200 that becomes a 404...) – Alexis Wilke Sep 04 '18 at 16:15
  • @Evgeniy sorry, I never did it the other way, so no idea how it would be interpreted by search engines or whether it would trigger the error page. – Emirodgar Sep 05 '18 at 06:38
  • @Alexis Wilke based on my experience (+1M indexed URLs), 410 helped Google to focus on the relevant URLs (200). Maybe they still visit a 410 (even when it's marked as permanently gone) but they don't drop the rest. Check this: https://webmasters.stackexchange.com/questions/25609/does-it-make-sense-to-return-a-410-instead-of-404-when-some-page-has-been-perman – Emirodgar Sep 05 '18 at 06:46
  • @Emirodgar did you check the access logs after setting pages to 410? It would be interesting to know whether the bot comes back. – Evgeniy Sep 05 '18 at 08:12
  • No, I didn't. I just assumed Googlebot ignored them as the number of indexed relevant pages increased. Next time I'll follow up on Googlebot requests to 410 pages. – Emirodgar Sep 05 '18 at 08:54
  • @Evgeniy I found some interesting information on how Googlebot handles the 410 status: https://www.youtube.com/watch?time_continue=1723&v=kQIyk-2-wRg – Emirodgar Oct 16 '18 at 12:59
  • @Emirodgar thank you! The main benefit I see in using 410 is that, in that case, Google doesn't come back to check whether the URL is available again - it definitively kicks the URL out of the index forever. But the higher deindexing speed is a benefit too - JohnMu is well known for his slight understatements :) – Evgeniy Oct 16 '18 at 14:00

First of all, I don't think that trying to force Google into anything is a good idea. It may temporarily work and then fail badly at some point.

I would use the locale to return the page content in the right language. Google has indexed locale-adaptive pages since January 2015.

What this means is simple: the browser sends you a header telling you which languages the user prefers. Using that information, you return the page in English, German, French... at the exact same URL. So you have nothing to block or worry about, because Google is capable of requesting all the versions (Google Deutschland will request de as its primary language, then probably fall back to en).

Of course, that requires your system to support such a feature.
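
As a rough sketch of what that support could look like, the server picks the best match from the Accept-Language header and returns that translation at the same URL (simplified: it takes the first supported language in header order and ignores the q weights):

  <?php
  // Choose a supported language from Accept-Language,
  // e.g. "de-DE,de;q=0.9,en;q=0.7" -> "de".
  $supported = array('en', 'de', 'fr');
  $lang = 'en'; // fallback
  $accept = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
  foreach (explode(',', $accept) as $part) {
      $code = strtolower(substr(trim($part), 0, 2)); // keep the primary subtag only
      if (in_array($code, $supported, true)) {
          $lang = $code;
          break;
      }
  }
  header('Content-Language: ' . $lang);
  header('Vary: Accept-Language'); // tell caches and crawlers the response varies by language
  // ...now render and output the page in $lang...

Sending Vary: Accept-Language and Content-Language is how you signal that the same URL legitimately returns different language versions.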

Now, I've seen many websites trying to present languages using a path (as you've shown, http://example.com/fr) or a sub-domain (http://fr.example.com/), but in reality those solutions are wrong, since you really are serving the exact same page with content tweaked for the language... so there should be no need to create a completely separate page. Instead, just make sure you serve the correct language and you should be just fine after that. Google may still end up adding more of your other language pages to their index. I'm not too sure why that would happen, but it should not change the number of pages indexed in English nor the number of hits you currently get.

Alexis Wilke
  • Thanks. I wasn't aware of this. However, the issue is that my site is too big for Google to consider indexing all of this largely statistical content. From the server logs I would guess Google crawls 300,000 pages per day, or 9 million per month. The site is now around 210 million pages. So at the current rate it would take Google 24 months to index them all. As a result they are dropping pages; and they are dropping pages in English, which is the main language I want traffic for.

    I am looking for a solution to stop them crawling certain pages, but I think the one I ...

    – Kohjah Breese Sep 04 '18 at 23:26
  • ... listed is the only way to stop the crawling of pages, rather than just the indexing. But as that is quite a considerable change to various aspects, I've just used noindex for now and will probably block 2 or 3 languages completely. Later I'll move to the solution proposed in the OP. – Kohjah Breese Sep 04 '18 at 23:27
  • The only two ways I know of for preventing Google from accessing the data are the robots.txt approach you mentioned (which requires the use of different paths in order to block languages, as you mentioned) and making pages only accessible to registered users. So that could be a different way of doing things, since you do not want those pages indexed by Google anyway... and make it very easy to create an account so your users don't hesitate to do it. – Alexis Wilke Sep 05 '18 at 21:27
  • Of course, to be useful, the pages that require a login need to NOT be linked from pages that don't have a login. (or at least make sure to include the rel="nofollow" which Google respects.) That way Google would definitely not attempt to read those pages. – Alexis Wilke Sep 05 '18 at 21:29