
This question is lengthy, so let me give a one-sentence summary: I need Google to de-index URLs that have parameters with certain values appended.

I have a website example.com with language translations.

There used to be many translations but I deleted them all so that only English (Default) and French options remain.

When one selects a language option, a parameter is added to the URL. For example, the home page:

https://example.com (default)
https://example.com/main?l=fr_FR (French)

I added a robots.txt to stop Google from crawling any of the language translations:

# robots.txt generated at http://www.mcanerin.com
User-agent: *
Disallow: 
Disallow: /cgi-bin/
Disallow: /*?l=

So any page containing "?l=" should not be crawled. I checked in Google Webmaster Tools (GWT) using the robots.txt testing tool, and it works.
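To illustrate how that wildcard rule matches, here is a rough Python sketch approximating Google's documented * semantics (the helper google_rule_matches is my own invention, and the URLs are the placeholders from this question, so treat it as an approximation rather than Google's actual matcher):

import re
from urllib.parse import urlsplit

def google_rule_matches(rule: str, url: str) -> bool:
    # Approximate Google's robots.txt matching: the rule is anchored at the
    # start of the path, "*" matches any run of characters, and a trailing
    # "$" anchors the end of the URL.
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, target) is not None

print(google_rule_matches("/*?l=", "https://example.com/main?l=fr_FR"))      # True: blocked
print(google_rule_matches("/*?l=", "https://example.com/reports/view/884")) # False: crawlable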

But under HTML Improvements, the previously crawled language-translation URLs remain indexed. The internet says to return a 404 status for the removed URLs so Google knows to de-index them.

I checked to see what my CMS would throw up if I visited one of the URLs that should no longer exist.

This URL was listed in GWT under duplicate title tags (one of the reasons I want to clean up my URLs):

https://example.com/reports/view/884?l=vi_VN&l=hy_AM

This URL should not exist - I removed the language translations. The page loads when it should not! I played around and typed example.com?whatever123

It seems that parameters always load as long as everything before the question mark is a real URL.

So if Google has indexed all these URLs with parameters, how do I remove them? I cannot tell whether a 404 is being generated, because the page always loads; it is the parameterized URL itself that needs to be de-indexed.
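(For anyone reproducing this: the status code can be checked from a script rather than a browser. A minimal sketch using only Python's standard library, with the placeholder URL from above; status_of is a made-up helper name:)

import urllib.error
import urllib.request

def status_of(url: str) -> int:
    # Return the HTTP status code, including error codes.
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 404, 410, etc. arrive as HTTPError

# A 200 here confirms the parameterized URL still "loads" rather than 404ing.
print(status_of("https://example.com/reports/view/884?l=vi_VN&l=hy_AM"))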


3 Answers


The robots.txt below lets search engines index every permalink except those containing a ?, =, or & symbol:

User-agent: *
Disallow: 
Disallow: /cgi-bin/
Disallow: /*?*
Disallow: /*?
Disallow: /*=*
Disallow: /*=
Disallow: /*&*
Disallow: /*&
Allow: /

The following blocks every permalink of the form /main?l=de_DE or /main?l=Any_Value_Here from search engines, while excluding /main?l=fr_FR from the block:

User-agent: *
Disallow: 
Disallow: /cgi-bin/
Disallow: /main?l=*
Allow: /
Allow: /main?l=fr_FR
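When Allow and Disallow rules overlap like this, Google applies the most specific (longest) matching rule, and Allow wins ties. A self-contained Python sketch of that precedence, using the rules above (the matches and allowed helpers are my own approximation, not Google's implementation):

import re

RULES = [
    ("Disallow", "/cgi-bin/"),
    ("Disallow", "/main?l=*"),
    ("Allow", "/"),
    ("Allow", "/main?l=fr_FR"),
]  # the empty "Disallow:" matches nothing, so it is omitted here

def matches(pattern: str, target: str) -> bool:
    # "*" matches any run of characters; the pattern is anchored at the start.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, target) is not None

def allowed(target: str) -> bool:
    applicable = [(len(p), verb) for verb, p in RULES if matches(p, target)]
    if not applicable:
        return True
    # Longest (most specific) rule wins; Allow beats Disallow on a tie.
    applicable.sort(key=lambda t: (t[0], t[1] == "Allow"))
    return applicable[-1][1] == "Allow"

print(allowed("/main?l=fr_FR"))  # True: the explicit Allow is the longest match
print(allowed("/main?l=de_DE"))  # False: caught by Disallow: /main?l=*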
  • That would deindex all parameters, not just a value of de_DE, while allowing fr_FR to be crawled and indexed. – Stephen Ostermiller Oct 24 '13 at 11:59
  • @Stephen Ostermiller - I had to edit my answer. – Kamarul Anuar Oct 24 '13 at 17:02
  • Thanks for the helpful information. But, due to the navigation of the unique platform that my website is built upon, I would not like to exclude all links that contain ?, =, or &, since some of them are paths to legitimate pages. It's only those that contain "?l=" that should be de-indexed. – Doug Fir Oct 25 '13 at 15:30

I would rewrite your URLs so that the language is a directory:

/main
/fr_FR/main

That has two advantages:

  • Robots.txt can be used to block certain languages but not others (without resorting to wildcards)
  • You can add directories to Google Webmaster Tools and change their geographic targeting (in this case, /fr_FR/ should be geo-targeted to France)

See: How should I structure my URLs for both SEO and localization?

If you can't do that, you could noindex content based on parameter values. The easiest option is a meta robots noindex tag, output only when the parameter has a value that you don't want Google to include in the index.
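A minimal sketch of that idea, assuming a Flask-style app (the /main route, the ALLOWED_LANGS set, and the page body are placeholders, not the asker's actual CMS):

from flask import Flask, request

app = Flask(__name__)
ALLOWED_LANGS = {"fr_FR"}  # English is the default and carries no parameter

@app.route("/main")
def main():
    lang = request.args.get("l")
    # Emit noindex only when the l= parameter names a removed translation.
    robots = ""
    if lang is not None and lang not in ALLOWED_LANGS:
        robots = '<meta name="robots" content="noindex">'
    return f"<html><head>{robots}</head><body>Home page</body></html>"

Keep in mind that Google must be able to crawl the page to see the tag, so this approach conflicts with a robots.txt Disallow on the same URLs.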

– Stephen Ostermiller
  • Thanks for the info Stephen. I would, except that the site is built on a platform and I have limited ability to edit it on my own. I was hoping there would be something I could do within GWT or robots.txt. – Doug Fir Oct 25 '13 at 15:33

You can do several things:

  1. robots.txt - although take good care what you add in there, as it can ALSO deindex some correct URLs (i.e. if Google did index https://example.com/reports/view/884?l=fr_FR, you don't just want to lose it, right?).

Which brings me to the second:

  2. Since you want to "deindex" old URLs, why not 301 them to "correct" ones, or return a hard 404 or 410 on the truly incorrect ones (see the sketch after this list).

  3. Use rel="canonical" so you tell Google what the correct page is (although it is "other" content).

  4. Use Google Webmaster Tools and specify what a parameter is used for. For the l= parameter you could specify "Translates".
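For options 2 and 3, a rough Flask-style sketch (REMOVED_LANGS and the route are hypothetical placeholders, not the asker's platform):

from flask import Flask, abort, redirect, request

app = Flask(__name__)
REMOVED_LANGS = {"vi_VN", "hy_AM", "de_DE"}  # hypothetical: the deleted translations

@app.route("/reports/view/<int:report_id>")
def report(report_id: int):
    lang = request.args.get("l")
    if lang in REMOVED_LANGS:
        # Option 2: a hard 410 tells Google the translation is gone for good...
        abort(410)
        # ...or a 301 to the canonical URL would also de-index the variant:
        # return redirect(f"/reports/view/{report_id}", code=301)
    # Option 3: the rendered page would carry
    # <link rel="canonical" href="https://example.com/reports/view/884">
    return f"Report {report_id}"

This keeps https://example.com/reports/view/884 itself returning 200, while the ?l= variants for removed languages return 410.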

– Ronald Swets
  • I thought that my robots.txt should work? The old URLs do not exist any more, but they still load, since it's a parameter and the link up to the question mark still works. How would I add a 404 or 410 to, for example, https://example.com/reports/view/884?l=vi_VN&l=hy_AM without affecting https://example.com/reports/view/884? – Doug Fir Oct 25 '13 at 15:35
  • Since you said in another comment that you cannot really change your code, adding a 404 or 410 would of course be more difficult. If someone links to the URL with l=vi_VN (or to any other URL), it will still be indexed, except when you put it in your robots.txt. In this case, yes, just use the robots.txt approach. – Ronald Swets Oct 25 '13 at 18:08