
This question is lengthy, so let me give a one-sentence summary: I need Google to de-index URLs that have parameters with certain values appended.

I have a website example.com with language translations.

There used to be many translations but I deleted them all so that only English (Default) and French options remain.

When one selects a language option, a parameter is added to the URL. For example, the home page:

https://example.com (default)
https://example.com/main?l=fr_FR (French)

I added a robots.txt to stop Google from crawling any of the language translations:

# robots.txt generated at http://www.mcanerin.com
User-agent: *
Disallow: 
Disallow: /cgi-bin/
Disallow: /*?l=

So any page containing "?l=" should not be crawled. I checked in Google Webmaster Tools (GWT) using the robots.txt testing tool, and it works.
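To illustrate how that wildcard rule matches, here is a rough Python sketch approximating Google's documented * semantics (the helper google_rule_matches is my own invention, and the URLs are the placeholders from this question, so treat it as an approximation rather than Google's actual matcher):

import re
from urllib.parse import urlsplit

def google_rule_matches(rule: str, url: str) -> bool:
    # Approximate Google's robots.txt matching: the rule is anchored at the
    # start of the path, "*" matches any run of characters, and a trailing
    # "$" anchors the end of the URL.
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, target) is not None

print(google_rule_matches("/*?l=", "https://example.com/main?l=fr_FR"))      # True: blocked
print(google_rule_matches("/*?l=", "https://example.com/reports/view/884")) # False: crawlable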

But under HTML Improvements, the previously crawled language-translation URLs remain indexed. The internet says to return a 404 status for the removed URLs so Google knows to de-index them.

I checked to see what my CMS would throw up if I visited one of the URLs that should no longer exist.

This URL was listed in GWT under duplicate title tags (one of the reasons I want to clean up my URLs):

https://example.com/reports/view/884?l=vi_VN&l=hy_AM

This URL should not exist - I removed the language translations. The page loads when it should not! I played around and typed example.com?whatever123

It seems that parameters always load as long as everything before the question mark is a real URL.

So if Google has indexed all these URLs with parameters, how do I remove them? I cannot tell whether a 404 is being generated, because the page always loads; it is the parameterized URL itself that needs to be de-indexed.
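(For anyone reproducing this: the status code can be checked from a script rather than a browser. A minimal sketch using only Python's standard library, with the placeholder URL from above; status_of is a made-up helper name:)

import urllib.error
import urllib.request

def status_of(url: str) -> int:
    # Return the HTTP status code, including error codes.
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 404, 410, etc. arrive as HTTPError

# A 200 here confirms the parameterized URL still "loads" rather than 404ing.
print(status_of("https://example.com/reports/view/884?l=vi_VN&l=hy_AM"))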


3 Answers


The robots.txt below lets search engines index every permalink except those containing a ?, =, or & symbol:

User-agent: *
Disallow: 
Disallow: /cgi-bin/
Disallow: /*?*
Disallow: /*?
Disallow: /*=*
Disallow: /*=
Disallow: /*&*
Disallow: /*&
Allow: /

The following blocks every permalink of the form /main?l=de_DE or /main?l=Any_Value_Here from search engines, while excluding /main?l=fr_FR from the block:

User-agent: *
Disallow: 
Disallow: /cgi-bin/
Disallow: /main?l=*
Allow: /
Allow: /main?l=fr_FR
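When Allow and Disallow rules overlap like this, Google applies the most specific (longest) matching rule, and Allow wins ties. A self-contained Python sketch of that precedence, using the rules above (the matches and allowed helpers are my own approximation, not Google's implementation):

import re

RULES = [
    ("Disallow", "/cgi-bin/"),
    ("Disallow", "/main?l=*"),
    ("Allow", "/"),
    ("Allow", "/main?l=fr_FR"),
]  # the empty "Disallow:" matches nothing, so it is omitted here

def matches(pattern: str, target: str) -> bool:
    # "*" matches any run of characters; the pattern is anchored at the start.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, target) is not None

def allowed(target: str) -> bool:
    applicable = [(len(p), verb) for verb, p in RULES if matches(p, target)]
    if not applicable:
        return True
    # Longest (most specific) rule wins; Allow beats Disallow on a tie.
    applicable.sort(key=lambda t: (t[0], t[1] == "Allow"))
    return applicable[-1][1] == "Allow"

print(allowed("/main?l=fr_FR"))  # True: the explicit Allow is the longest match
print(allowed("/main?l=de_DE"))  # False: caught by Disallow: /main?l=*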
  • That would deindex all parameters, not just a value of de_DE, while allowing fr_FR to be crawled and indexed. – Stephen Ostermiller Oct 24 '13 at 11:59
  • @Stephen Ostermiller - I had to edit my answer. – Kamarul Anuar Oct 24 '13 at 17:02
  • Thanks for the helpful information. But, due to the navigation of the unique platform that my website is built upon, I would not like to exclude all links that contain ?, =, or &, since some of them are paths to legitimate pages. It's only those that contain "?l=" that should be de-indexed. – Doug Fir Oct 25 '13 at 15:30

I would rewrite your URLs so that the language is a directory:

/main
/fr_FR/main

That has two advantages:

  • Robots.txt can be used to block certain languages but not others (without resorting to wildcards)
  • You can add directories to Google Webmaster Tools and change their geographic targeting (in this case, /fr_FR/ should be geo-targeted to France)

See: How should I structure my URLs for both SEO and localization?

If you can't do that, you could noindex content based on parameter values. The easiest option is a meta robots noindex tag, output only when the parameter has a value that you don't want Google to include in the index.
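A minimal sketch of that idea, assuming a Flask-style app (the /main route, the ALLOWED_LANGS set, and the page body are placeholders, not the asker's actual CMS):

from flask import Flask, request

app = Flask(__name__)
ALLOWED_LANGS = {"fr_FR"}  # English is the default and carries no parameter

@app.route("/main")
def main():
    lang = request.args.get("l")
    # Emit noindex only when the l= parameter names a removed translation.
    robots = ""
    if lang is not None and lang not in ALLOWED_LANGS:
        robots = '<meta name="robots" content="noindex">'
    return f"<html><head>{robots}</head><body>Home page</body></html>"

Keep in mind that Google must be able to crawl the page to see the tag, so this approach conflicts with a robots.txt Disallow on the same URLs.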

– Stephen Ostermiller
  • Thanks for the info Stephen. I would, except that the site is built on a platform and I have limited ability to edit it on my own. I was hoping there would be something I could do within GWT or robots.txt. – Doug Fir Oct 25 '13 at 15:33

You can do several things:

  1. robots.txt - although take good care what you add in there, as it can ALSO deindex some correct URLs (i.e. if Google did index https://example.com/reports/view/884?l=fr_FR, you don't just want to lose it, right?).

Which brings me to the second:

  2. Since you want to "deindex" old URLs, why not 301 them to "correct" ones, or return a hard 404 or 410 on the truly incorrect ones (see the sketch after this list).

  3. Use rel="canonical" so you tell Google what the correct page is (although it is "other" content).

  4. Use Google Webmaster Tools and specify what a parameter is used for. For the l= parameter you could specify "Translates".
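For options 2 and 3, a rough Flask-style sketch (REMOVED_LANGS and the route are hypothetical placeholders, not the asker's platform):

from flask import Flask, abort, redirect, request

app = Flask(__name__)
REMOVED_LANGS = {"vi_VN", "hy_AM", "de_DE"}  # hypothetical: the deleted translations

@app.route("/reports/view/<int:report_id>")
def report(report_id: int):
    lang = request.args.get("l")
    if lang in REMOVED_LANGS:
        # Option 2: a hard 410 tells Google the translation is gone for good...
        abort(410)
        # ...or a 301 to the canonical URL would also de-index the variant:
        # return redirect(f"/reports/view/{report_id}", code=301)
    # Option 3: the rendered page would carry
    # <link rel="canonical" href="https://example.com/reports/view/884">
    return f"Report {report_id}"

This keeps https://example.com/reports/view/884 itself returning 200, while the ?l= variants for removed languages return 410.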

– Ronald Swets
  • I thought that my robots.txt should work? The old URLs do not exist any more, but they still load, since it's a parameter and the link up to the question mark still works. How would I add a 404 or 410 to, for example, https://example.com/reports/view/884?l=vi_VN&l=hy_AM without affecting https://example.com/reports/view/884? – Doug Fir Oct 25 '13 at 15:35
  • Since you said in another comment that you cannot really change your code, adding a 404 or 410 would of course be more difficult. If someone links to the URL with l=vi_VN (or to any other URL), it will still be indexed, except when you put it in your robots.txt. In this case, yes, just use the robots.txt approach. – Ronald Swets Oct 25 '13 at 18:08