3

I have a website with thousands of dynamic pages. I want to use the robots.txt file to disallow certain URL patterns corresponding to pages with duplicate content.

For example, I have a page for article itemA belonging to category catA/subcatA, with the URL:

/catA/subcatA/itemA

This is the URL that I want Google to index.

This article is also visible via tagging in various other places on the website. The URLs produced via tagging look like:

/tagA1/itemA

I do NOT want Google to index this URL. However, I do want all tag listings to be indexed:

/tagA1

So how can I achieve this? Can I disallow URLs that include a specific string followed by a '/'?

/tagA1/itemA - disallow

/tagA1 - allow

thanili

3 Answers

5

You should not use robots.txt to block duplicate content.

The first step is to stop linking to 'bad' URLs. Each article should have one canonical URL, so ideally the URL /tagA1/itemA should not exist at all. The tag page that lists the articles should link to the preferred URL, /catA/subcatA/itemA.

If for some reason that is not possible, or you have links pointing to the 'bad' URLs from elsewhere, there are two possible solutions:

  1. 301 redirect the 'bad' URL to the 'good' one. This could be done via .htaccess, especially if there are clear patterns for the redirects. This is the preferred solution.
  2. Use the rel="canonical" link tag. Details are in the Google help files.
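
As a rough illustration of both options, here is a minimal Python sketch. It assumes the site can look up each item's preferred URL from its slug; the `CANONICAL_BY_ITEM` table, the function names, and the `https://www.example.com` origin are all hypothetical and not part of any CMS.

```python
from html import escape

# Hypothetical lookup: item slug -> the one preferred URL for that item.
CANONICAL_BY_ITEM = {"itemA": "/catA/subcatA/itemA"}


def canonical_for(path):
    """Return the preferred URL for the item at `path`, or None if unknown."""
    slug = path.rstrip("/").rsplit("/", 1)[-1]
    return CANONICAL_BY_ITEM.get(slug)


def redirect_for(path):
    """Option 1: return (301, location) for a 'bad' URL, otherwise None."""
    target = canonical_for(path)
    if target is None or target == path:
        return None  # not an item page, or already the preferred URL
    return 301, target


def canonical_link(path, origin="https://www.example.com"):
    """Option 2: build the rel=canonical tag for the page's <head>."""
    target = canonical_for(path) or path
    return f'<link rel="canonical" href="{escape(origin + target, quote=True)}">'


print(redirect_for("/tagA1/itemA"))
print(redirect_for("/tagA1"))
print(canonical_link("/tagA1/itemA"))
```

With this table, only /tagA1/itemA produces a redirect; the tag listing /tagA1 and the preferred article URL are left alone.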
DisgruntledGoat
1
User-agent: Googlebot
Disallow: /tagA1/
Allow: /tagA1

If you use this, it will disallow all pages under /tagA1/ but not /tagA1 itself.
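
One way to sanity-check rules like these before deploying them is Python's standard `urllib.robotparser`. This is only a sketch: it assumes the rules are addressed to `Googlebot`, and the standard-library parser applies the first matching rule in file order, whereas Google documents that it picks the most specific (longest) matching rule. For these particular rules both approaches should agree.

```python
from urllib.robotparser import RobotFileParser

# The rules under test (user-agent name is an assumption for this sketch).
RULES = """\
User-agent: Googlebot
Disallow: /tagA1/
Allow: /tagA1
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def is_allowed(path, agent="Googlebot"):
    """Return True if `agent` may crawl `path` under RULES."""
    return parser.can_fetch(agent, path)


for path in ("/tagA1", "/tagA1/itemA", "/catA/subcatA/itemA"):
    print(path, "allowed" if is_allowed(path) else "disallowed")
```

Since robots.txt rules are prefix matches, the `Disallow: /tagA1/` line alone already leaves /tagA1 crawlable; the `Allow` line just makes the intent explicit.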

Learn more about robots.txt at http://www.robotstxt.org/robotstxt.html

Sathiya Kumar V M
1

A different approach:

If you are using a CMS (WordPress, Joomla, etc.), each CMS has separate pages for the tag listing and for the tag/item view.

So you can simply use canonical URLs, or the nofollow,noindex option with meta tags.

You already mentioned that you have thousands of dynamic URLs, so it's better to use meta tags with nofollow,noindex on each page, based on your requirements.
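
A minimal sketch of the per-page meta tag idea in Python, assuming tag/item pages can be recognized as two-segment paths whose first segment is a known tag. The `TAG_SLUGS` set and the `robots_meta` helper are hypothetical; a real CMS would set this through its own templates or options.

```python
# Hypothetical set of tag slugs known to the site.
TAG_SLUGS = {"tagA1"}


def robots_meta(path, tag_item_policy="noindex,nofollow"):
    """Return the robots meta tag for the page at `path`.

    Tag/item pages (e.g. /tagA1/itemA) get `tag_item_policy`;
    every other page, including tag listings, stays indexable.
    """
    segments = [s for s in path.split("/") if s]
    is_tag_item = len(segments) == 2 and segments[0] in TAG_SLUGS
    content = tag_item_policy if is_tag_item else "index,follow"
    return f'<meta name="robots" content="{content}">'


print(robots_meta("/tagA1/itemA"))
print(robots_meta("/tagA1"))
print(robots_meta("/catA/subcatA/itemA"))
```

The policy is a parameter because the comments on this answer argue for "noindex,follow" instead; passing `tag_item_policy="noindex,follow"` switches to that.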

Hope that makes sense.

Jobin
  • Yep, I am using Joomla, and I am thinking of setting index,nofollow on my tag listing pages, since I only need crawlers not to follow and index the individual items, not the listing itself. I was not sure, though, about crawler behavior regarding the 'nofollow' meta tag. Does it actually prevent them from indexing the linked URLs? – thanili Nov 11 '13 at 09:30
  • If you are using Joomla, you can simply manage this from the article options settings. On your list page you can set index,nofollow, and on the tag/details page you can set noindex,nofollow. Those have different layouts in Joomla. – Jobin Nov 11 '13 at 09:35
  • Sure, that will work. But keep in mind that in this case the direct URL taga1/item1 should still be noindex. – Jobin Nov 11 '13 at 09:39
  • This is what I want to achieve: to have URLs like /catA/itemA and /tagA1 indexed, and NOT to have indexed only URLs of the form tagA1/itemA. – thanili Nov 11 '13 at 09:41
  • You shouldn't use 'nofollow' anywhere on your own site. If you go with that solution (which I don't think you should) do 'noindex,follow'. – DisgruntledGoat Nov 11 '13 at 13:02
  • In fact, I am implementing canonical URLs! This seems to be a much more convenient and solid solution, since it keeps the link juice for all URLs pointing to the content. So for each unique content page source I set the canonical link tag in its head section... – thanili Nov 11 '13 at 14:55
  • that is good :) – Jobin Nov 11 '13 at 15:05