
For a website with dynamic content (new content is constantly being added), should I include only the newest content in the sitemap, or should I include everything (with a sitemap index)? What are the best practices for sitemaps, especially for large sites?

Also, is there any way to make Google (and other search engines) crawl only the pages in the sitemap?

Thanks

Update:
Also, any idea how Stack Overflow handles this? I'd like to know, but unfortunately (and understandably) they have blocked access to their sitemap.

Mee
  • How big is the site? There is a size limit for both robots.txt and the sitemap. Amazingly, many exceed both, which is why I'm asking. – Tim Post Aug 22 '10 at 18:27
  • @Tim, it's not really big for now (everything can fit in one sitemap), but I'm trying to plan ahead. – Mee Aug 23 '10 at 07:24

1 Answer


Include all pages. The purpose of the XML sitemap is to tell the search engines about all of your content, not just the new stuff.

From the sitemaps.org website (emphasis mine):

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling.

If you have a lot of content you can use multiple XML sitemaps, tied together by a sitemap index file.
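A minimal sketch of what such a sitemap index looks like (the file names, domain, and dates below are placeholders, not taken from the question):

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>http://www.example.com/sitemap-articles-1.xml</loc>
        <lastmod>2010-08-22</lastmod>
      </sitemap>
      <sitemap>
        <loc>http://www.example.com/sitemap-articles-2.xml</loc>
        <lastmod>2010-08-23</lastmod>
      </sitemap>
    </sitemapindex>

Each child sitemap can list at most 50,000 URLs; the index file is what you submit to the search engines or reference in robots.txt.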

If you have content that you don't want crawled or indexed, you need to specifically tell the search engines not to crawl and index those pages. Use a robots.txt file to block any pages or directories that you do not wish to have crawled; you can also use a robots meta tag for that. But an XML sitemap cannot tell search engines to skip pages that aren't listed in it.
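A minimal sketch of both approaches, assuming a hypothetical /private/ directory and a hypothetical draft page (the paths are placeholders):

    # robots.txt -- blocks crawling of the listed paths
    User-agent: *
    Disallow: /private/
    Disallow: /drafts/old-page.html

    <!-- robots meta tag, placed in the <head> of an individual page -->
    <meta name="robots" content="noindex, nofollow">

Note the difference: robots.txt blocks crawling, while the noindex meta tag blocks indexing, and the meta tag only works if the page is not also blocked in robots.txt (otherwise the crawler never sees it).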

John Conde
  • Thanks for your answer, I'll include everything in the sitemap. – Mee Aug 24 '10 at 00:00
  • Do you have a lib that can handle 50k+ pages? (A sketch of one approach follows these comments.) – Oct 01 '10 at 01:11
  • Are those 50k+ pages in a database? – John Conde Oct 01 '10 at 02:08
  • You do not need to place every page of your site in a sitemap. A sitemap is useful for informing search engines about pages that are available for crawling. If the search engine can already see every crawlable page, and you're not adding information about "last modified", then there's zero reason to have one. – Django Reinhardt Mar 28 '14 at 13:19
  • This answer seems somewhat conflicting with http://webmasters.stackexchange.com/a/5151/30596. Quoting @John Mueller from Google: "Using a Sitemap file won't reduce our normal crawling of your site. It's additional information, not a replacement for crawling. Similarly, not having a URL in a Sitemap file doesn't mean that it won't be indexed." – user Jun 14 '15 at 19:26
  • @buffer The two answers don't actually conflict. If you use a sitemap, this is how you use it. But you don't have to use one. – John Conde Jun 15 '15 at 00:39
  • @JohnConde: You suggest that one should include "all" the content in sitemaps, and that the purpose of a sitemap is to inform about "pages on the sites that are available for crawling." One could conclude that if a page is not in the sitemap, it is probably not available for crawling. However, the above comment from John Mueller suggests that not having a URL in the sitemap doesn't mean it won't be indexed. (I'm using indexing & crawling interchangeably here.) If the sitemap is supplemental, doesn't that mean you do not need to include "all" the content, since it's just an additional hint on top of the normal crawl? – user Jun 15 '15 at 04:21
  • I think you're reading more into my answer than necessary. If you use a sitemap, you should include all of the content you want crawled and indexed. That's all this says. – John Conde Jun 15 '15 at 12:56
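On the 50k+ pages question above: there is no single required library, but a rough sketch of one approach is to split the URL list into files of at most 50,000 entries and write a sitemap index that references them. The sketch below uses Python, with a hypothetical get_all_urls() standing in for a database query and example.com as a placeholder domain:

    # generate_sitemaps.py -- split a large URL list into <=50,000-URL sitemap
    # files and write a sitemap index that references them.
    from datetime import date

    SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
    MAX_URLS_PER_SITEMAP = 50000  # limit defined by the sitemaps.org protocol
    BASE = "http://www.example.com"  # placeholder domain

    def get_all_urls():
        """Hypothetical helper: yield every canonical URL on the site,
        e.g. by streaming rows from the database."""
        for i in range(120000):
            yield f"{BASE}/page/{i}"

    def write_sitemap(filename, urls):
        # Write one sitemap file containing the given URLs.
        with open(filename, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write(f'<urlset xmlns="{SITEMAP_NS}">\n')
            for url in urls:
                f.write(f"  <url><loc>{url}</loc></url>\n")
            f.write("</urlset>\n")

    def main():
        sitemap_files = []
        batch = []
        for url in get_all_urls():
            batch.append(url)
            if len(batch) == MAX_URLS_PER_SITEMAP:
                name = f"sitemap-{len(sitemap_files) + 1}.xml"
                write_sitemap(name, batch)
                sitemap_files.append(name)
                batch = []
        if batch:
            name = f"sitemap-{len(sitemap_files) + 1}.xml"
            write_sitemap(name, batch)
            sitemap_files.append(name)

        # Sitemap index pointing at each generated file.
        with open("sitemap-index.xml", "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write(f'<sitemapindex xmlns="{SITEMAP_NS}">\n')
            for name in sitemap_files:
                f.write(f"  <sitemap><loc>{BASE}/{name}</loc>"
                        f"<lastmod>{date.today().isoformat()}</lastmod></sitemap>\n")
            f.write("</sitemapindex>\n")

    if __name__ == "__main__":
        main()

Regenerating the files on a schedule (or whenever content changes) and submitting only sitemap-index.xml keeps the process manageable as the site grows.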