
I was recently asked to create a sitemap for an older Apache website with over 20,000 database-generated pages. However, halfway through the project I found out that the client had neglected to mention that the total inventory was actually more than 12 million pages.

The client wouldn't listen to my concerns about the impracticality of a sitemap that large, and kept insisting that he needed to catalog all of the pages to get more page views. He kept citing the fact that it is possible to create multiple sitemaps, and that the limit for a single sitemap is 50,000 pages.

I tried out several freeware sitemap generators, and the best of them ran out of memory and crashed at 20,000 pages. Most commercial sitemap generators also tended to have a maximum capacity of 5,000 pages. I could only find one product claiming to be capable of crawling more than 1 million pages, but I wasn't being paid enough to spend months waiting for a crawler to slog through 12 million pages.

I cancelled the job and tried providing the client with some alternative SEO resources that would do a lot more to increase page views, but it really bothers me that I was caught so off guard by a freelance request. I try to specialize in SEO and WordPress maintenance, and I'm accustomed to handling larger sites with a few hundred pages, or even a few thousand pages. In WordPress it usually takes me less than 10 minutes to generate a sitemap.

I did try researching the subject further, and apparently it is possible to write a custom script that will crawl and catalog over a million pages. It doesn't really seem to be a productive use of time or resources, nor is it particularly beneficial to SEO, but it has been done before.

I guess I am just curious whether it is common in the web industry to encounter websites requiring massive sitemaps of more than 20,000 pages, or whether I was simply too inexperienced to handle the programming requirements. 12 million seemed like an unreasonable number to me, and I couldn't determine any advantage such a sitemap would have for the website in question.

  • Not an answer, but similar question on Stack Overflow: XML Sitemap Generator for URL with 1.5 million pages? – Steve Jun 13 '17 at 00:50
  • I have created scripts that generated sitemaps for a site with nearly 2 million pages. Sitemaps only benefit extremely large sites that cannot be crawled properly, or sites with content behind a paywall or login. Creating a sitemap the way you are trying to is not practical; it requires coding. Short of allowing pages to be found by search engines, there is absolutely no SEO value in having a sitemap. None whatsoever. Cheers!! – closetnoc Jun 13 '17 at 01:03
  • BTW, if you can write code, and depending upon how the site is created (I am assuming a database of some kind), it should be trivial to write code to generate a large-scale sitemap. I wrote mine in less than an hour, but then again, I coded my own CMS, so it was just a matter of running a query and then generating the files. – closetnoc Jun 13 '17 at 01:24
  • "needed to catalog all of the pages to get more page views" -- Absolutely FALSE. Sitemaps have almost no effect on rankings. At best they are a diagnostic tool that gives you additional insight into the pages they contain. See The Sitemap Paradox – Stephen Ostermiller Jun 13 '17 at 09:00
  • Like @closetnoc I'd recommend generating the sitemap based on a database query rather than based on crawling the site. I've implemented sitemaps with 200 million URLs that regenerate daily from the database. – Stephen Ostermiller Jun 13 '17 at 09:03
  • Yeah, I tried explaining that sitemaps don't really work that way, and that addressing SERPs, keywords, and domain authority would be far more beneficial for SEO and SEM than a sitemap indexing a ton of low-priority product pages. I guess it is sometimes difficult to convey all the technicalities of how search engines work in concise emails. I recommended that he try to find a programmer who could create a custom query script for his site, since freeware and commercially licensed crawlers didn't seem to be a viable option at that scale. – jcongerkallas1 Jun 13 '17 at 14:22
  • Thanks for the tips. Do you happen to know what language the query script would need to be in, and what page the code would need to be placed on? The site files were PHP. I would assume the database was SQL, but I didn't really get a chance to look. – jcongerkallas1 Jun 13 '17 at 14:26
  • I personally use other languages; however, I would strongly recommend PHP because it is already installed, it is easy to find a programmer for that language, and there may be some inherent benefit to using the same language the site is coded in. – closetnoc Jun 13 '17 at 15:19
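A minimal sketch of the query-based approach described in the comments above, written in PHP since that is what the site runs on. Everything specific here is an assumption for illustration only: a MySQL database with a products table that has slug and updated_at columns, pages served at /product/{slug}, and placeholder credentials. The script streams rows from the database, writes sitemap files of at most 50,000 URLs each, and finishes with a sitemap index that points to them.

    <?php
    // Sketch: generate sitemap files (max 50,000 URLs each) plus a sitemap index
    // directly from a database query, instead of crawling the site.
    // Assumptions (hypothetical): a "products" table with "slug" and "updated_at"
    // columns, pages at /product/{slug}, sitemaps served from /sitemaps/.

    $baseUrl   = 'https://www.example.com';
    $outputDir = __DIR__ . '/sitemaps';
    $perFile   = 50000; // protocol limit per sitemap file

    $pdo = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'user', 'password', [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
    // Unbuffered query so 12 million rows never sit in memory at once.
    $pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

    if (!is_dir($outputDir)) {
        mkdir($outputDir, 0755, true);
    }

    $fileIndex = 0;
    $count     = 0;
    $handle    = null;
    $sitemaps  = [];

    $stmt = $pdo->query('SELECT slug, updated_at FROM products ORDER BY id');

    foreach ($stmt as $row) {
        // Start a new sitemap file every 50,000 URLs.
        if ($count % $perFile === 0) {
            if ($handle) {
                fwrite($handle, "</urlset>\n");
                fclose($handle);
            }
            $fileIndex++;
            $filename   = sprintf('sitemap-%d.xml', $fileIndex);
            $sitemaps[] = $filename;
            $handle     = fopen("$outputDir/$filename", 'w');
            fwrite($handle, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            fwrite($handle, "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        }
        $loc     = htmlspecialchars($baseUrl . '/product/' . rawurlencode($row['slug']), ENT_XML1);
        $lastmod = date('c', strtotime($row['updated_at']));
        fwrite($handle, "  <url><loc>$loc</loc><lastmod>$lastmod</lastmod></url>\n");
        $count++;
    }

    if ($handle) {
        fwrite($handle, "</urlset>\n");
        fclose($handle);
    }

    // Write the sitemap index that points at every generated file.
    $index = fopen("$outputDir/sitemap-index.xml", 'w');
    fwrite($index, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    fwrite($index, "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
    foreach ($sitemaps as $filename) {
        fwrite($index, "  <sitemap><loc>$baseUrl/sitemaps/$filename</loc></sitemap>\n");
    }
    fwrite($index, "</sitemapindex>\n");
    fclose($index);

    echo "Wrote $count URLs across " . count($sitemaps) . " sitemap files.\n";

Run from cron (or manually) on the server, this regenerates the whole set in one pass; because it queries the database directly, it avoids the memory and time problems of crawler-based generators.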

1 Answer


You should submit a sitemap of all pages you believe should be in Google's search index.

If you have millions of pages, you'll need to use a sitemap index, which is a single file that references a collection of individual sitemap files.

http://sitemaps.org/protocol.php

Each sitemap file may contain no more than 50,000 URLs and must be no larger than 10 MB.
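At 50,000 URLs per file, 12 million pages works out to at least 240 sitemap files. The index that ties them together follows the sitemapindex format from the protocol page above; the hostname and filenames here are placeholders:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>https://www.example.com/sitemaps/sitemap-1.xml</loc>
        <lastmod>2017-06-13</lastmod>
      </sitemap>
      <sitemap>
        <loc>https://www.example.com/sitemaps/sitemap-2.xml</loc>
        <lastmod>2017-06-13</lastmod>
      </sitemap>
      <!-- ... one <sitemap> entry per file, up to sitemap-240.xml ... -->
    </sitemapindex>

You submit just the index file, and search engines fetch the individual sitemaps it lists.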

To get the most out of this protocol I'd suggest creating sitemaps that map to a category or page type (or combination) so you can determine indexation rates. You often find that some pages or categories are better indexed than others. It's then your job to figure out why.

https://www.quora.com/If-I-have-a-website-with-millions-of-unique-pages-should-I-submit-a-partial-sitemap-to-Google

Steve