
I have a website - www.example.com - which is listed on Google: the domain is verified in Search Console, sitemaps are submitted, and canonical URLs are all in place. This is my only domain that is "out there" in the wild.

www.example.com is configured in DNS with a CNAME that points to another domain, which in turn resolves to an IP, as follows:

www.example.com -> CNAME = example.cloudhost.com -> A = 192.0.2.255
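
For reference, the chain can be confirmed with a quick lookup. A minimal sketch using Python's standard library (the hostnames and IP are just the placeholders above):

    import socket

    # Resolve www.example.com and show the canonical name it CNAMEs to,
    # plus the final A records.
    # gethostbyname_ex returns (canonical_name, alias_list, ip_list).
    canonical, aliases, ips = socket.gethostbyname_ex("www.example.com")
    print("Canonical host:", canonical)  # e.g. example.cloudhost.com
    print("Aliases:", aliases)           # e.g. ['www.example.com']
    print("A records:", ips)             # e.g. ['192.0.2.255']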

For dev/test/debugging reasons I have a binding on my web server that responds to example.cloudhost.com. This means that if you navigate to any URL on that domain, you get exactly the same responses as on www.example.com, including all canonical tags, which point back to www.example.com.

example.cloudhost.com has not been published ANYWHERE on the Internet, let alone given directly to Google. It is not exposed via any HTML tags, sitemaps, Search Console submissions, social media posts, or third-party websites. Nothing.

--- edit ---

I now realise it could have been leaked via the HTTP referrer header or by analytics trackers running on example.cloudhost.com while it was used for testing.

---

However, somehow Google has indexed example.cloudhost.com as a duplicate of www.example.com.

How did Googlebot get hold of example.cloudhost.com to start indexing it and is there any documentation of its methods? Does Googlebot look at DNS CNAME chains and speculatively try to crawl content on those intermediate domains?


1 Answer


I highly doubt that the CNAME is the only reference to example.cloudhost.com. There are many ways that URLs leak. See "Can a URL that is not linked to from anywhere be discovered?" for a more complete list, but it includes:

  • Emails with the link sent through Gmail
  • Third party JS used on your site, especially analytics or ads which send data to Google
  • External links from your site that send a referrer that is subsequently published by the other site.

It is possible that Googlebot looks at the hostname in the CNAME chain to discover the site. I've not seen evidence that Googlebot does this, but it wouldn't surprise me if it did. Googlebot does a lot of speculative crawling:

  • Google watches new domain name registrations and tries to crawl all newly registered sites.
  • Googlebot picks strings out of JS files that look like they might be URLs and crawls them. Pretty much any string literal with no spaces and a slash in it will get hit by Googlebot (a toy sketch of this kind of extraction follows this list).
  • Googlebot submits simple forms with some random options just to see if there is content behind the form.
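
To make the string-extraction point concrete, here is a toy sketch of that kind of heuristic. It is my own illustration, not Googlebot's actual logic: pull quoted literals that contain a slash and no whitespace out of some JavaScript source and treat them as candidate URLs.

    import re

    # Toy JavaScript source with two URL-like literals and one plain string.
    js_source = """
    var api = "/api/v1/items";
    var cdn = 'https://cdn.example.com/assets/app.js';
    var label = "not a url";
    """

    # A quoted literal containing at least one slash and no whitespace.
    pattern = r"""(['"])([^\s'"]*/[^\s'"]*)\1"""

    candidate_urls = [m.group(2) for m in re.finditer(pattern, js_source)]
    print(candidate_urls)  # ['/api/v1/items', 'https://cdn.example.com/assets/app.js']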

The bottom line is that you can't rely on secrecy to keep your URLs out of Google. Googlebot is very good at finding private URLs. See "Can secret URLs be used to protect files?"

There are several ways to fix this:

  1. Include canonical tags in each of your pages pointing to the full URL of the page on www.example.com. Then even if Googlebot crawls the cloud host URL, it will know to index the URL on your domain. Note that canonical tags won't stop Googlebot from crawling the entire alternate domain. Googlebot will even come back and crawl the alternate domain many times. The canonical tags will almost always cause Google to index each page on your preferred domain instead.
  2. Reconfigure your server to redirect the cloud host URL to your domain. You said that you had it configured this way for testing and development, but presumably now that the domain is live you don't need that. If you do still need it, you could redirect every request except those coming from your own IP address. (A minimal sketch combining fixes 2 and 3 follows this list.)
  3. Configure your server to serve different robots.txt files for the different host names. Disallow all crawling for the cloud host domain name.
  4. Configure your server to password protect the site on the cloud host domain name. It is usually fairly easy to set up digest authentication that can do that.
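
Here is the sketch referenced in fix 2: host-based handling in Flask, which is just my choice of framework for illustration since the question doesn't say what server is in use. The hostnames are the placeholders from the question and the allowed IP is a hypothetical development address. In practice you would pick one of the fixes rather than stacking them; they are combined here only to keep the sketch short.

    from flask import Flask, Response, redirect, request

    app = Flask(__name__)

    CANONICAL_HOST = "www.example.com"
    DEV_HOSTS = {"example.cloudhost.com"}
    ALLOWED_DEV_IPS = {"203.0.113.10"}  # hypothetical: your own IP, exempt from the redirect

    @app.before_request
    def redirect_dev_host():
        host = request.host.split(":")[0]  # drop any port
        # Fix 2: 301 every request that arrives on the dev hostname over to the
        # canonical domain, unless it comes from an allowed development IP.
        # robots.txt is left reachable so crawlers can still read the Disallow.
        if (
            host in DEV_HOSTS
            and request.remote_addr not in ALLOWED_DEV_IPS
            and request.path != "/robots.txt"
        ):
            target = f"https://{CANONICAL_HOST}{request.full_path.rstrip('?')}"
            return redirect(target, code=301)

    @app.route("/robots.txt")
    def robots():
        # Fix 3: serve a blocking robots.txt only on the dev hostname.
        if request.host.split(":")[0] in DEV_HOSTS:
            body = "User-agent: *\nDisallow: /\n"
        else:
            body = "User-agent: *\nAllow: /\n"
        return Response(body, mimetype="text/plain")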

Let me put it this way: I've always had Googlebot come crawl my development and staging URLs if I leave them open to the public. It doesn't matter how secret I think I've kept them. Even if I'm the only one that ever uses them, Googlebot seems to find them. I always protect them now, either with a password, or by doing my development on a private network.

  • You're right that I can't assume my domain is secret, and I take your points about leakage - mainly 3rd-party/analytics data sent to GA. My point was more "how does Googlebot discover domains, and are there docs or community acknowledgement of how it does so?" If Googlebot uses GA data or CNAME chains to learn about domains, it would be good to know that officially.

    Fix 1 is not quite correct. I use canonical tags and Google still indexed the other domain. I think canonical tags are just a hint - Google does as it pleases in the end.

    Other points very valid (I solved my issue by removing unneeded dev binding).

    – theyetiman Jul 02 '20 at 12:07
  • You are correct, canonicals are just a hint, and Google is ignoring them more often than ever these days. They should be very effective eventually, though. Since the content is 100% the same, Google will see actual duplicates. Your main domain will eventually have a lot more reputation than the cloud host subdomain, so Google will see the canonical and it will be a no-brainer for their bot to obey it. Since your site is new, that may not be the case yet, and Google may see the content on the cloud host domain as more likely to be authoritative. – Stephen Ostermiller Jul 02 '20 at 12:19
  • Google rarely releases official documentation or talks about specifics of how their crawler works. At best we could run an experiment and see if Googlebot crawls a host name that we make up and are very careful to make sure is only ever mentioned in a DNS CNAME. I don't know of anybody that has done that experiment so far though. – Stephen Ostermiller Jul 02 '20 at 12:22
  • Interesting info. I should have clarified... www.example.com has already had a very good domain reputation for upwards of 20 years, while the CNAME example.cloudhost.com is much newer (<1yr), so it's interesting that it got indexed at all. The worry is that it will be seen as duplicate content and actually detract from our main domain's reputation. – theyetiman Jul 02 '20 at 12:23
  • How are you verifying that the cloud host subdomain is indexed? Are you using site: queries in Google? They don't show which domain is actually indexed. If you search for site:example.cloudhost.com Google will show results that are not actually indexed on that domain but which look like they are in that search result. If you search for phrases from the pages in quotes you should see which domain is actually indexed. You could also add the cloudhost subdomain to Google Search Console and get stats about what Google has actually indexed on it. – Stephen Ostermiller Jul 02 '20 at 12:28
  • For a 20-year-old domain with canonical URLs in place, I wouldn't worry about the reputation of your site suffering. Scraper sites pop up all the time with copies of sites and no canonicals. Google is usually good about preferring the original site and not penalizing it in any way. – Stephen Ostermiller Jul 02 '20 at 12:29
  • Re: verifying indexation. It actually popped up whilst searching for specific "quoted terms" rather than using site:example.cloudhost.com, but it's super useful to know that site: will show things that aren't actually indexed! – theyetiman Jul 02 '20 at 12:41
  • I'll also add that I have one public development site. I keep it public because I have some API users that need access to it. It has tens of thousands of pages with canonical URLs in place. Googlebot was crawling it so much I ended up doing some light cloaking on it. I put <base href="https://example.com/"> into the head of every page when the user agent is a bot. That way when bots find one page on the dev site, all the links on that page point to the live site. That really reduces the amount that bots (especially Googlebot) end up crawling on the dev site. – Stephen Ostermiller Jul 02 '20 at 12:43
  • That's really useful, thanks. – theyetiman Jul 02 '20 at 15:52