I have a website - www.example.com - which is listed on Google, with the domain verified in Search Console, sitemaps submitted, and canonical URLs all in line. This is my only domain that is "out there" in the wild.
www.example.com is configured in DNS using a CNAME which points to another domain, which finally points to an IP. As follows:
www.example.com -> CNAME = example.cloudhost.com -> A = 192.0.2.255
For dev/test/debugging reasons I have a binding on my web server which will respond to example.cloudhost.com, which means if you navigate to any URL on that domain, you'll get the exact same responses as www.example.com, including all canonical tags pointing back to www.example.com.
example.cloudhost.com has not been published ANYWHERE on the Internet, let alone given directly to Google. It is not exposed via any HTML tags, sitemaps, Search Console submissions, social media posts, or 3rd party websites. Nothing.
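To make the "same response, same canonical tag" setup concrete, here is a minimal sketch of how one might check which canonical URL a page declares, whichever hostname served it. The sample HTML and the names CanonicalFinder and find_canonical are assumptions for illustration, not part of my actual site.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical one is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page body, as it would be served on BOTH hostnames.
SAMPLE_PAGE = """<html><head>
<link rel="canonical" href="https://www.example.com/about">
</head><body>About us</body></html>"""

print(find_canonical(SAMPLE_PAGE))
```

Running this against the body fetched from either hostname should report the same www.example.com URL if the canonical tags really are identical on both.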
--- edit ---
I now realise it could be leaked via the HTTP referrer or analytics trackers running on example.cloudhost.com when it is used for testing.
---
However, somehow Google has indexed example.cloudhost.com as a duplicate domain to www.example.com.
How did Googlebot get hold of example.cloudhost.com to start indexing it and is there any documentation of its methods? Does Googlebot look at DNS CNAME chains and speculatively try to crawl content on those intermediate domains?
Fix 1 is not quite correct. I use canonical tags and Google still indexed the other domain. I think canonical tags are just a hint - Google does as it pleases in the end.
Other points very valid (I solved my issue by removing unneeded dev binding).
– theyetiman Jul 02 '20 at 12:07
www.example.com already has a very good domain reputation for upwards of 20 years, the CNAME example.cloudhost.com is much newer (<1yr) so it's interesting that it got indexed at all. The worry is that it will be seen as duplicate content and actually detract from our main domain's reputation. – theyetiman Jul 02 '20 at 12:23
site: queries in Google? They don't show which domain is actually indexed. If you search for site:example.cloudhost.com Google will show results that are not actually indexed on that domain but which look like they are in that search result. If you search for phrases from the pages in quotes you should see which domain is actually indexed. You could also add the cloudhost subdomain to Google Search Console and get stats about what Google has actually indexed on it. – Stephen Ostermiller Jul 02 '20 at 12:28
"quoted terms" rather than using site:example.cloudhost.com but that's super useful to know it will show things that aren't indexed! – theyetiman Jul 02 '20 at 12:41
<base href="https://example.com/"> into the head of every page when the user agent is a bot. That way when bots find one page on the dev site, all the links on that page point to the live site. That really reduces the amount that bots (especially Googlebot) end up crawling on the dev site. – Stephen Ostermiller Jul 02 '20 at 12:43
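The base href idea from the last comment could be sketched roughly like this. It is only an illustration under stated assumptions: the bot check is a naive substring match on the User-Agent, and the names looks_like_bot and inject_base_href are made up for the example rather than taken from any framework. Note that a base tag only changes how relative links resolve; absolute links on the page are left untouched.

```python
import re

# Live site root, as given in the comment above.
LIVE_BASE = "https://example.com/"

# Assumption: a crude User-Agent heuristic; real bot detection is more involved.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)

# Matches the opening <head> tag (with or without attributes), but not <header>.
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def looks_like_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string resembles a crawler."""
    return bool(BOT_PATTERN.search(user_agent or ""))


def inject_base_href(html: str, user_agent: str, base: str = LIVE_BASE) -> str:
    """Insert a <base> tag right after <head> when the visitor looks like a bot."""
    if not looks_like_bot(user_agent):
        return html  # Human visitors get the page unchanged.
    base_tag = f'<base href="{base}">'
    # Use a function replacement so the tag text is never treated as a regex template.
    return HEAD_OPEN.sub(lambda m: m.group(0) + base_tag, html, count=1)


page = "<html><head><title>T</title></head><body><a href='/about'>About</a></body></html>"
googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(inject_base_href(page, googlebot))
```

In practice this would sit in whatever response filter or middleware the dev binding already passes through, so that relative links a crawler discovers on the dev hostname resolve to the live site instead.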