
Well, robots.txt prevents crawling, while the meta robots tag in HTML (or the X-Robots-Tag HTTP header) prevents indexing (and offers other directives too).

So, even when a URL is disallowed in robots.txt, it can still be indexed by Google if it is referenced somewhere via an anchor tag.
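
To make that distinction concrete, here is a minimal sketch using Python's standard-library robots.txt parser and the CDN URLs from this question. It only demonstrates the crawl check: a Disallow rule answers "may I crawl this?", not "may I list this URL in the index?".

```python
from urllib.robotparser import RobotFileParser

# Parse the CDN host's robots.txt (the same file discussed in the comments below).
rp = RobotFileParser()
rp.set_url("https://scontent.fmaa1-2.fna.fbcdn.net/robots.txt")
rp.read()

image_url = ("https://scontent.fmaa1-2.fna.fbcdn.net/v/t1.0-9/"
             "21616414_867163746798211_8462429064810946636_n.jpg")

# If the file disallows this path, can_fetch() returns False - but that only
# means "do not crawl". A URL discovered through an external link can still be
# listed in the index without any cached content or description.
print(rp.can_fetch("Googlebot", image_url))
```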

So, my question is: how can I prevent a public CDN URL from being indexed by Google?

For example, I upload an image on Facebook which is private. But the CDN URL holding the data is actually public, so anyone with the complete link can access the content. How does Facebook prevent these URLs from being indexed?

Example textual URL: https://scontent.fmaa1-2.fna.fbcdn.net/v/t1.0-9/21616414_867163746798211_8462429064810946636_n.jpg?oh=8964e784c64486e307a0fac58e66d79a&oe=5A3C658E

Now, I posted the above URL here. Will Google index this URL? Update: Yes, this is indexed as text content.

How about this anchor URL: FBCDNLINK.

Will this be indexed if it is referenced this way?

Update:

When I say "link indexing", I'm not talking about this kind of content/search indexing, but rather this kind of link indexing.

Example: This URL is disallowed in the site's robots.txt, but you can see that it is actually indexed.

My question is: how does Facebook prevent the FBCDN URL above from being indexed when you search Google like this: site:fbcdn.net inurl:21616414_867163746798211_8462429064810946636_n.jpg

That search returns nothing, and my question is how Facebook achieves this.
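
One way to investigate is to inspect the response headers the CDN returns for such a file: an indexing directive such as X-Robots-Tag: noindex (which Google honours as long as crawling of the URL is not blocked) would keep the file out of the index regardless of who links to it. A minimal sketch follows; the hostname and path are placeholders, not a real endpoint.

```python
import requests  # third-party HTTP client, assumed installed

# Hypothetical CDN object URL; substitute the file you actually want to check.
url = "https://cdn.example.net/v/some-user-upload.jpg"

resp = requests.head(url, allow_redirects=True, timeout=10)
print(resp.status_code)
# If the CDN sends an indexing directive, it will appear here, e.g. "noindex, nofollow".
print(resp.headers.get("X-Robots-Tag", "<no X-Robots-Tag header>"))
```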

Reference 1: https://www.youtube.com/watch?v=KBdEwpRQRD0

Reference 2: https://support.google.com/webmasters/answer/6062608?hl=en

Reference 3: http://tools.seobook.com/robots-txt/ (check the table at the bottom)



  • Google can't index something if it's not allowed to crawl it. The fact is, example.com cannot tell Google not to index webmasters.stackexchange.com. There is no robots.txt on fbcdn.net; it redirects to facebook.com, therefore it does not have one. https://scontent.fmaa1-2.fna.fbcdn.net/robots.txt is set to allow... there's not much of a question here. – Simon Hayter Sep 21 '17 at 11:28
  • No, you're wrong. Please check the three references I've posted. robots.txt can't prevent indexing completely. – verstappen_doodle Sep 21 '17 at 11:29
  • I meant to say it's set to disallow, which is good... but can you show me where that image is indexed? – Simon Hayter Sep 21 '17 at 11:29
  • I see that scontent.fmaa1-2.fna.fbcdn.net/robots.txt is set to disallow. https://www.google.co.in/search?q=site:fbcdn.net+inurl:jpg&num=100&filter=0&biw=1535&bih=810 – verstappen_doodle Sep 21 '17 at 11:30
  • Could someone please add https:// to all the URLs in my question? I'm unable to do it as it requires reputation. The question is not meaningful without valid anchor markdown. – verstappen_doodle Sep 21 '17 at 11:35
  • Then what's the issue? Google sometimes provides sample data for usages of site: and only includes "A description for this result is not available because of this site's robots.txt" when few real results are available. The only reason it displays that image you linked is because it is not a JPG, it is HTML if you look more closely. There is no way to prevent Google, especially on a domain that you don't own or control. No keyword or filename search will ever reveal your images... – Simon Hayter Sep 21 '17 at 11:38
  • FBCDN - I gave that as an example. I run a company and host my user data on a CDN. But it got indexed in Google, even though it is disallowed in robots.txt. That made me post here. (Sorry, I couldn't give the exact CDN link, as it is my customers' data.) And a filename search does give results in Google in my case. :( https://www.google.co.in/search?q=site:fbcdn.net+inurl:jpg&num=100&filter=0&biw=1535&bih=810 Many of them are pure images here, not HTML (except the first one). – verstappen_doodle Sep 21 '17 at 11:40
  • It's not indexed though... "A description for this result is not available because of this site's robots.txt" implies that the URL has been discovered but not crawled. If you embed an element on a page somewhere, it's hardly secret and Google will discover it. If images or other resources are sensitive then you should use authentication to block the resource from being accessed. Keyword searches will never return "A description for this result is not available because of this site's robots.txt". And unless your customer doesn't care about his/her rankings, you should never block a CDN in any case. – Simon Hayter Sep 21 '17 at 11:46
  • It is indexed but not actually crawled. That's why I've asked you to see all references in my question. That'll answer what is my real issue here. The content indexed is images in my case, and I don't know how to prevent it. Many of my links in format [] () is not visible, please edit them. I couldn't do it. That will make the question more readable. – verstappen_doodle Sep 21 '17 at 11:51
  • The URL is indexed, the content is not indexed. Hence no cache, no description, no rankings. X-Robots-Tag: noindex, nofollow is more reliable than robots.txt and should remove the URLs from the index. – Simon Hayter Sep 21 '17 at 11:54
  • The issue is, suppose my CDN URL is unixrootcdn.net: with a site: and inurl: search in Google, it reveals one of my customers' image data, and he reported it to me. CDNs are public, and anyone with the complete link can access the content (the URL has a hashed key - it works like FBCDN). I can't put authentication on this. X-Robots-Tag: noindex, nofollow requires me to allow crawling in robots.txt, and I feel that is dangerous. There are cases where some bots don't obey the robots meta tag or HTTP header. – verstappen_doodle Sep 21 '17 at 11:55
  • Then you have the resource embedded somewhere on a public page... hardly a secret. If you want it blocked entirely, do not embed it on a public page. Google only discovers what it has access to. Furthermore, use X-Robots-Tag: noindex, nofollow in the response headers of the CDN, and on the public website you can use X-Robots-Tag: noimageindex. – Simon Hayter Sep 21 '17 at 11:57
  • Use both robots.txt and X-Robots-Tag: noindex, nofollow on the CDN; that will block resources from being indexed. But be aware that blocking resources on a CDN can and does affect SEO on the public website: don't expect to be rewarded for content that Google cannot scan. – Simon Hayter Sep 21 '17 at 11:58
  • Having robots.txt with a disallow rule won't let Google's bots see the robots header or meta tag. Using the latter requires me to enable crawling in robots.txt. That will open a new set of issues when malicious bots crawl the data. – verstappen_doodle Sep 21 '17 at 11:59
  • Yeah, I see your point. X-Robots-Tag HTTP header: use it if you need to control how non-HTML content is shown in search results (or to make sure that it's not shown). So you could remove the robots.txt block and let Google find the noindex; Google will crawl but not index, and since images do not link from one to another, there is no further crawling. Only JS/HTML/PHP can be crawled, but again, no indexing will occur on the CDN (a sketch of this header setup follows the thread). – Simon Hayter Sep 21 '17 at 12:04
  • Can you please edit the question with proper links btw? – verstappen_doodle Sep 21 '17 at 12:09
  • Done. And btw, using nofollow with noindex will prevent discovering new resources on the CDN too, so ignore the part where I said JS/HTML/PHP etc. Also, add the CDN to your Google Search Console as a new property; that way you can watch the behaviour. – Simon Hayter Sep 21 '17 at 12:20
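
Following the approach the thread converges on (serve X-Robots-Tag: noindex, nofollow with every CDN response and do not block crawling in robots.txt, so Googlebot can actually see the header), here is a minimal sketch of what that could look like at the origin behind a CDN. Flask is used purely for illustration; the route and directory names are hypothetical, and only the header name and value come from the discussion above.

```python
from flask import Flask, send_from_directory

app = Flask(__name__)

# Hypothetical route standing in for CDN-served user uploads.
@app.route("/v/<path:filename>")
def serve_upload(filename):
    return send_from_directory("uploads", filename)

@app.after_request
def add_robots_headers(response):
    # Ask compliant crawlers not to index this resource or follow links from it.
    # This only works if crawling is NOT blocked in robots.txt; otherwise
    # Googlebot never fetches the response and never sees the header.
    response.headers["X-Robots-Tag"] = "noindex, nofollow"
    return response

if __name__ == "__main__":
    app.run()
```

In practice the same header is usually set at the CDN or web-server layer rather than in application code; the point is simply that the directive travels with each response, so it keeps working even when a URL is discovered through an external link, which robots.txt alone cannot guarantee.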

0 Answers