Well, robots.txt prevents crawling, while the meta robots tag in HTML or the X-Robots-Tag HTTP header prevents indexing (among the other directives they support).
So even when a URL is disallowed in robots.txt, it can still be indexed by Google if it is linked from somewhere with an anchor tag.
So my question is: how can I prevent a public CDN URL from being indexed by Google?
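To make the crawl/index distinction concrete, here is a minimal Python sketch. It assumes a hypothetical CDN host (`cdn.example.com`) and a hypothetical robots.txt body; `may_crawl` and `may_index` are illustrative helper names, not part of any crawler. The header check is simplified and ignores user-agent-prefixed directives such as `googlebot: noindex`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a CDN host that blocks all crawling.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def may_index(headers: dict) -> bool:
    """Return False if an X-Robots-Tag header carries noindex (or none)."""
    value = ""
    for name, header_value in headers.items():
        if name.lower() == "x-robots-tag":
            value = header_value
    directives = {d.strip().lower() for d in value.split(",")}
    # "none" is shorthand for "noindex, nofollow".
    return not ({"noindex", "none"} & directives)


url = "https://cdn.example.com/img/photo.jpg"  # hypothetical URL
print(may_crawl(ROBOTS_TXT, "Googlebot", url))
print(may_index({"X-Robots-Tag": "noindex, nofollow"}))
```

The two checks are independent: a crawler that obeys the `Disallow` rule never requests the URL, so it never gets to see a `noindex` header on the response. That is why a disallowed URL can still show up as a bare link in results.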
For example, I upload an image to Facebook and keep it private. The CDN URL holding the data is still public, so anyone with the complete link can access the content. How does Facebook prevent these URLs from being indexed?
Example textual URL: https://scontent.fmaa1-2.fna.fbcdn.net/v/t1.0-9/21616414_867163746798211_8462429064810946636_n.jpg?oh=8964e784c64486e307a0fac58e66d79a&oe=5A3C658E
Now I have posted the above URL here. Will Google index this URL? Update: Yes, it is indexed as text content.
How about this anchor URL: FBCDNLINK.
Will it be indexed if it is referenced this way?
Update:
When I say "link indexing", I'm not talking about this kind of content search indexing but about this kind of link indexing.
Example: This URL is disallowed in the site's robots.txt, but you can see that it is indexed anyway.
My question is: how does Facebook prevent the FBCDN URL above from being indexed when you search like this in Google: site:fbcdn.net inurl:21616414_867163746798211_8462429064810946636_n.jpg
This didn't work, and my question is how Facebook does this.
Reference 1: https://www.youtube.com/watch?v=KBdEwpRQRD0
Reference 2: https://support.google.com/webmasters/answer/6062608?hl=en
Reference 3: http://tools.seobook.com/robots-txt/ (Check the table at bottom)
example.com cannot tell Google not to index webmasters.stackexchange.com. There is no robots.txt on fbcdn.net; it redirects to facebook.com, so it does not have one. https://scontent.fmaa1-2.fna.fbcdn.net/robots.txt is set to allow... there's not much of a question here. – Simon Hayter Sep 21 '17 at 11:28
scontent.fmaa1-2.fna.fbcdn.net/robots.txt is set to disallow. https://www.google.co.in/search?q=site:fbcdn.net+inurl:jpg&num=100&filter=0&biw=1535&bih=810 – verstappen_doodle Sep 21 '17 at 11:30
[...] https:// to all the URLs in my question? I'm unable to do it as it requires reputation. The question is not meaningful without valid anchor markdown. – verstappen_doodle Sep 21 '17 at 11:35
[...] site: and only includes "A description for this result is not available because of this site's robots.txt" when few real results are available. The only reason it displays that image you linked is that the image is not JPG; it is .HTML if you look more closely. There is no way to prevent Google, especially on a domain that you don't own or control. No keyword or filename search will ever reveal your images... – Simon Hayter Sep 21 '17 at 11:38
"A description for this result is not available because of this site's robots.txt" implies that the URL has been discovered but not crawled. If you embed an element on a page somewhere, it's hardly secret and Google will discover it. If images or other resources are sensitive, then you should use authentication to block the resource from being accessed. Keyword searches will never return an "A description for this result is not available because of this site's robots.txt" result. But unless your customer doesn't care about his/her rankings, you should never block a CDN in any case. – Simon Hayter Sep 21 '17 at 11:46
X-Robots-Tag: noindex, nofollow is more reliable than robots.txt and should remove the URLs from the index. – Simon Hayter Sep 21 '17 at 11:54
[...] X-Robots-Tag: noindex, nofollow on the header status of the CDN, and on the public website you can use X-Robots-Tag: noimageindex – Simon Hayter Sep 21 '17 at 11:57
[...] robots.txt and X-Robots-Tag: noindex, nofollow on the CDN; that will block resources from being indexed. But be aware that blocking resources on a CDN can and does affect SEO on the public website; don't expect to be rewarded for content that Google cannot scan. – Simon Hayter Sep 21 '17 at 11:58
X-Robots-Tag HTTP header: Use it if you need to control how non-HTML content is shown in search results (or to make sure that it's not shown). So you could remove the robots.txt and let Google find the noindex; Google will then crawl but not index. But since images do not link from one to another, there is no crawling. Only JS/HTML/PHP can be crawled, but again, no indexing will occur on the CDN. – Simon Hayter Sep 21 '17 at 12:04
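The suggestion in the comments, sending X-Robots-Tag: noindex, nofollow from the CDN, could look roughly like the following Python WSGI sketch. This is an assumption about how one might wire it up on a server you control, not how Facebook does it. `add_noindex_header` and `image_app` are hypothetical names, and `image_app` merely stands in for whatever actually serves the files.

```python
def add_noindex_header(app, value="noindex, nofollow"):
    """Wrap a WSGI app so every response carries an X-Robots-Tag header."""

    def wrapped(environ, start_response):
        def start_with_header(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so only one value is sent.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", value))
            return start_response(status, headers, exc_info)

        return app(environ, start_with_header)

    return wrapped


def image_app(environ, start_response):
    """Hypothetical stand-in for the application that serves CDN files."""
    start_response("200 OK", [("Content-Type", "image/jpeg")])
    return [b"...image bytes..."]


# Every response from cdn_app now includes the noindex directive.
cdn_app = add_noindex_header(image_app)
```

For a quick local try, `cdn_app` can be passed to `wsgiref.simple_server.make_server`. As the last comment notes, the header only helps if the crawler is allowed to fetch the URL, so it should not be combined with a robots.txt `Disallow` for the same paths.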