In a WordPress Multisite installation, I would like to block indexing of a specific PDF file by sending an X-Robots-Tag HTTP response header via the .htaccess file.

None of the suggestions found in previous related questions (e.g. this one) worked in my case.

I followed those instructions, but every check and every online checker fails to show the headers I need.

These are my .htaccess configs: [screenshot of .htaccess configs]

These are the actual headers shown in the browser inspector: [screenshot]

bobrock4
  • Which .htaccess file did you put that config in? The document root? wp-content/? wp-content/uploads/? – Stephen Ostermiller Dec 01 '20 at 13:57
  • I'm not sure I trust the browser inspector to show headers. I've never seen headers embedded in HTML like that before. Even if they are legit, it could be using a cached copy from before you added the htaccess rule. I tend to test using curl on the command line like curl --head http://example.com/xxxxxxx/wp-content/uploads/sites/7/2020/11/Xxxxxxxxxxxxx.pdf – Stephen Ostermiller Dec 01 '20 at 14:00
  • Also, did you disallow the pdf in robots.txt? If so, it doesn't matter what your headers are because Googlebot will never crawl it and see the headers. – Stephen Ostermiller Dec 01 '20 at 14:03
  • @StephenOstermiller the htaccess file is the one in the directory of the multisite wp installation. I mean it is in /nameoffolder/ while the pdf file is in /nameoffolder/wp-content/uploads/sites/...... – bobrock4 Dec 01 '20 at 15:37
  • @StephenOstermiller I've just tried with curl as well, obtaining the same headers – bobrock4 Dec 01 '20 at 15:51
  • "I've never seen headers embedded in HTML like that before." - This would seem to be the "HTML output (Elements)" when "inspecting" the PDF in Chrome's built-in PDF viewer. I think this output is "unique"(?) to the PDF viewer (or plugins?). It would be preferable to use the "Network" tab in the Browser Inspector and refresh the content with the Inspector open (and cache disabled). – MrWhite Dec 01 '20 at 15:51
  • @StephenOstermiller yes, I already disallowed the pdf in robots.txt – bobrock4 Dec 01 '20 at 16:01
  • @bobrock4 You should not disallow the pdf in robots.txt (which is basically what Stephen is saying). – MrWhite Dec 01 '20 at 16:15

1 Answer

The <Files> directive applies only to filenames, not file paths, so your <Files> directive will never match and the header will never be set.
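For instance, a path-based directive like the following (a hypothetical reconstruction of the failing config, with a made-up path) never triggers, because <Files> compares its argument against the filename alone:

<Files "/wp-content/uploads/sites/7/2020/11/example.pdf">
# Never matches: <Files> sees only the filename "example.pdf", never the full path
Header set X-Robots-Tag "noindex, nofollow"
</Files>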

To set this header on one specific file (and not on all .pdf files, as in the linked question/answers) from the root .htaccess file, you can set an environment variable when that file is requested and conditionally set the header based on that env var.

For example:

# Set the NOINDEX env var when the request URI matches this file
SetEnvIf Request_URI "/path/to/example.pdf" NOINDEX
# Send the X-Robots-Tag header only for flagged requests
Header set X-Robots-Tag "noindex, nofollow" env=NOINDEX
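Note that SetEnvIf treats the Request_URI pattern as a regular expression, so for a stricter match you may want to escape the literal dot and anchor the end of the pattern. A sketch, assuming a hypothetical multisite upload path:

# Hypothetical path - adjust to the real location of the PDF
SetEnvIf Request_URI "/wp-content/uploads/sites/7/2020/11/example\.pdf$" NOINDEX
Header set X-Robots-Tag "noindex, nofollow" env=NOINDEX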

Alternatively, if you can place an additional .htaccess file in the directory that contains the PDF you want to target, then you can use a <Files> directive in that .htaccess file:

<Files "example.pdf">
Header set X-Robots-Tag "noindex, nofollow"
</Files>

You could use this same method in the root .htaccess file, but it would then add the header to every request for a file named example.pdf anywhere on the system. That said, it's probably unlikely you have more than one file with that name, so this may be the better solution after all.
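Whichever method you use, you can confirm the header is actually being sent with curl, as suggested in the comments (hypothetical URL):

curl --head https://example.com/wp-content/uploads/sites/7/2020/11/example.pdf

The response should include an X-Robots-Tag: noindex, nofollow line; if it doesn't, check which .htaccess file is being processed and make sure you're not seeing a cached copy from before the rule was added.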

MrWhite
  • Could the .htaccess file in that folder contain only the <Files>...</Files> instruction? Anything else? – bobrock4 Dec 01 '20 at 16:17
  • Yes, just the <Files>....</Files> container as mentioned above. I've removed the env= argument from that header directive - sorry, copy/paste error! – MrWhite Dec 01 '20 at 16:37