5

I'm going to block all bots except the big search engines. One of my blocking methods is to check for the Accept-Language header: if a request has no Accept-Language header, the client's IP address will be blocked until 2037. Googlebot does not send Accept-Language, so I want to verify it with a DNS lookup:

<?php
$hostname = gethostbyaddr($_SERVER['REMOTE_ADDR']);
?>

Is it OK to use gethostbyaddr(), and can someone get past my "gethostbyaddr protection"?
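The Accept-Language screen described in the question could be sketched like this (the blocking decision as a pure function; the `$isVerifiedBot` flag stands in for whatever search-engine verification is settled on, so the names here are illustrative):

```php
<?php
// Block a client when it sends no Accept-Language header and is not a
// verified search-engine bot. A pure function keeps the policy testable.
function shouldBlock(?string $acceptLanguage, bool $isVerifiedBot): bool
{
    $hasLanguage = $acceptLanguage !== null && $acceptLanguage !== '';
    return !$hasLanguage && !$isVerifiedBot;
}

// Ordinary clients without the header get blocked; verified bots are exempt.
var_dump(shouldBlock(null, false));    // bool(true)
var_dump(shouldBlock('en-US', false)); // bool(false)
var_dump(shouldBlock(null, true));     // bool(false)
```

In a request handler the header value would come from `$_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? null`.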

Ravindra S
ilhan
  • Sure -- DNS poisoning. The other concern is probably the robustness of your "white list" checking. Is "google" in the response good enough -- or do you actually check that the domain's suffix is ".google.com" (and is that even a valid test)? And do you care about blocking everyone in the event your DNS goes down, times out, etc.? – opello Jun 20 '10 at 01:39
  • Reverse DNS lookups alone don't give any protection; I can configure whatever name I want. – MrTux May 23 '16 at 09:13
  • @opello DNS poisoning is difficult: it can take hours, its success is not guaranteed, and it requires knowing the nameservers used by the victim. All an attacker needs to do is set a reverse host; there is no need for low-level protocol attacks like DNS poisoning. – John Jan 27 '20 at 00:35

5 Answers

4
function detectSearchBot($ip, $agent, &$hostname)
{
    $hostname = $ip;

    // check HTTP_USER_AGENT first so we don't call gethostbyaddr() in vain
    if (preg_match('/(?:google|yandex)bot/iu', $agent)) {
        // on success gethostbyaddr() returns the host name;
        // on failure it returns the IP unchanged, or false
        $hostname = gethostbyaddr($ip);

        // https://support.google.com/webmasters/answer/80553
        if ($hostname !== false && $hostname != $ip) {
            // detect Google and Yandex search bots by domain suffix
            if (preg_match('/\.((?:google(?:bot)?|yandex)\.(?:com|ru))$/iu', $hostname)) {
                // forward-resolve the host name and require that it maps
                // back to the original IP (gethostbyname() returns its
                // argument unchanged on failure)
                if (gethostbyname($hostname) === $ip) {
                    return true;
                }
            }
        }
    }

    return false;
}

In my project, I use this function to identify the Google and Yandex search bots.

The result of detectSearchBot is cached.

The algorithm is based on Google's recommendation: https://support.google.com/webmasters/answer/80553
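One way the caching mentioned above might look (a hypothetical wrapper with an in-process static cache; the wrapper name and the injected detector are illustrative, and persistent caching would need APCu, Redis, or similar):

```php
<?php
// Hypothetical wrapper: memoize bot-detection results per IP for the
// lifetime of the process, so repeated hits avoid extra DNS lookups.
function detectSearchBotCached(string $ip, string $agent, callable $detector): bool
{
    static $cache = [];
    if (!array_key_exists($ip, $cache)) {
        $cache[$ip] = $detector($ip, $agent);
    }
    return $cache[$ip];
}

// Example with a stand-in detector that counts its invocations:
$calls = 0;
$stub = function ($ip, $agent) use (&$calls) { $calls++; return true; };

detectSearchBotCached('66.249.66.1', 'Googlebot', $stub);
detectSearchBotCached('66.249.66.1', 'Googlebot', $stub);
var_dump($calls); // int(1) -- the second call hit the cache
```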

Worka
  • Welcome to SO! Even if your answer is right, try to explain it a little. In your case, besides, there are already other answers, so you should expose the pros and cons of your point of view. – David García Bodego Oct 24 '19 at 06:12
3

In addition to Cristian's answer:

function is_valid_google_ip($ip) {

    $hostname = gethostbyaddr($ip); //"crawl-66-249-66-1.googlebot.com"

    // require a .googlebot.com or .google.com suffix (the grouping
    // parentheses matter: without them the pattern would also accept
    // e.g. "evilgoogle.com")
    if (!preg_match('/\.(googlebot|google)\.com$/i', $hostname)) {
        return false;
    }

    // confirm the name resolves back to the same IP, since a reverse
    // record alone can be set to anything
    return gethostbyname($hostname) === $ip;
}

function is_valid_google_request($ip = null, $agent = null) {

    if (is_null($ip)) {
        $ip = $_SERVER['REMOTE_ADDR'];
    }

    if (is_null($agent)) {
        $agent = $_SERVER['HTTP_USER_AGENT'];
    }

    return strpos($agent, 'Google') !== false && is_valid_google_ip($ip);
}

Note

Sometimes when using $_SERVER['HTTP_X_FORWARDED_FOR'] or $_SERVER['REMOTE_ADDR'], more than one IP address is returned, for example '155.240.132.261, 196.250.25.120'. When such a string is passed as an argument to gethostbyaddr(), PHP emits the following error:

Warning: Address is not a valid IPv4 or IPv6 address in...

To work around this, I use the following code to extract the first IP address from the string and discard the rest. (If you wish to use the other IPs, they will be in the remaining elements of the $ips array.)

if (strstr($remoteIP, ', ')) {
    $ips = explode(', ', $remoteIP);
    $remoteIP = $ips[0];
}

https://www.php.net/manual/en/function.gethostbyaddr.php

Syscall
RafaSashi
  • So all I have to do is set my reverse DNS to anything.googlebot and put up a referer as "Google" and you'll verify me as google ? – John Jan 27 '20 at 00:31
  • I don't know how easy it is to do that, but you can try and let me know if the method can be improved. In the meantime you can find answers here: https://stackoverflow.com/a/5092951/2456038 – RafaSashi Jan 27 '20 at 01:26
  • It's very easy to do that if you own an IP or rent a server; a couple of seconds' work for an administrator. For security purposes you need to do a gethostbyname() after gethostbyaddr() and check whether the hostname resolves back to the same IP. Only then do you know its origin is very likely Microsoft/Google/etc. Your regex is missing ( and ) around the two domains; it only anchors the beginning of googlebot and the end of google.com. – John Jan 27 '20 at 05:29
3

The way recommended by Google is to do a reverse DNS lookup (gethostbyaddr) to get the associated host name, AND then resolve that name back to an IP (gethostbyname) and compare it to REMOTE_ADDR (because reverse lookups can be faked, too).

But beware: DNS lookups take time and can severely slow down your webpage (consider checking the user agent first).
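A minimal sketch of that two-step (forward-confirmed reverse DNS) check, with the suffix test split out so it can be exercised without network access (function names are illustrative):

```php
<?php
// True if the hostname ends in .googlebot.com or .google.com
function hasGoogleSuffix(string $host): bool
{
    return (bool) preg_match('/\.(googlebot|google)\.com$/i', $host);
}

// Reverse-resolve the IP, check the suffix, then forward-resolve the
// name and require that it maps back to the original IP.
function isVerifiedGooglebot(string $ip): bool
{
    $host = gethostbyaddr($ip); // returns the IP unchanged on failure
    if ($host === false || $host === $ip || !hasGoogleSuffix($host)) {
        return false;
    }
    // gethostbyname() returns its argument unchanged on failure
    return gethostbyname($host) === $ip;
}
```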

Google also publishes a machine-readable file containing the IP addresses of its crawlers.
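As a sketch of using such a published list once fetched (the CIDR helper below is IPv4-only and written for illustration; a real deployment would download and cache the file rather than hard-code ranges):

```php
<?php
// Check whether an IPv4 address falls inside a CIDR range.
function ip_in_cidr(string $ip, string $cidr): bool
{
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

// Check an IP against a list of crawler ranges, e.g. the ipv4Prefix
// entries from Google's published crawler file.
function is_crawler_ip(string $ip, array $ranges): bool
{
    foreach ($ranges as $cidr) {
        if (ip_in_cidr($ip, $cidr)) {
            return true;
        }
    }
    return false;
}

// 66.249.64.0/19 is a well-known Googlebot range (it matches the
// crawl-66-249-66-1.googlebot.com hostname seen earlier in this thread)
var_dump(is_crawler_ip('66.249.66.1', ['66.249.64.0/19'])); // bool(true)
var_dump(is_crawler_ip('8.8.8.8', ['66.249.64.0/19']));     // bool(false)
```

The bit arithmetic assumes a 64-bit PHP build.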


MrTux
  • Google also posted this information here: https://support.google.com/webmasters/answer/80553?hl=en. Now, an interesting case has started to appear for me. I'm being visited by bots identifying themselves as Googlebot, but when doing a host lookup on the IP address it doesn't report what Google says `Host x.x.x.x.in-addr.arpa. not found: 3(NXDOMAIN)` yet when doing a WHOIS lookup the IP address belongs to Google. Could that be a Google customer running something on App Engine? – Htbaa Jan 17 '19 at 10:31
  • @Htbaa Yes, I have a similar situation. Somebody on Google servers in Belgium impersonates Googlebot using the Googlebot user agent. Fortunately, Google published the range of possible Googlebot IPs, so it is now possible to run fast verification. See the list here: https://developers.google.com/search/apis/ipranges/googlebot.json The link to this list is published here: https://developers.google.com/search/docs/advanced/crawling/verifying-googlebot – Robert May 13 '22 at 10:00
1
//The function
function is_google() {
    // strpos() can return 0 (a match at position 0), so compare
    // strictly against false
    return strpos($_SERVER['HTTP_USER_AGENT'], "Googlebot") !== false;
}
Cristian
  • "Googlebot" doesn't mean that it is the real Googlebot. – ilhan Jun 20 '10 at 01:31
  • Of course not, but it's not a big deal after all... what can a user who fakes the user agent do? Maybe create a Google clone; yeah, that would be a nice project. – Cristian Jun 20 '10 at 01:38
  • No big deal. All they can do is crawl your entire site, regurgitate it with better SEO than yours (since they've honed how to rank w/o having to worry about details like quality content), then use their link farm w/ high PR to compete w/ you in Google ranking, on your own site content. – joedevon Oct 23 '10 at 08:31
0

How to verify Googlebot.

Marcel Korpel
  • In fact, that's a better method than mine. That's what I love about SO... you learn something every day. Thanks! – Cristian Jun 20 '10 at 02:12
  • @Christian – To be frank, I think yours is good enough. The price of a false-positive is very low, I think. I'm more worried about false-negatives in this case: ordinary people with a UA that somehow doesn't send an `Accept-Language` header (don't ask me which; a quick test revealed that curl doesn't send one). – Marcel Korpel Jun 20 '10 at 02:22
  • This is a "link-only answer", and the target page doesn't mention how to do this in PHP, as per the question's tags. – ashleedawg Aug 14 '19 at 11:41