<?php

/**
 * Check if the given user agent string is one of a crawler, spider, or bot.
 *
 * @param string $user_agent
 *   A user agent string (e.g. Googlebot/2.1 (+http://www.google.com/bot.html)).
 *
 * @return bool
 *   TRUE if the user agent is a bot, FALSE if not.
 */
function smart_ip_detect_crawler($user_agent) {
  // Use a lowercase string for comparison.
  $user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);

  // A list of some common words used only for bots and crawlers.
  $bot_identifiers = array(
    'bot',
    'slurp',
    'crawler',
    'spider',
    'curl',
    'facebook',
    'fetch',
  );

  // See if one of the identifiers is in the UA string.
  foreach ($bot_identifiers as $identifier) {
    if (strpos($user_agent, $identifier) !== FALSE) {
      return TRUE;
    }
  }
  return FALSE;
}
@FarrisFahad the big problem with this code is that it does not verify the bot as legitimate.
People can just spoof the User-Agent.
For example, Facebook's crawler currently sends these headers:
User-Agent: cortex/1.0
X-FB-HTTP-Engine: Liger
X-FB-Client-IP: True
X-FB-Server-Cluster: True
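Since headers like these can also be faked, any check built on them is only a heuristic. A minimal sketch of matching on them (the function name is mine, and the `X-FB-HTTP-Engine: Liger` value is just what's observed above, not a documented contract):

```php
<?php

// Heuristic check for Facebook's fetcher based on the headers quoted above.
// Both the header name and the 'Liger' value are observations, not a
// guarantee from Facebook, and they can be spoofed like any other header.
function looks_like_facebook_fetch(array $headers) {
  return isset($headers['X-FB-HTTP-Engine'])
      && $headers['X-FB-HTTP-Engine'] === 'Liger';
}
```

In practice you would pass in the request headers (e.g. from `getallheaders()`) and treat a match only as a hint, not as verification.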
What's the point of passing $user_agent as an argument and then overriding it with $user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);? I guess you don't need that argument.
Also, why name the function smart_ip_detect_crawler()? It doesn't have anything to do with IP addresses. Maybe you could name it just smart_detect_crawler().
This script is not meant to be the be-all and end-all of bot detection ;)
Hopefully it's helpful if you're working on your own system, and as @FinlayDaG33k mentioned, it's trivial to bypass. It was originally used just as a metric for how much of the traffic coming to a certain page was bots like Googlebot and Bing's bot.
"What's the point of passing $user_agent as an argument and then overriding it with $user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);? I guess you don't need that argument."
I think he just forgot to clean that up.
Checking for Googlebot via reverse DNS lookup: https://github.com/kalmargabor/crawler-check/blob/master/src/CrawlerCheck/CrawlerCheck.php
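A minimal sketch of that approach (the helper names are mine, not from the linked code): reverse-resolve the client IP, check that the hostname ends in googlebot.com or google.com, then forward-confirm that the hostname resolves back to the same IP.

```php
<?php

// Check that a reverse-resolved hostname belongs to Google's crawler domains.
// Pure string check, split out so it can be tested without network access.
function is_google_crawler_host($host) {
  return (bool) preg_match('/\.(googlebot|google)\.com$/', $host);
}

// Verify a claimed Googlebot by reverse DNS plus forward confirmation.
// Hypothetical helper, not part of the gist above; requires network access.
function verify_googlebot($ip) {
  $host = gethostbyaddr($ip);
  if ($host === FALSE || !is_google_crawler_host($host)) {
    return FALSE;
  }
  // Forward-confirm: the hostname must resolve back to the original IP,
  // otherwise anyone who controls their own PTR record could fake it.
  return gethostbyname($host) === $ip;
}
```

The forward-confirmation step is the important part: a reverse (PTR) lookup alone is not trustworthy because the IP's owner controls it.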
How reliable is this code? I've heard that bot detection can never be 100% foolproof.