@rushipkar90
Created December 13, 2015 03:32
Bots investigation
Refer: http://www.inmotionhosting.com/support/website/server-usage/identify-and-block-bad-robots-from-website
How to identify bad bots for a domain
============
cd /home/xyystgkp/access-logs
# Field 6 of a quote-split combined-format log line is the User-Agent string;
# tally how many requests each agent made.
awk -F\" '{print $6}' justforflorida.com | sort | uniq -c | sort -n
>>
36 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
71 Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
95 WordPress/3.5.1; http://justforflorida.com/florida
613 -
738 Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)
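The same tally is handy on any domain's log, so it can be wrapped in a small shell function. A minimal sketch (the top_agents name is a hypothetical helper, not part of the original gist):
top_agents() {
    # Field 6 of a quote-split combined-format log line is the User-Agent.
    awk -F\" '{print $6}' "$1" | sort | uniq -c | sort -n
}
top_agents justforflorida.com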
# List the source IPs of requests whose User-Agent is blank ("-").
awk -F\" '$6 == "-"' justforflorida.com | awk '{print $1}' | sort | uniq -c | sort -n
>>
1 46.148.22.18
2 31.187.64.239
2 54.91.137.217
613 23.23.233.205
# Inspect what user agents the top blank-UA offender is sending, then look up
# who owns the IP.
grep 23.23.233.205 justforflorida.com | awk -F\" '{print $6}' | sort | uniq -c | sort -n
whois 23.23.233.205
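whois reveals the owning network, but when a client claims to be a known crawler (like the Baiduspider entries above), a reverse DNS lookup is a quick sanity check: genuine search-engine crawlers resolve to hostnames under their operator's own domain. A minimal sketch using the standard host utility (not part of the original gist):
# Reverse lookup of the noisy IP found above; a generic cloud/VPS hostname
# rather than a crawler's own domain suggests a fake or scripted bot.
host 23.23.233.205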
============
Block a bad robot
==================
The rules below block the entire 74.125.* IP range we were seeing from accessing the example.com website, while still allowing requests whose User-Agent string mentions Google.
.htaccess file
-------------
ErrorDocument 503 "Site disabled for crawling"
RewriteEngine On
# Skip the block for any request whose User-Agent mentions Google.
RewriteCond %{HTTP_USER_AGENT} !^.*(Google).*$
# Match clients in the 74.125.* range (dots escaped: "." is a regex wildcard).
RewriteCond %{REMOTE_ADDR} ^74\.125\.
# Answer every matching request with a 503 and stop processing rules.
RewriteRule .* - [R=503,L]
-------------
# Count blocked (503) requests from 74.125.* per hour to verify the rules are
# firing; the hour is taken from the bracketed timestamp field.
grep "74.125" ~userna5/access-logs/example.com | awk '$9 == 503' \
  | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":00"}' \
  | sort -n | uniq -c | sed 's/^ *//'
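To confirm the rules keep firing after deployment, the same log can also be followed live. A minimal sketch under the same path and IP-range assumptions as above:
# Print blocked (503) requests from 74.125.* as they arrive.
tail -f ~userna5/access-logs/example.com | awk '$1 ~ /^74\.125\./ && $9 == 503'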
==================