@hans2103
Last active April 23, 2025 09:40
NGINX configuration to block bad bots. (Add Twenga|TwengaBot to the pattern if you want to exclude those too.)
if ($http_user_agent ~* (360Spider|80legs.com|Abonti|AcoonBot|Acunetix|adbeat_bot|AddThis.com|adidxbot|ADmantX|AhrefsBot|AngloINFO|Antelope|Applebot|BaiduSpider|BeetleBot|billigerbot|binlar|bitlybot|BlackWidow|BLP_bbot|BoardReader|Bolt\ 0|BOT\ for\ JCE|Bot\ mailto\:craftbot@yahoo\.com|casper|CazoodleBot|CCBot|checkprivacy|ChinaClaw|chromeframe|Clerkbot|Cliqzbot|clshttp|CommonCrawler|comodo|CPython|crawler4j|Crawlera|CRAZYWEBCRAWLER|Curious|Curl|Custo|CWS_proxy|Default\ Browser\ 0|diavol|DigExt|Digincore|DIIbot|discobot|DISCo|DoCoMo|DotBot|Download\ Demon|DTS.Agent|EasouSpider|eCatch|ecxi|EirGrabber|Elmer|EmailCollector|EmailSiphon|EmailWolf|Exabot|ExaleadCloudView|ExpertSearchSpider|ExpertSearch|Express\ WebPictures|ExtractorPro|extract|EyeNetIE|Ezooms|F2S|FastSeek|feedfinder|FeedlyBot|FHscan|finbot|Flamingo_SearchEngine|FlappyBot|FlashGet|flicky|Flipboard|g00g1e|Genieo|genieo|GetRight|GetWeb\!|GigablastOpenSource|GozaikBot|Go\!Zilla|Go\-Ahead\-Got\-It|GrabNet|grab|Grafula|GrapeshotCrawler|GTB5|GT\:\:WWW|Guzzle|harvest|heritrix|HMView|HomePageBot|HTTP\:\:Lite|HTTrack|HubSpot|ia_archiver|icarus6|IDBot|id\-search|IlseBot|Image\ Stripper|Image\ Sucker|Indigonet|Indy\ Library|integromedb|InterGET|InternetSeer\.com|Internet\ Ninja|IRLbot|ISC\ Systems\ iRc\ Search\ 2\.1|jakarta|Java|JetCar|JobdiggerSpider|JOC\ Web\ Spider|Jooblebot|kanagawa|KINGSpider|kmccrew|larbin|LeechFTP|libwww|Lingewoud|LinkChecker|linkdexbot|LinksCrawler|LinksManager\.com_bot|linkwalker|LinqiaRSSBot|LivelapBot|ltx71|LubbersBot|lwp\-trivial|Mail.RU_Bot|masscan|Mass\ Downloader|maverick|Maxthon$|Mediatoolkitbot|MegaIndex|megaindex|MFC_Tear_Sample|Microsoft\ URL\ Control|microsoft\.url|MIDown\ tool|miner|Missigua\ Locator|Mister\ PiX|mj12bot|Mozilla.*Indy|Mozilla.*NEWT|MSFrontPage|msnbot|Navroad|NearSite|NetAnts|netEstate|NetSpider|NetZIP|Net\ Vampire|NextGenSearchBot|nutch|Octopus|Offline\ Explorer|Offline\ Navigator|OpenindexSpider|OpenWebSpider|OrangeBot|Owlin|PageGrabber|PagesInventory|panopta|panscient\.com|Papa\ Foto|pavuk|pcBrowser|PECL\:\:HTTP|PeoplePal|Photon|PHPCrawl|planetwork|PleaseCrawl|PNAMAIN.EXE|PodcastPartyBot|prijsbest|proximic|psbot|purebot|pycurl|QuerySeekerSpider|R6_CommentReader|R6_FeedFetcher|RealDownload|ReGet|Riddler|Rippers\ 0|rogerbot|RSSingBot|rv\:1.9.1|RyzeCrawler|SafeSearch|SBIder|Scrapy|Screaming|SeaMonkey$|search.goo.ne.jp|SearchmetricsBot|search_robot|SemrushBot|Semrush|SentiBot|SEOkicks|SeznamBot|ShowyouBot|SightupBot|SISTRIX|sitecheck\.internetseer\.com|siteexplorer.info|SiteSnagger|skygrid|Slackbot|Slurp|SmartDownload|Snoopy|Sogou|Sosospider|spaumbot|Steeler|sucker|SuperBot|Superfeedr|SuperHTTP|SurdotlyBot|Surfbot|tAkeOut|Teleport\ Pro|TinEye-bot|TinEye|Toata\ dragostea\ mea\ pentru\ diavola|Toplistbot|trendictionbot|TurnitinBot|turnit|Twitterbot|URI\:\:Fetch|urllib|Vagabondo|vikspider|VoidEYE|VoilaBot|WBSearchBot|webalta|WebAuto|WebBandit|WebCollage|WebCopier|WebFetch|WebGo\ IS|WebLeacher|WebReaper|WebSauger|Website\ eXtractor|Website\ Quester|WebStripper|WebWhacker|WebZIP|Web\ Image\ Collector|Web\ Sucker|Wells\ Search\ II|WEP\ Search|WeSEE|Wget|Widow|WinInet|woobot|woopingbot|worldwebheritage.org|Wotbox|WPScan|WWWOFFLE|WWW\-Mechanize|Xaldon\ WebSpider|XoviBot|yacybot|Yahoo|YandexBot|Yandex|YisouSpider|zermelo|Zeus|zh-CN|ZmEu|ZumBot|ZyBorg)) {
    return 410;
}
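The same check is often written with nginx's map directive, which keeps the pattern list in the http context and only evaluates it when $bad_bot is actually used. A minimal sketch, with the pattern list deliberately shortened here for illustration:

```nginx
# http context — sketch only, pattern list shortened for illustration
map $http_user_agent $bad_bot {
    default                                    0;
    "~*(AhrefsBot|SemrushBot|mj12bot|masscan)" 1;
}

server {
    listen 80;

    if ($bad_bot) {
        return 410;
    }
}
```

This keeps the long regex in one place even when several server blocks need the same check.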
@philippeowagner

Thanks for sharing this. Works like a charm, but I would suggest using HTTP 444 instead of 410.

@kenguish

Thanks. You might want to add libcurl and libwww-perl too.

@extensionsapp

extensionsapp commented Jul 31, 2017

These are good search bots. Why are they on the list?
Yahoo|YandexBot|Yandex|Twitterbot

@imagina

imagina commented Mar 21, 2018

We found another bad bot scanning our servers: trovitBot

@dmitryd

dmitryd commented Feb 19, 2019

@extensionsapp Yandex tends to be too aggressive.



@precogtyrant

Hello,
Thanks for the code. However, the list also contains Yahoo. Does that mean the Yahoo search engine's bot? I would rather not block that one ;)

@Vish-was

Vish-was commented Sep 6, 2019

Still seeing "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" in my access.log

@Vish-was

Vish-was commented Sep 9, 2019

Hi, this doesn't work in my nginx.conf setting.
Also, I can manage to remove some bots via robots.txt:

User-agent: MJ12bot
User-agent: SemrushBot
User-agent: Yandex
User-agent: YandexBot
User-agent: UptimeRobot
User-agent: AhrefsBot
User-agent: GoogleBot
User-agent: BingBot
Disallow: /

but some, like GoogleBot and BingBot, are still showing up.

@dmhendricks

Still seeing "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" in my access.log

access_log off;
return 444;

@qaisjp

qaisjp commented Apr 29, 2020

Still seeing "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" in my access.log

access_log off;
return 444;

If it says:

nginx: [emerg] "access_log" directive is not allowed here

Put the if block inside your location directive, as per https://nginx.org/en/docs/http/ngx_http_log_module.html#access_log:

Context: http, server, location, if in location, limit_except
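Putting the two suggestions together, a minimal sketch of the block at location level (server name hypothetical, pattern list shortened for illustration):

```nginx
server {
    listen 80;
    server_name example.com;   # hypothetical

    location / {
        # "if in location" context: both directives below are allowed here
        if ($http_user_agent ~* (mj12bot|SemrushBot|AhrefsBot)) {
            access_log off;    # don't log the blocked request
            return 444;       # close the connection without sending a response
        }
    }
}
```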

@Inner-Creator

Hey guys, how about a piece of code that allows the bots listed in the .htaccess file to crawl my website and blocks all other bots that are not listed in the file? Is that even possible?

@hans2103
Author

hans2103 commented Jul 4, 2023

@Small-Being change to a logical if-statement that checks if-not-in-list instead of if-in-list.
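A sketch of that inverted check in nginx (the allowlist here is illustrative only; note that user agents are trivially spoofed, so an allowlist should ideally be paired with reverse-DNS verification of the claimed crawler):

```nginx
# Block every client whose user agent does NOT match the allowlist.
# "Mozilla" is included so that ordinary browsers keep working.
if ($http_user_agent !~* (Googlebot|bingbot|DuckDuckBot|Mozilla)) {
    return 444;
}
```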

@Inner-Creator

@hans2103 Thanks for the solution. It would be great if you could type out the piece of code I should apply in my WP .htaccess file. TIA 👍

@qaisjp

qaisjp commented Jul 4, 2023

This gist is about nginx. If your WordPress instance makes use of .htaccess files, that's a different technology called Apache HTTP Server, sorry.

@Devastatia

Here are some I block. Some may be duplicates of what you already have.

A lot of homebrew crawlers running on EC2 and other cloud hosts use HeadlessChrome.

SummalyBot, Mastodon, and Misskey are used to create a link preview when a user posts a link on a Mastodon instance. That wouldn't be so bad, except they send 200+ bots at the same time to verify one link.

facebookexternalhit is used for the same thing. I'm banned from Faecesbook, so I block their bot. 🤷

ByteSpider may be a legit search-engine crawler, but it is run by the largest AI firm in China. These firms scrape websites for content to train AIs, which is IP theft IMO.

'HeadlessChrome',
'trendiction.de',
'Bytespider',
'ahrefs',
'okhttp',
'SemrushBot',
'cpp-httplib',
'aiohttp',
'Go-http-client',
'Ruby',
'curl',
'python-requests',
'facebookexternalhit',
'DataForSeoBot',
'Python',
'Mastodon',
'SummalyBot',
'got',
'Misskey',
'IonCrawl',
't3versions',
'Dataprovider.com'
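Translated into the gist's nginx idiom, that list becomes a single alternation (literal dots escaped; a sketch only — beware that short entries such as got, curl, and Ruby also match as substrings inside longer user agents):

```nginx
if ($http_user_agent ~* (HeadlessChrome|trendiction\.de|Bytespider|ahrefs|okhttp|SemrushBot|cpp-httplib|aiohttp|Go-http-client|Ruby|curl|python-requests|facebookexternalhit|DataForSeoBot|Python|Mastodon|SummalyBot|got|Misskey|IonCrawl|t3versions|Dataprovider\.com)) {
    return 444;
}
```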
