
@eliasdabbas
Last active November 11, 2024 21:37
Get the most up-to-date list of IP addresses for the crawler bots belonging to Google and Bing.
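import ipaddress
import requests
import pandas as pd


def bot_ip_addresses():
    """Return a DataFrame with one row per IPv4 address in the published Googlebot and Bingbot ranges."""
    bots_urls = {
        'google': 'https://developers.google.com/search/apis/ipranges/googlebot.json',
        'bing': 'https://www.bing.com/toolbox/bingbot.json'
    }
    ip_addresses = []
    for bot, url in bots_urls.items():
        bot_resp = requests.get(url)
        # Each JSON file lists its ranges under the 'prefixes' key
        for iprange in bot_resp.json()['prefixes']:
            network = iprange.get('ipv4Prefix')
            # Entries without an 'ipv4Prefix' key (for example IPv6 ranges) are skipped
            if network:
                # Expand the CIDR block into its individual addresses
                ip_list = [(bot, str(ip)) for ip in ipaddress.IPv4Network(network)]
                ip_addresses.extend(ip_list)
    return pd.DataFrame(ip_addresses, columns=['bot_name', 'ip_address'])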
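
To check whether a single visitor IP belongs to a crawler range, expanding every range into individual addresses is not strictly necessary. As a minimal sketch, assuming the prefixes have already been collected as CIDR strings, membership can be tested directly with the standard `ipaddress` module. The helper name `ip_in_networks` and the sample prefixes (reserved documentation ranges, not real bot ranges) are hypothetical.

```python
import ipaddress


def ip_in_networks(ip, networks):
    """Return True if `ip` falls inside any of the given CIDR prefixes."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(net) for net in networks)


# Hypothetical sample prefixes for illustration only, not a real bot list
sample_prefixes = ['192.0.2.0/24', '198.51.100.0/28']

print(ip_in_networks('192.0.2.17', sample_prefixes))
print(ip_in_networks('203.0.113.5', sample_prefixes))
```

This avoids holding every address in memory, at the cost of looping over the prefixes on each lookup.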
@kstubs

kstubs commented Apr 25, 2023

Have you discovered any .json resources for other search engines, like DuckDuckGo or Yahoo?
I appreciate your effort here. Even though I'm a C# guy, I'm looking for anyone doing something similar to what I need, which is a bot whitelist.

@eliasdabbas
Author

@kstubs
I've tried to do the same thing, but couldn't find a similar resource where the IPs get updated and you can simply fetch the latest list. The closest I found is this help page:

https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot

I just tried and the IPs can be scraped with this simple code:

import requests
from bs4 import BeautifulSoup
resp = requests.get('https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/')

soup = BeautifulSoup(resp.text, 'lxml')

ddg_ip_list = [x.text for x in soup.select('.content li')]
ddg_ip_list
['20.191.45.212',
 '40.88.21.235',
 '40.76.173.151',
 '40.76.163.7',
 '20.185.79.47',
 '52.142.26.175',
 '20.185.79.15',
 '52.142.24.149',
 '40.76.162.208',
 '40.76.163.23',
 '40.76.162.191',
 '40.76.162.247']

Happy to look into others to make the list as comprehensive and up-to-date as possible.

Thanks!
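
Because the scrape above depends on the page's HTML, the `.content li` selector could also pick up list items that are not IP addresses if the layout changes. A minimal sketch of a sanity check, assuming the scraped values arrive as plain strings, is to keep only entries that parse as IP addresses. The function name `clean_ip_list` and the sample input are hypothetical.

```python
import ipaddress


def clean_ip_list(candidates):
    """Keep only strings that parse as IP addresses, dropping duplicates."""
    valid = []
    seen = set()
    for text in candidates:
        try:
            ip = str(ipaddress.ip_address(text.strip()))
        except ValueError:
            # Not an IP address (for example a navigation list item)
            continue
        if ip not in seen:
            seen.add(ip)
            valid.append(ip)
    return valid


# Hypothetical scraped values, including noise and a duplicate
scraped = ['20.191.45.212', ' 40.88.21.235 ', 'Not an IP', '20.191.45.212']
print(clean_ip_list(scraped))
```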

@kstubs

kstubs commented Apr 26, 2023

I've created a DuckDuckGo prefixes file here:
https://jsoneditoronline.org/#left=cloud.511273c830ca42a488778345c096f6a5
Unfortunately I do not see a way to grab this content programmatically from this site, but you can at least consume it and use it locally.
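
For anyone who prefers to build such a file locally, here is a rough sketch that writes individual IPs in the same `prefixes` / `ipv4Prefix` layout that the gist code reads from the Google and Bing files. The exact shape of the file linked above is not shown here, so treating each address as a `/32` prefix is an assumption, and `to_prefixes_json` is a hypothetical helper name.

```python
import json


def to_prefixes_json(ips):
    """Serialize single IPs as /32 entries under a 'prefixes' key."""
    # Assumption: each address becomes a one-address CIDR block
    prefixes = [{'ipv4Prefix': f'{ip}/32'} for ip in ips]
    return json.dumps({'prefixes': prefixes}, indent=2)


# Two of the DuckDuckBot addresses listed earlier in the thread
ddg_ips = ['20.191.45.212', '40.88.21.235']
print(to_prefixes_json(ddg_ips))
```

A file in this shape could then be parsed by the same loop that handles the Google and Bing ranges, since a `/32` network expands to exactly one address.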

@eliasdabbas
Author

That's cool.

The code I shared (or any equivalent in another language) can be used to grab the content programmatically from the page where the IPs are listed.

@kstubs

kstubs commented Jul 1, 2023

Nice! I'll consider scraping that page as well.

@johnmurch

You might also find this repo useful: https://github.com/AnTheMaker/GoodBots

@eliasdabbas
Author

@johnmurch
Interesting. Keeping and updating a static list of IPs can be another useful approach.
Thanks for sharing!
