import ipaddress
import requests
import pandas as pd


def bot_ip_addresses():
    # Official, regularly updated IP range files published by each engine.
    bots_urls = {
        'google': 'https://developers.google.com/search/apis/ipranges/googlebot.json',
        'bing': 'https://www.bing.com/toolbox/bingbot.json'
    }
    ip_addresses = []
    for bot, url in bots_urls.items():
        bot_resp = requests.get(url)
        for iprange in bot_resp.json()['prefixes']:
            # Only expand IPv4 ranges; IPv6 entries use the 'ipv6Prefix' key.
            network = iprange.get('ipv4Prefix')
            if network:
                ip_list = [(bot, str(ip)) for ip in ipaddress.IPv4Network(network)]
                ip_addresses.extend(ip_list)
    return pd.DataFrame(ip_addresses, columns=['bot_name', 'ip_address'])
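A minimal usage sketch (the log file name and its 'client_ip' column are hypothetical, only to show how the returned frame might be used):

bot_ips = bot_ip_addresses()

# Flag requests in a (hypothetical) access log that come from known bot IPs.
log_df = pd.read_csv('access_log.csv')
log_df['is_known_bot'] = log_df['client_ip'].isin(bot_ips['ip_address'])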
@kstubs
I've tried to do the same thing, but couldn't find similar sources where the IPs get updated and you can simply fetch the latest list:
- Yandex: It seems you have to run a reverse DNS lookup on the IP address to make sure it's theirs (see the sketch after this list)
- Yahoo: uses Bing
- DDG: They seem to have a list in text format here
https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot
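For Yandex, a minimal sketch of that reverse-DNS check (forward-confirmed reverse DNS; the accepted domain suffixes are based on Yandex's published guidance, so double-check them):

import socket

def is_yandex_ip(ip):
    try:
        # Reverse lookup: the hostname should end with a Yandex domain.
        host = socket.gethostbyaddr(ip)[0]
        if not host.endswith(('.yandex.ru', '.yandex.net', '.yandex.com')):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False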
I just tried, and the DuckDuckGo IPs can be scraped with this simple code:
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/')
soup = BeautifulSoup(resp.text, 'lxml')
# The IPs are listed as <li> items inside the page's main content area.
ddg_ip_list = [x.text for x in soup.select('.content li')]
ddg_ip_list
['20.191.45.212',
'40.88.21.235',
'40.76.173.151',
'40.76.163.7',
'20.185.79.47',
'52.142.26.175',
'20.185.79.15',
'52.142.24.149',
'40.76.162.208',
'40.76.163.23',
'40.76.162.191',
'40.76.162.247']
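If it helps, here's a minimal sketch of putting that list into the same shape as the DataFrame returned by bot_ip_addresses() above (the 'duckduckgo' label is just a chosen name):

import pandas as pd

ddg_df = pd.DataFrame({'bot_name': 'duckduckgo', 'ip_address': ddg_ip_list})
# Stack it under the Google/Bing rows so everything lives in one frame.
all_bots = pd.concat([bot_ip_addresses(), ddg_df], ignore_index=True)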
Happy to look into others to make the list as comprehensive and up-to-date as possible.
Thanks!
I've created a DuckDuckGo prefixes file here:
https://jsoneditoronline.org/#left=cloud.511273c830ca42a488778345c096f6a5
Unfortunately I do not see a way to grab this content programmatically from this site, but you can at least consume it and use it locally.
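For local use, a rough sketch of consuming such a file (assuming you've saved it as duckduckgo.json and that it mirrors the googlebot.json layout with a 'prefixes' list of 'ipv4Prefix' entries — adjust the keys if your file differs):

import json
import ipaddress

with open('duckduckgo.json') as f:
    ddg_prefixes = json.load(f)

# Expand each IPv4 prefix into individual addresses, tagged with the bot name.
ddg_ips = [
    ('duckduckgo', str(ip))
    for prefix in ddg_prefixes['prefixes']
    if prefix.get('ipv4Prefix')
    for ip in ipaddress.IPv4Network(prefix['ipv4Prefix'])
]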
That's cool.
The code I shared (or an equivalent in another language) can be used to grab the content programmatically from the page where the IPs are listed.
Nice! I'll consider scraping that page as well.
Might also find this repo of use https://github.com/AnTheMaker/GoodBots
@johnmurch
Interesting. Keeping and updating a static list of IPs can be another useful approach.
Thanks for sharing!
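A rough sketch of pulling a static list like that (the raw all.ips path is an assumption — check the GoodBots repo for its actual file layout):

import requests

# Hypothetical raw file path; verify against the repo before relying on it.
url = 'https://raw.githubusercontent.com/AnTheMaker/GoodBots/main/all.ips'
resp = requests.get(url)
good_bot_ips = [line.strip() for line in resp.text.splitlines() if line.strip()]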
Have you discovered any .json resources for other search engines like DuckDuckGo or Yahoo?
I appreciate your effort here. Even though I'm a C# guy, I'm just looking for anyone else doing something similar to what I need, which is a bot whitelist.