@DerekHawkins
Created July 28, 2021 17:44
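import socket
import pandas as pd

log_file = pd.read_pickle('log.pkl')

def reverse_dns(ip):
    # Reverse DNS lookup; returns None when the IP has no PTR record.
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None

# Store hostnames in a new column so the rest of the log (e.g. user_agent) is kept.
log_file['hostname'] = log_file.ip_address.apply(reverse_dns)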
# Alternative
from crawlerdetect import CrawlerDetect
crawler_detect = CrawlerDetect()
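validate = []
for crawl in log_file.user_agent:
    # getMatches() reports the pattern matched by the most recent isCrawler() call.
    is_bot = crawler_detect.isCrawler(crawl)
    validate.append({'valid': is_bot, 'bot_type': crawler_detect.getMatches()})
# Reuse log_file's index so the new columns align row by row.
log_file[['valid', 'bot_type']] = pd.DataFrame(validate, index=log_file.index)[['valid', 'bot_type']]
# Keep only the requests identified as crawlers.
log_file = log_file[log_file['valid']]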
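
A reverse lookup alone can be spoofed, since whoever controls an IP range also controls its PTR records. The sketch below shows one way to tighten the `socket.gethostbyaddr` step with a forward-confirmed reverse DNS check. The function name `verify_crawler_ip`, the `allowed_suffixes` parameter and its default values, and the injectable `reverse_lookup` / `forward_lookup` arguments are all assumptions made for illustration and are not part of the original gist. It handles IPv4 only.

```python
import socket


def verify_crawler_ip(ip,
                      allowed_suffixes=(".googlebot.com", ".google.com"),
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Return the hostname if `ip` passes forward-confirmed reverse DNS, else None.

    `allowed_suffixes` is a hypothetical allow-list; adjust it to the crawler
    you want to verify. The lookup functions are injectable so the logic can
    be exercised without network access.
    """
    # Step 1: reverse lookup (IP -> hostname).
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # socket.herror and socket.gaierror subclass OSError
        return None

    # Step 2: the hostname must belong to one of the expected domains.
    if not hostname.lower().endswith(tuple(allowed_suffixes)):
        return None

    # Step 3: forward lookup (hostname -> IPs) must include the original IP.
    try:
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return None

    return hostname if ip in forward_ips else None
```

Applied to the log, this could replace the plain reverse lookup, for example `log_file['verified_host'] = log_file.ip_address.apply(verify_crawler_ip)`, where `verified_host` is again a made-up column name. Rows with `None` either have no PTR record or failed the check.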