
@eliasdabbas
Created October 13, 2022 09:59
Crawl a bunch of URLs using various combinations of request headers
import advertools as adv
import pandas as pd

pd.options.display.max_columns = None

headers_components = {
    'User-agent': [
        # Googlebot:
        'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        # iPhone 13:
        'Mozilla/5.0 (iPhone14,3; U; CPU iPhone OS 15_0 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/19A346 Safari/602.1',
        # Windows 10 with Edge browser:
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
    ],
    'Accept-Language': ['en', 'es'],
}

# Expand the components into every combination (3 user agents x 2 languages = 6 header sets):
request_headers = adv.serp._dict_product(headers_components)

amzn_urls = [
    'https://www.amazon.com/s?i=garden',
    'https://www.amazon.com/s?i=toys-and-games',
    'https://www.amazon.com/s?i=baby-products',
    'https://www.amazon.com/s?i=fashion-womens',
    'https://www.amazon.com/s?i=electronics',
    'https://www.amazon.com/s?i=kitchen',
]

# Crawl the same URLs once per header combination; each run appends
# to the same .jl output file and logs to the same log file:
for header in request_headers:
    adv.crawl(
        url_list=amzn_urls,
        output_file='amazon_crawl.jl',
        follow_links=False,
        custom_settings={
            'DEFAULT_REQUEST_HEADERS': header,
            'LOG_FILE': 'amazon_crawl.log',
        })

# Load the combined crawl output (one JSON object per line):
amazon = pd.read_json('amazon_crawl.jl', lines=True)
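A note on `adv.serp._dict_product`: it is a private advertools helper, so as an assumption about its behavior, here is a minimal stdlib sketch of the same idea using `itertools.product` — expanding a dict of option lists into one dict per combination (the user-agent strings are shortened placeholders, not the real ones above):

```python
from itertools import product


def dict_product(d):
    """Expand a dict of lists into a list of dicts,
    one per combination of single values (a cartesian product)."""
    keys = list(d)
    return [dict(zip(keys, combo)) for combo in product(*d.values())]


# Shortened placeholder values for illustration:
headers_components = {
    'User-agent': ['Googlebot-UA', 'iPhone-UA', 'Edge-UA'],
    'Accept-Language': ['en', 'es'],
}

combos = dict_product(headers_components)
print(len(combos))  # 3 user agents x 2 languages = 6 header sets
print(combos[0])    # {'User-agent': 'Googlebot-UA', 'Accept-Language': 'en'}
```

Each resulting dict is a complete set of request headers, which is why the loop above can pass one combination at a time to `DEFAULT_REQUEST_HEADERS`.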
@eliasdabbas (Author) commented:
Raw data: https://bit.ly/3eqFqsO

Sample rows from crawl dataset:

[Screenshot: sample rows from the crawl dataset]
