@eliasdabbas
Last active April 27, 2022 08:56
Crawl multiple websites with one for loop, while saving the output, logs, and job status separately for each website. Resume crawling at any time simply by re-running the same code.
from urllib.parse import urlsplit

import advertools as adv

sites = [
    'https://www.who.int',
    'https://www.nytimes.com',
    'https://www.washingtonpost.com',
]

for site in sites:
    domain = urlsplit(site).netloc
    adv.crawl(site,
              output_file=domain + '.jl',
              follow_links=True,
              custom_settings={
                  'LOG_FILE': domain + '.log',
                  # change this to any number of pages
                  'CLOSESPIDER_PAGECOUNT': 50,
                  # resume the same crawl job later
                  'JOBDIR': domain
              })
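Once a crawl finishes (or is stopped), each site's output file is in JSON lines format, so it can be loaded into a DataFrame for analysis. A minimal sketch, assuming pandas is installed and the .jl files sit in the working directory:

import pandas as pd

crawl_dfs = {}
for site in sites:
    domain = urlsplit(site).netloc
    # each line in the .jl file is one crawled page as a JSON object
    crawl_dfs[domain] = pd.read_json(domain + '.jl', lines=True)

for domain, df in crawl_dfs.items():
    print(domain, df.shape)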
@eliasdabbas (Author)

Directory structure after running the above code:

[screenshot of the resulting directory listing]
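
The same check can be done from Python. A small sketch (hypothetical, assuming the crawls were run from the current working directory) that lists what was created:

from pathlib import Path

for path in sorted(Path('.').iterdir()):
    kind = 'dir ' if path.is_dir() else 'file'
    print(kind, path.name)
# expected for each domain: a <domain>.jl output file, a <domain>.log
# log file, and a <domain>/ JOBDIR folder holding the crawl state used
# for resuming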

Line, word, character count for each file:

[screenshot of the line, word, and character counts]
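
Similar counts can also be produced in Python (a rough sketch; it reads each file fully into memory, which is fine for small crawls like the 50-page ones above):

from pathlib import Path

for f in sorted(Path('.').glob('*.jl')) + sorted(Path('.').glob('*.log')):
    text = f.read_text()
    # line, word, and character counts for each output and log file
    print(len(text.splitlines()), len(text.split()), len(text), f.name)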
