Why crawl the web when someone already does it for us?
A demo walking through the Common Crawl data: https://tech.marksblogg.com/petabytes-of-website-data-spark-emr.html
- Multiple times a year, a new crawl of the "whole" web is completed. Each crawl creates an 'index' named YYYY-WW (a year + week-number combination, e.g. CC-MAIN-2018-13)
- Each index has a master paths file listing every file that makes up that crawl
- That paths file contains the 50k+ URLs of the warc.gz files which hold the data in the index (see the fetch sketch after this list)
- For each warc.gz file, stream it, unzip it, and parse it using go-warc as shown by ccrawl (see the streaming sketch after this list)
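
A minimal sketch of pulling that paths file, assuming the data.commoncrawl.org HTTP endpoint and CC-MAIN-2018-13 as an example index (swap in whichever index you want):

```go
// List the warc.gz paths that make up a single crawl.
// The crawl name and the data.commoncrawl.org endpoint are assumptions;
// substitute the index you actually want to walk.
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// warc.paths.gz is the gzipped list of every warc.gz file in the crawl.
	url := "https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-13/warc.paths.gz"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	gz, err := gzip.NewReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()

	// Print the first few paths; there are 50k+ in total.
	scanner := bufio.NewScanner(gz)
	for i := 0; scanner.Scan() && i < 10; i++ {
		fmt.Println(scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```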
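
go-warc / ccrawl handle the record parsing for you; the hand-rolled sketch below shows roughly what that streaming loop looks like using only the standard library (the warc.gz URL is a made-up placeholder, take a real one from the paths file above):

```go
// Stream one warc.gz file and walk its records, printing the target URI
// of each response record. This is a stand-in for what a WARC library
// such as go-warc does, not its actual API.
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"net/http"
	"strconv"
	"strings"
)

func main() {
	// Placeholder path; use an entry from warc.paths.gz.
	warcURL := "https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-13/segments/example/warc/example.warc.gz"

	resp, err := http.Get(warcURL)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Common Crawl WARCs are multi-member gzip; Go's gzip reader
	// transparently concatenates the members into one stream.
	gz, err := gzip.NewReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()

	r := bufio.NewReader(gz)
	for {
		// Each record starts with a "WARC/1.0" version line.
		line, err := r.ReadString('\n')
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if strings.TrimSpace(line) == "" {
			continue // blank separator lines between records
		}

		// Read the record headers up to the first blank line.
		headers := map[string]string{}
		for {
			h, err := r.ReadString('\n')
			if err != nil {
				log.Fatal(err)
			}
			h = strings.TrimRight(h, "\r\n")
			if h == "" {
				break
			}
			if k, v, ok := strings.Cut(h, ":"); ok {
				headers[strings.TrimSpace(k)] = strings.TrimSpace(v)
			}
		}

		// Skip the record body using its declared length.
		n, err := strconv.ParseInt(headers["Content-Length"], 10, 64)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := io.CopyN(io.Discard, r, n); err != nil {
			log.Fatal(err)
		}

		if headers["WARC-Type"] == "response" {
			fmt.Println(headers["WARC-Target-URI"])
		}
	}
}
```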
There is a simple example of crawling all the indexes, looking for all *.au domains and listing when they were first and last seen according to the Common Crawl corpus:
https://gist.github.com/Xeoncross/020c283e334a94539676f029e3039c86
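
One way to approximate what that gist does without touching any WARC files: walk every index listed in collinfo.json and ask each index's CDX API for captures of a domain, keeping the earliest and latest timestamps. The "cdx-api" field name and the query parameters (url, matchType, output, limit) are assumptions based on the pywb CDX server that index.commoncrawl.org runs, and example.com.au is a placeholder domain:

```go
// Track the first and last capture timestamps of a domain across all crawls.
// Rough sketch only; see the lead-in for which parts are assumptions.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

type crawlIndex struct {
	ID     string `json:"id"`
	CDXAPI string `json:"cdx-api"`
}

type capture struct {
	Timestamp string `json:"timestamp"`
	URL       string `json:"url"`
}

func main() {
	domain := "example.com.au" // placeholder domain

	resp, err := http.Get("http://index.commoncrawl.org/collinfo.json")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var indexes []crawlIndex
	if err := json.NewDecoder(resp.Body).Decode(&indexes); err != nil {
		log.Fatal(err)
	}

	first, last := "", ""
	for _, idx := range indexes {
		// Ask each index for a few captures under the domain (JSON-lines output).
		q := fmt.Sprintf("%s?url=%s&matchType=domain&output=json&limit=5",
			idx.CDXAPI, url.QueryEscape(domain))
		res, err := http.Get(q)
		if err != nil || res.StatusCode != http.StatusOK {
			if res != nil {
				res.Body.Close()
			}
			continue // this index may have no captures for the domain
		}
		scanner := bufio.NewScanner(res.Body)
		for scanner.Scan() {
			var c capture
			if err := json.Unmarshal(scanner.Bytes(), &c); err != nil {
				continue
			}
			// Timestamps are 14-digit strings, so string comparison orders them.
			if first == "" || c.Timestamp < first {
				first = c.Timestamp
			}
			if c.Timestamp > last {
				last = c.Timestamp
			}
		}
		res.Body.Close()
	}
	fmt.Printf("%s first seen %s, last seen %s\n", domain, first, last)
}
```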
```sh
# Find the most recent crawl for the current year
aws s3 ls s3://commoncrawl/crawl-data/ | grep "CC-MAIN-$(date +%Y)" | tail -n 1
# or via JSON at: http://index.commoncrawl.org/collinfo.json
```