The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. Downloading it is free from any instance on Amazon EC2, via either S3 or HTTP.
As the Common Crawl Foundation has evolved over the years, so have the format and metadata that accompany the crawls themselves.
- [ARC] Archived Crawl #1 - s3://commoncrawl/crawl-001/ - crawl data from 2008/2010
- [ARC] Archived Crawl #2 - s3://commoncrawl/crawl-002/ - crawl data from 2009/2010
- [ARC] Archived Crawl #3 - s3://commoncrawl/parse-output/ - crawl data from 2012
- [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2013-20/
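The locations listed above are plain `s3://bucket/prefix` URIs, so tools that speak S3 or HTTP need them split into a bucket name and a key prefix first. The following is a minimal sketch of that step. The helper names `split_s3_uri` and `to_http_url` are hypothetical, and the HTTP form assumes the generic virtual-hosted-style S3 host pattern (`<bucket>.s3.amazonaws.com`); whether the bucket answers anonymous requests at that host is an assumption to verify before relying on it.

```python
from urllib.parse import urlparse

# Crawl locations copied from the list above, keyed by archive format and name.
CRAWL_LOCATIONS = {
    "ARC crawl-001": "s3://commoncrawl/crawl-001/",
    "ARC crawl-002": "s3://commoncrawl/crawl-002/",
    "ARC parse-output": "s3://commoncrawl/parse-output/",
    "WARC CC-MAIN-2013-20": "s3://commoncrawl/crawl-data/CC-MAIN-2013-20/",
}


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key_prefix). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The URI path starts with '/', which is not part of the S3 key.
    return parsed.netloc, parsed.path.lstrip("/")


def to_http_url(uri):
    """Build a virtual-hosted-style HTTPS URL for an S3 URI.

    Assumption: the generic S3 host pattern applies to this bucket.
    """
    bucket, key = split_s3_uri(uri)
    return f"https://{bucket}.s3.amazonaws.com/{key}"


if __name__ == "__main__":
    for name, uri in CRAWL_LOCATIONS.items():
        bucket, prefix = split_s3_uri(uri)
        print(f"{name}: bucket={bucket} prefix={prefix} http={to_http_url(uri)}")
```

The bucket and prefix pair is the form most S3 clients expect when listing objects, while the HTTP form is convenient for tools such as `curl` or `wget` that do not speak the S3 API.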