Data Location

The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. Downloading it is free from any Amazon EC2 instance, via either S3 or HTTP.

As the Common Crawl Foundation has evolved over the years, so have the format and metadata that accompany the crawls themselves.

  • [ARC] Archived Crawl #1 - s3://commoncrawl/crawl-001/ - crawl data from 2008/2010
  • [ARC] Archived Crawl #2 - s3://commoncrawl/crawl-002/ - crawl data from 2009/2010
  • [ARC] Archived Crawl #3 - s3://commoncrawl/parse-output/ - crawl data from 2012
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2013-20/
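Since the data is described as reachable over both S3 and HTTP, the mapping between the two addressing styles can be sketched as below. This is a minimal sketch, not an official client: the function name `s3ToHttpUrl` is hypothetical, the hostname follows the general S3 virtual-hosted-style pattern (`<bucket>.s3.amazonaws.com`), and it is an assumption that the bucket still answers anonymous HTTP requests at that address.

```javascript
// Sketch: convert an s3:// URI into a virtual-hosted-style HTTPS URL.
// Assumption: the bucket permits anonymous HTTP reads at this endpoint.
function s3ToHttpUrl(s3Uri) {
  // Split "s3://bucket/key" into its bucket and key parts.
  var match = /^s3:\/\/([^\/]+)\/(.*)$/.exec(s3Uri);
  if (!match) {
    return null; // not an s3://bucket/key URI
  }
  var bucket = match[1];
  var key = match[2];
  // Encode each path segment but keep the "/" separators intact.
  var encodedKey = key.split("/").map(encodeURIComponent).join("/");
  return "https://" + bucket + ".s3.amazonaws.com/" + encodedKey;
}
```

For example, `s3ToHttpUrl("s3://commoncrawl/crawl-data/CC-MAIN-2013-20/")` yields a URL under `https://commoncrawl.s3.amazonaws.com/`. A URL that ends in a prefix rather than a full object key may not return anything useful over plain HTTP, so in practice the function would be applied to individual file keys.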
jedsundwall / gist:2244395
Created March 29, 2012 22:29 — forked from tantalor/gist:2244383
hacking google spreadsheets
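// Custom spreadsheet function: builds a URL from a first and last name.
function makeURL(firstname, lastname) {
  var name = firstname + lastname;
  if (!name) {
    // Both cells are empty, so there is nothing to link to.
    return null;
  }
  // Encode the name so spaces and special characters stay valid in the URL.
  return "http://vivemejor.com/blahblah?this&that" + encodeURIComponent(name);
}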