== Overview of Datasets ==
The examples in this book use the "Chimpmark" datasets: a set of freely redistributable datasets, converted to simple standard formats, with traceable provenance and documented schemas. They are the same datasets used in the upcoming Chimpmark Challenge big-data benchmark. The datasets are:
- Wikipedia English-language Article Corpus (`wikipedia_corpus`; 38 GB, 619 million records, 4 billion tokens): the full text of every English-language Wikipedia article, in
- Wikipedia Pagelink Graph (`wikipedia_pagelinks`; ) --
- Wikipedia Pageview Stats (`wikipedia_pageviews`; 2.3 TB, about 250 billion records (FIXME: verify num records)) -- hour-by-hour pageview