== Overview of Datasets ==
The examples in this book use the "Chimpmark" datasets: a set of freely-redistributable datasets, converted to simple standard formats, with traceable provenance and documented schema. They are the same datasets as used in the upcoming Chimpmark Challenge big-data benchmark. The datasets are:
- Wikipedia English-language Article Corpus (wikipedia_corpus; 38 GB, 619 million records, 4 billion tokens): the full text of every English-language Wikipedia article.
- Wikipedia Pagelink Graph (wikipedia_pagelinks).
- Wikipedia Pageview Stats (wikipedia_pageviews; 2.3 TB, about 250 billion records (FIXME: verify num records)): hour-by-hour pageview statistics.
- ASA SC/SG Data Expo Airline Flights (airline_flights; 12 GB, 120 million records): every US airline flight from 1987-2008, with information on arrival/departure times and delay causes, and accompanying data on airlines, airports and airplanes.
- NCDC Hourly Global Weather Measurements, 1929-2009 (ncdc_weather_hourly; 59 GB, XX billion records): hour-by-hour weather from the National Climatic Data Center for the entire globe, with reasonably dense spatial coverage back to the 1950s and in some cases coverage back to 1929.
- 1998 World Cup access logs (access_logs/ita_world_cup_apachelogs; 123 GB, 1.3 billion records): every request made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998, in Apache log format.
- Trendingtopics.org sample: a 150 GB sample of the data used to power trendingtopics.org, including a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).
- Twilio/Wigle.net Street Vector Data Set (geo): Twilio/Wigle.net database of mapped US street names and address ranges.
- 2008 TIGER/Line Shapefiles (125 GB; geo): a complete set of Census 2000 and current shapefiles for American states, counties, subdivisions, districts, places, and areas. The data is available as shapefiles suitable for use in GIS, along with their associated metadata. The official source of this data is the US Census Bureau, Geography Division.
=== ASA SC/SG Data Expo Airline Flights
This data set is from the ASA Statistical Computing / Statistical Graphics section 2009 contest, "Airline Flight Status -- Airline On-Time Statistics and Delay Causes". The documentation below is largely adapted from that site.
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the BTS website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and it takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.
The data comes originally from the DOT's Research and Innovative Technology Administration (RITA) group, where it is described in detail. You can download the original data there. The files here have derivable variables removed, are packaged in yearly chunks and have been more heavily compressed than the originals.
Here are a few ideas to get you started exploring the data:
- When is the best time of day/day of week/time of year to fly to minimise delays? (A small code sketch for this question follows the list.)
- Do older planes suffer more delays?
- How does the number of people flying between different locations change over time?
- How well does weather predict plane delays?
- Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
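As a concrete starting point for the first question, here is a minimal sketch in Python (pandas) that computes the average departure delay by scheduled hour of day. The file name 2007.csv is illustrative, and the column names (CRSDepTime, DepDelay) follow the Data Expo documentation; verify both against your copy of the data.

[source,python]
----
# Minimal sketch: average departure delay by scheduled hour of day.
# Assumes a Data Expo yearly file (e.g. 2007.csv) with CRSDepTime and
# DepDelay columns; check file names and layout against your download.
import pandas as pd

flights = pd.read_csv("2007.csv", usecols=["CRSDepTime", "DepDelay"])

# CRSDepTime is the scheduled departure time as an hhmm number (e.g. 1745);
# integer-divide by 100 to get the hour of day.
flights["dep_hour"] = (flights["CRSDepTime"] // 100) % 24

# Cancelled flights have no recorded delay; drop them.
flights = flights.dropna(subset=["DepDelay"])

avg_delay = flights.groupby("dep_hour")["DepDelay"].mean()
print(avg_delay.sort_values())
----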
==== Support data
- Openflights.org (ODbL-licensed): user-generated datasets on the world of air flight.
** openflights_airports.tsv (http://openflights.org/data.html#airport[original]) -- info on about 7,000 airports.
** openflights_airlines.tsv (http://openflights.org/data.html#airline[original]) -- info on about 6,000 airline carriers.
** openflights_routes.tsv (http://openflights.org/data.html#route[original]) -- info on about 60,000 routes between 3,000 airports on 531 airlines.
- Dataexpo (public domain): the core airline flights database includes
** dataexpo_airports.tsv (http://stat-computing.org/dataexpo/2009/supplemental-data.html[original]) -- info on about 3,400 US airports; slightly cleaner but less comprehensive than the Openflights.org data.
** dataexpo_airplanes.tsv (http://stat-computing.org/dataexpo/2009/supplemental-data.html[original]) -- info on about 5,030 US commercial airplanes, by tail number.
** dataexpo_airlines.tsv (http://stat-computing.org/dataexpo/2009/supplemental-data.html[original]) -- info on about 1,500 US airline carriers; slightly cleaner but less comprehensive than the Openflights.org data.
- Wikipedia.org (CC-BY-SA licensed): airport identifiers.
** wikipedia_airports_iata.tsv (http://en.wikipedia.org/wiki/List_of_airports_by_IATA_code[original]) -- user-generated dataset pairing airports with their IATA (and often ICAO and FAA) identifiers.
** wikipedia_airports_icao.tsv (http://en.wikipedia.org/wiki/List_of_airports_by_ICAO_code[original]) -- user-generated dataset pairing airports with their ICAO (and often IATA and FAA) identifiers.
The airport datasets contain errors and conflicts; we've done some hand-curation and verification to reconcile them. The file wikipedia_conflicting.tsv shows where our patience wore out.
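For example, the support data can be joined onto the flight records to attach airport coordinates. Below is a minimal sketch in Python (pandas); the openflights_airports.tsv column names (iata, latitude, longitude) and the flight file name are assumptions for illustration, so check them against the actual file headers.

[source,python]
----
# Minimal sketch: attach airport coordinates to each flight's origin airport.
# The openflights_airports.tsv column names below are assumptions; verify them
# against the file or the openflights.org documentation.
import pandas as pd

airports = pd.read_csv("openflights_airports.tsv", sep="\t")
flights = pd.read_csv("2007.csv", usecols=["Origin", "Dest", "DepDelay"])

# The Data Expo 'Origin' field is an IATA code, so join on that.
enriched = flights.merge(
    airports[["iata", "latitude", "longitude"]],
    left_on="Origin", right_on="iata", how="left",
)
print(enriched.head())
----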
=== ITA World Cup Apache Logs
- 1998 World Cup access logs (access_logs/ita_world_cup_apachelogs; 123 GB, 1.3 billion records): every request made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998, in Apache log format. A parsing sketch follows.
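Since each request is one Apache log line, a minimal parsing sketch in Python is shown below. The regular expression targets the standard common-log fields (host, identity, user, timestamp, request, status, bytes); treat it as a starting point and check it against actual lines from these logs.

[source,python]
----
# Minimal sketch: parse Apache common-log-format lines into dicts.
# The field layout assumed here is the standard common log format;
# verify it against the actual files before trusting the parse.
import re

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_line(line):
    """Return a dict of request fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    sample = '192.0.2.1 - - [30/Apr/1998:21:30:17 +0000] "GET /images/logo.gif HTTP/1.0" 200 1204'
    print(parse_line(sample))
----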
=== Daily Global Weather Measurements, 1929-2009 (NCDC, GSOD) ===
- 20 GB
- geo, stats
=== Retrosheet
- Retrosheet: MLB play-by-play, high detail, 1840-2011 -- 25 MB -- ripd/www.retrosheet.org-2007/boxesetc/2006
- Retrosheet: MLB box scores, 1871-2011 -- 25 MB -- ripd/www.retrosheet.org-2007/boxesetc/2006
=== Other Datasets ===
approx size (MB)   Mrecs   dataset   source/location
huge US Patent Data from Google www.google.com/googlebooks/uspto-patents.html[Google Patent Collection]
huge 1 Mathematical constants to a billion+ decimal places www.numberworld.org/ftp
2_300_000 250000 Wikipedia Pageview Stats dumps.wikimedia.org/other/pagecounts-raw
470_000 Wikibench.eu Wikipedia Log traces Wikibench.eu
124_000 1300000 Access Logs, 1998 World Cup (Internet Traffic Archive) access_logs/ita/ita_world_cup
40_000 B NCDC: Hourly Weather (full) ftp.ncdc.noaa.gov/pub/data/noaa
34_000 10 MLB Gameday Pitch-by-pitch data, 2007-2011 gd2.mlb.com/components/game/mlb
16_000 619 Wikipedia corpus and pagelinks dumps.wikimedia.org/enwiki/20120601
14_000 NCDC: Hourly weather (simplified) ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite
14_000 Memetracker snap.stanford.edu/data/bigdata/memetracker9
14_000 Amazon Co-Purchasing Data snap.stanford.edu/data/bigdata/amazon0312.html
11_000 Crosswikis nlp.stanford.edu/pubs/crosswikis-data.tar.bz2
6_400 NCDC: Daily Weather ftp.ncdc.noaa.gov/pub/data/gsod
6_300 Berkeley Earth Surface Temperature stats/earth_surface_temperature
2_900 Twilio TigerLINE US Street Map geo/us_street_map/addresses
1_900 All US Airline Flights 1987-2009 (ASA Data Expo) stat-computing.org/dataexpo/2009
1_300 Geonames Points of Interest geo/geonames/info
1_300 Daily Prices for all US stocks, 1962–2011 stats/stock_prices
1_040 Patent data (see Google data too) www.nber.org/~jbessen
573 TAKS Exam Scores for all Texas students, 2007-2010 ripd/texas_taks_exam
571 Pi to 1 Billion decimal places ja0hxv.calico.jp/value/pai/val01/pi
419 Enron Email Corpus lang/corpora/enron_trial_coporate_email_corpus
362 DBpedia Wikipedia Article Features downloads.dbpedia.org/3.7/links
331 DBpedia spotlight.dbpedia.org/datasets
310 Grouplens: User-Movie affinity graph/grouplens_movies
305 UFO Sightings (UFORC) geo/ufo_sightings
223 Geonames Postal Codes geo/geonames/postal_codes
121 Book Crossing: User-Book affinity graph/book_crossing
111 Maxmind GeoLite (IP-Geo) data ripd/geolite.maxmind.com/download
91 Access Logs: waxy.org's Star Wars Kid logs access_logs/star_wars_kid
62 Metafilter corpus of postings with metadata ripd/stuff.metafilter.com/infodump
47 Word frequencies from the British National Corpus ucrel.lancs.ac.uk/bncfreq/lists
36 Mobywords thesaurus lang/corpora/thesaurus_mobywords
25 Retrosheet: MLB play-by-play, high detail, 1840-2011 ripd/www.retrosheet.org-2007/boxesetc/2006
25 Retrosheet: MLB box scores, 1871-2011 ripd/www.retrosheet.org-2007/boxesetc/2006
20 US Federal Reserve Bank Loans (Bloomberg) misc/bank_loans_by_fed
11 Scrabble dictionaries lang/corpora/scrabble
11 All Scrabble tile combinations with rack value misc/words_quackle
1000 Marvel Universe Social Graph
. Materials Safety Datasheets
. Crunchbase
. Natural Earth detailed geographic boundaries
. US Census 2009 ACS (Long-form census)
. US Census Geographic boundaries
. Zillow US Neighborhood Boundaries
. Open Street Map
2_000_000 Google Books N-Grams aws.amazon.com/datasets/8172056142375670
60_000_000 Common Crawl Web Corpus aws.amazon.com/datasets/41740
600_000 Apache Software Foundation Public Mail Archives aws.amazon.com/datasets/7791434387204566
300_000 Million-Song dataset labrosa.ee.columbia.edu/millionsong
. Reference Energy Disaggregation Dataset (REDD) redd.csail.mit.edu/
. US Legislation Co-Sponsorship jhfowler.ucsd.edu/cosponsorship.htm
. VoteView: Political Spectrum Rank of US Legislators/Laws voteview.org/downloads.asp -- DW-NOMINATE rank orderings for all Houses and Senates
. World Bank data.worldbank.org
. Record of American Democracy (ROAD) road.hmdc.harvard.edu/pages/road-documentation -- The ROAD data includes election returns, socioeconomic summaries, and demographic measures of the American public at unusually low levels of geographic aggregation. The NSF-supported ROAD project covers every state in the country from 1984 through 1990 (including some off-year elections). One collection of data sets includes every election at and above State House, along with party registration and other variables, in each state for the roughly 170,000 precincts nationwide (about 60 times the number of counties). Another collection adds to these (roughly 30-40) political variables an additional 3,725 variables merged from the 1990 U.S. Census for 47,327 aggregate units (about 15 times the number of counties), each about the size of one or more cities or towns. These units completely tile the U.S. landmass. The collection also includes geographic boundary files so users can easily draw maps with these data.
. Human Mortality Database www.mortality.org/ -- The Human Mortality Database (HMD) was created to provide detailed mortality and population data to researchers, students, journalists, policy analysts, and others interested in the history of human longevity. The project began as an outgrowth of earlier projects in the Department of Demography at the University of California, Berkeley, USA, and at the Max Planck Institute for Demographic Research in Rostock, Germany. It is the work of two teams of researchers in the USA and Germany, with the help of financial backers and scientific collaborators from around the world.
. FCC Antenna locations transition.fcc.gov/mb/databases/cdbs
. Pew Research Datasets pewinternet.org/Static-Pages/Data-Tools/Download-Data/Data-Sets.aspx
. Youtube Related Videos netsg.cs.sfu.ca/youtubedata
. Westbury Usenet Archive (2005-2010) www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html -- described below
. Wikipedia Page Traffic Statistics aws.amazon.com/datasets/2596 (snap-753dfc1c)
. Wikipedia Traffic Statistics V2 aws.amazon.com/datasets/4182 (snap-0c155c67)
. Wikipedia Page Traffic Statistics V3 aws.amazon.com/datasets/6025882142118545 (snap-f57dec9a)
. Marvel Universe Social Graph aws.amazon.com/datasets/5621954952932508 (snap-7766d116)
10_000 Daily Global Weather, 1929-2009 aws.amazon.com/datasets/2759 (snap-ac47f4c5)
220_000 Twilio/Wigle.net Street Vector Data Set aws.amazon.com/datasets/2408 (snap-5eaf5537) -- MySQL, geo -- a complete database of US street names and address ranges mapped to zip codes and latitude/longitude ranges, with DTMF key mappings for all street names.
. US Economic Data 2003-2006 aws.amazon.com/datasets/2341 (snap-0bdf3f62) -- stats -- US Economic Data for 2003-2006 from the US Census Bureau; raw census data (ACS 2002-2006).
==== Wikibench.eu Wikipedia Log traces ====
- logs/wikibench_logtraces
- 470 GB
==== Amazon Co-Purchasing Data ====
==== Patents ====
- http://www.google.com/googlebooks/uspto-patents.html[Google Patent Collection]
==== Marvel Universe Social Graph ====
- 1 GB
- graph
- Social collaboration network of the Marvel comic book universe based on co-appearances.
==== Google Books Ngrams ====
- http://aws.amazon.com/datasets/8172056142375670[Google Books Ngrams]
- 2_000 GB
- graph, linguistics
==== Common Crawl web corpus ====
http://aws.amazon.com/datasets/41740
s3://aws-publicdatasets/common-crawl/crawl-002
A corpus of web crawl data composed of 5 billion web pages. This data set is freely available on Amazon S3 and formatted in the ARC (.arc) file format.
Details
- Size: 60 TB
- Source: Common Crawl Foundation - http://commoncrawl.org
- Created On: February 15, 2012 2:23 AM GMT
- Last Updated: February 15, 2012 2:23 AM GMT
- Available at: s3://aws-publicdatasets/common-crawl/crawl-002/
Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data for the purpose of driving innovation in research, education and technology. This data set contains web crawl data from 5 billion web pages and is released under the Common Crawl Terms of Use.
The ARC (.arc) file format used by Common Crawl was developed by the Internet Archive to store their archived crawl data. It is essentially a multi-part gzip file, with each entry in the master gzip (ARC) file being an independent gzip stream in itself. You can use a tool like zcat to spill the contents of an ARC file to stdout. For more information see the Internet Archive's Arc File Format description.
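Because each entry is an independent gzip member, Python's gzip module (which reads concatenated members transparently in Python 3) can walk an ARC file directly. Here is a minimal sketch, assuming a local file named sample.arc.gz (an illustrative name) and the ARC v1 convention that the last space-separated field of each record header is the body length in bytes.

[source,python]
----
# Minimal sketch: iterate over the records of a gzipped ARC file.
# Assumes the ARC v1 header layout (URL, IP, date, content-type, length),
# where the last field is the record body length in bytes.
import gzip

def arc_records(path):
    """Yield (header, body) pairs from a gzipped ARC file."""
    with gzip.open(path, "rb") as f:      # gzip reads all concatenated members
        while True:
            raw = f.readline()
            if not raw:
                break                      # end of archive
            header = raw.decode("utf-8", "replace").strip()
            if not header:
                continue                   # skip blank separators between records
            length = int(header.split()[-1])
            body = f.read(length)
            yield header, body

if __name__ == "__main__":
    for header, body in arc_records("sample.arc.gz"):
        print(header)
----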
Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce that can run against the crawl corpus residing here in the Amazon Public Data Sets. By utilizing Amazon Elastic MapReduce to access the S3 resident data, end users can bypass costly network transfer costs.
To learn more about Amazon Elastic MapReduce please see the product detail page.
Common Crawl's Hadoop classes and other code can be found in its GitHub repository.
A tutorial for analyzing Common Crawl's dataset with Amazon Elastic MapReduce called MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl may be found on the Common Crawl blog.
==== Apache Software Foundation Public Mail Archives ====
- Original: http://aws.amazon.com/datasets/7791434387204566[Apache Software Foundation Public Mail Archives]
- 200 GB
- corpus
- A collection of all publicly available mail archives from the Apache Software Foundation (ASF)
==== Reference Energy Disaggregation Dataset (REDD) ====
http://redd.csail.mit.edu/[Reference Energy Disaggregation Data Set]
Initial REDD Release, Version 1.0
This is the home page for the REDD data set. Below you can download an initial version of the data set, containing several weeks of power data for 6 different homes, and high-frequency current/voltage data for the main power supply of two of these homes. The data itself and the hardware used to collect it are described more thoroughly in the Readme below and in the paper:
J. Zico Kolter and Matthew J. Johnson. REDD: A public data set for energy disaggregation research. In proceedings of the SustKDD workshop on Data Mining Applications in Sustainability, 2011. [pdf]
Those wishing to use the dataset in academic work should cite this paper as the reference. Although the data set is freely available, for the time being we still ask those interested in downloading the data to email us ([email protected]) to receive the username/password for the download. See the readme.txt file for a full description of the different downloads and their formats.
==== The Book-Crossing dataset ====
- http://www.informatik.uni-freiburg.de/~cziegler/BX/[Book Crossing] Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books. Freely available for research use when acknowledged with the following reference (further details on the dataset are given in this publication): Improving Recommendation Lists Through Topic Diversification, Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. As a courtesy, if you use the data, the authors would appreciate knowing your name, what research group you are in, and the publications that may result.
The Book-Crossing dataset comprises 3 tables.
- BX-Users: contains the users. User IDs (User-ID) have been anonymized and map to integers. Demographic data (Location, Age) is provided if available; otherwise these fields contain NULL values.
- BX-Books: books are identified by their respective ISBN; invalid ISBNs have already been removed from the dataset. Some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services; in the case of several authors, only the first is provided. URLs linking to cover images are also given in three flavours (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large; these URLs point to the Amazon web site.
- BX-Book-Ratings: contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0. (A short loading sketch follows this list.)
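Below is a minimal loading sketch in Python (pandas). The ';' separator and Latin-1 encoding are assumptions about the distributed CSV dump, as are the exact file names; check them against your download.

[source,python]
----
# Minimal sketch: load the three Book-Crossing tables and attach book titles
# to the explicit ratings. The separator, encoding, and file names below are
# assumptions about the distributed dump; verify them against your copy.
import pandas as pd

read_opts = dict(sep=";", encoding="latin-1", on_bad_lines="skip")

users   = pd.read_csv("BX-Users.csv", **read_opts)
books   = pd.read_csv("BX-Books.csv", **read_opts)
ratings = pd.read_csv("BX-Book-Ratings.csv", **read_opts)

# Keep only explicit ratings (1-10); a rating of 0 marks an implicit rating.
explicit = ratings[ratings["Book-Rating"] > 0]

rated_books = explicit.merge(books[["ISBN", "Book-Title"]], on="ISBN", how="left")
print(rated_books.head())
----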
==== Westbury Usenet Archive ====
- http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html[Westbury Usenet Archive] -- USENET corpus (2005-2010) This corpus is a collection of public USENET postings. This corpus was collected between Oct 2005 and Jan 2011, and covers 47860 English language, non-binary-file news groups. Despite our best efforts, this corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be necessary to process the corpus further to put the corpus in a format that suits your needs.
==== Million Song Dataset ====
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.
Its purposes are:
- to encourage research on algorithms that scale to commercial sizes
- to provide a reference dataset for evaluating research
- to serve as a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)
- to help new researchers get started in the MIR field
The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide.
The Million Song Dataset is also a cluster of complementary datasets contributed by the community:
- SecondHandSongs dataset: cover songs
- musiXmatch dataset: lyrics
- Last.fm dataset: song-level tags and similarity
- Taste Profile subset: user data
From the original documentation:
- analysis sample rate (float) -- sample rate of the audio used
- artist 7digitalid (int) -- ID from 7digital.com or -1
- artist familiarity (float) -- algorithmic estimation
- artist hotttnesss (float) -- algorithmic estimation
- artist id (string) -- Echo Nest ID
- artist latitude (float) -- latitude
- artist location (string) -- location name
- artist longitude (float) -- longitude
- artist mbid (string) -- ID from musicbrainz.org
- artist mbtags (array string) -- tags from musicbrainz.org
- artist mbtags count (array int) -- tag counts for musicbrainz tags
- artist name (string) -- artist name
- artist playmeid (int) -- ID from playme.com, or -1
- artist terms (array string) -- Echo Nest tags
- artist terms freq (array float) -- Echo Nest tags freqs
- artist terms weight (array float) -- Echo Nest tags weight
- audio md5 (string) -- audio hash code
- bars confidence (array float) -- confidence measure
- bars start (array float) -- beginning of bars, usually on a beat
- beats confidence (array float) -- confidence measure
- beats start (array float) -- result of beat tracking
- danceability (float) -- algorithmic estimation
- duration (float) -- in seconds
- end of fade in (float) -- seconds at the beginning of the song
- energy (float) -- energy from listener point of view
- key (int) -- key the song is in
- key confidence (float) -- confidence measure
- loudness (float) -- overall loudness in dB
- mode (int) -- major or minor
- mode confidence (float) -- confidence measure
- release (string) -- album name
- release 7digitalid (int) -- ID from 7digital.com or -1
- sections confidence (array float) -- confidence measure
- sections start (array float) -- largest grouping in a song, e.g. verse
- segments confidence (array float) -- confidence measure
- segments loudness max (array float) -- max dB value
- segments loudness max time (array float) -- time of max dB value, i.e. end of attack
- segments loudness start (array float) -- dB value at onset
- segments pitches (2D array float) -- chroma feature, one value per note
- segments start (array float) -- musical events, ~ note onsets
- segments timbre (2D array float) -- texture features (MFCC+PCA-like)
- similar artists (array string) -- Echo Nest artist IDs (sim. algo. unpublished)
- song hotttnesss (float) -- algorithmic estimation
- song id (string) -- Echo Nest song ID
- start of fade out (float) -- time in sec
- tatums confidence (array float) -- confidence measure
- tatums start (array float) -- smallest rhythmic element
- tempo (float) -- estimated tempo in BPM
- time signature (int) -- estimate of number of beats per bar, e.g. 4
- time signature confidence (float) -- confidence measure
- title (string) -- song title
- track id (string) -- Echo Nest track ID
- track 7digitalid (int) -- ID from 7digital.com or -1
- year (int) -- song release year from MusicBrainz or 0
Below is a list of all the fields associated with each track in the database. This is simply an annotated version of the output of the example code display_song.py. For the fields that include a large amount of numerical data, we indicate only the shape of the data array. Since most of these fields are taken directly from the Echo Nest Analyze API, more details can be found at the Echo Nest Analyze API documentation.
A more technically-oriented list of these fields is given on the field list page.
This example data is shown for the track whose track_id is TRAXLZU12903D05F94 - namely, "Never Gonna Give You Up" by Rick Astley. (A short code sketch for reading a few of these fields follows the listing.)
artist_mbid: db92a151-1ac2-438b-bc43-b82e149ddd50 the musicbrainz.org ID for this artists is db9...
artist_mbtags: shape = (4,) this artist received 4 tags on musicbrainz.org
artist_mbtags_count: shape = (4,) raw tag count of the 4 tags this artist received on musicbrainz.org
artist_name: Rick Astley artist name
artist_playmeid: 1338 the ID of that artist on the service playme.com
artist_terms: shape = (12,) this artist has 12 terms (tags) from The Echo Nest
artist_terms_freq: shape = (12,) frequency of the 12 terms from The Echo Nest (number between 0 and 1)
artist_terms_weight: shape = (12,) weight of the 12 terms from The Echo Nest (number between 0 and 1)
audio_md5: bf53f8113508a466cd2d3fda18b06368 hash code of the audio used for the analysis by The Echo Nest
bars_confidence: shape = (99,) confidence value (between 0 and 1) associated with each bar by The Echo Nest
bars_start: shape = (99,) start time of each bar according to The Echo Nest, this song has 99 bars
beats_confidence: shape = (397,) confidence value (between 0 and 1) associated with each beat by The Echo Nest
beats_start: shape = (397,) start time of each beat according to The Echo Nest, this song has 397 beats
danceability: 0.0 danceability measure of this song according to The Echo Nest (between 0 and 1, 0 => not analyzed)
duration: 211.69587 duration of the track in seconds
end_of_fade_in: 0.139 time of the end of the fade in, at the beginning of the song, according to The Echo Nest
energy: 0.0 energy measure (not in the signal processing sense) according to The Echo Nest (between 0 and 1, 0 => not analyzed)
key: 1 estimation of the key the song is in by The Echo Nest
key_confidence: 0.324 confidence of the key estimation
loudness: -7.75 general loudness of the track
mode: 1 estimation of the mode the song is in by The Echo Nest
mode_confidence: 0.434 confidence of the mode estimation
release: Big Tunes - Back 2 The 80s album name from which the track was taken, some songs / tracks can come from many albums, we give only one
release_7digitalid: 786795 the ID of the release (album) on the service 7digital.com
sections_confidence: shape = (10,) confidence value (between 0 and 1) associated with each section by The Echo Nest
sections_start: shape = (10,) start time of each section according to The Echo Nest, this song has 10 sections
segments_confidence: shape = (935,) confidence value (between 0 and 1) associated with each segment by The Echo Nest
segments_loudness_max: shape = (935,) max loudness during each segment
segments_loudness_max_time: shape = (935,) time of the max loudness during each segment
segments_loudness_start: shape = (935,) loudness at the beginning of each segment
segments_pitches: shape = (935, 12) chroma features for each segment (normalized so max is 1.)
segments_start: shape = (935,) start time of each segment (~ musical event, or onset) according to The Echo Nest, this song has 935 segments
segments_timbre: shape = (935, 12) MFCC-like features for each segment
similar_artists: shape = (100,) a list of 100 artists (their Echo Nest ID) similar to Rick Astley according to The Echo Nest
song_hotttnesss: 0.864248830588 according to The Echo Nest, when downloaded (in December 2010), this song had a 'hotttnesss' of 0.8 (on a scale of 0 to 1)
song_id: SOCWJDB12A58A776AF The Echo Nest song ID, note that a song can be associated with many tracks (with very slight audio differences)
start_of_fade_out: 198.536 start time of the fade out, in seconds, at the end of the song, according to The Echo Nest
tatums_confidence: shape = (794,) confidence value (between 0 and 1) associated with each tatum by The Echo Nest
tatums_start: shape = (794,) start time of each tatum according to The Echo Nest, this song has 794 tatums
tempo: 113.359 tempo in BPM according to The Echo Nest
time_signature: 4 time signature of the song according to The Echo Nest, i.e. usual number of beats per bar
time_signature_confidence: 0.634 confidence of the time signature estimation
title: Never Gonna Give You Up song title
track_7digitalid: 8707738 the ID of this song on the service 7digital.com
track_id: TRAXLZU12903D05F94 The Echo Nest ID of this particular track on which the analysis was done
year: 1987 year when this song was released, according to musicbrainz.org
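Each track in the dataset is distributed as a small HDF5 file, so the fields above can be read with any HDF5 library. Below is a minimal sketch using h5py; the group paths (metadata/songs, analysis/songs, analysis/segments_pitches) and the file name are assumptions drawn from the dataset's published layout, and the official hdf5_getters.py code remains the authoritative accessor.

[source,python]
----
# Minimal sketch: read a few fields from one Million Song Dataset track file.
# The HDF5 paths and the file name are assumptions to verify against the
# dataset's own hdf5_getters.py.
import h5py

with h5py.File("TRAXLZU12903D05F94.h5", "r") as h5:
    meta = h5["metadata"]["songs"][0]       # compound row of song metadata
    analysis = h5["analysis"]["songs"][0]   # compound row of audio analysis

    print("artist:", meta["artist_name"].decode())
    print("title: ", meta["title"].decode())
    print("tempo: ", analysis["tempo"])
    print("key/mode:", analysis["key"], analysis["mode"])

    # Per-segment chroma features: shape (n_segments, 12)
    pitches = h5["analysis"]["segments_pitches"][:]
    print("segments_pitches shape:", pitches.shape)
----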
==== Google / Stanford Crosswiki ====
http://www-nlp.stanford.edu/pubs/crosswikis-data.tar.bz2/[wikipedia_words]
This data set accompanies
Valentin I. Spitkovsky and Angel X. Chang. 2012. A Cross-Lingual Dictionary for English Wikipedia Concepts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
Please cite the appropriate publication if you use this data. (See http://nlp.stanford.edu/publications.shtml for .bib entries.)
There are six line-based (and two other) text files, each of them lexicographically sorted, encoded with UTF-8, and compressed using bzip2 (-9). One way to view the data without fully expanding it first is with the bzcat command, e.g.,
bzcat dictionary.bz2 | grep ... | less
Note that raw data were gathered from heterogeneous sources, at different points in time, and are thus sometimes contradictory. We made a best effort at reconciling the information, but likely also introduced some bugs of our own, so be prepared to write fault-tolerant code... keep in mind that even tiny error rates translate into millions of exceptions, over billions of datums.
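The same streaming approach works from Python via the bz2 module, which avoids expanding the files on disk first. Here is a minimal sketch; the tab-separated field layout is an assumption, so consult the dataset's README for the exact format of each file.

[source,python]
----
# Minimal sketch: stream a bzip2-compressed Crosswikis file line by line.
# The tab-separated layout is an assumption; check the dataset's README
# for the precise fields of each file.
import bz2

with bz2.open("dictionary.bz2", mode="rt", encoding="utf-8", errors="replace") as f:
    for i, line in enumerate(f):
        fields = line.rstrip("\n").split("\t")
        print(fields[:2])        # peek at the first couple of fields
        if i >= 9:               # stop after ten lines for a quick look
            break
----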
==== English Gigaword Dataset (LDC) ====
The http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T13[English Gigaword] corpus, now being released in its fourth edition, is a comprehensive archive of newswire text data that has been acquired over several years by the LDC at the University of Pennsylvania. The fourth edition includes all of the contents in English Gigaword Third Edition (LDC2007T07) plus new data covering the 24-month period of January 2007 through December 2008. Portions of the dataset are © 1994-2008 Agence France Presse, © 1994-2008 The Associated Press, © 1997-2008 Central News Agency (Taiwan), © 1994-1998, 2003-2008 Los Angeles Times-Washington Post News Service, Inc., © 1994-2008 New York Times, © 1995-2008 Xinhua News Agency, © 2009 Trustees of the University of Pennsylvania. The six distinct international sources of English newswire included in this edition are the following:
- Agence France-Presse, English Service (afp_eng)
- Associated Press Worldstream, English Service (apw_eng)
- Central News Agency of Taiwan, English Service (cna_eng)
- Los Angeles Times/Washington Post Newswire Service (ltw_eng)
- New York Times Newswire Service (nyt_eng)
- Xinhua News Agency, English Service (xin_eng)
For an example of the data in this corpus, please review http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC2009T13.html[this sample file].
=== Sources of Public and Commercial Data
((data_commons))
- Infochimps
- Factual
- CKAN
- Get.theinfo
- Microsoft Azure Data Marketplace