@rnirmal
Created June 11, 2017 15:03
datasetName	about	link	categoryName	cloud	vintage
Microbiome Project	American Gut (Microbiome Project)	https://github.com/biocore/American-Gut	Biology	GitHub	NA
GloBI	Global Biotic Interactions (GloBI)	https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data	Biology	GitHub	NA
Global Climate	Global Climate Data Since 1929	http://en.tutiempo.net/climate	Climate/Weather		1929
CommonCrawl 2012	3.5B Web Pages from CommonCrawl 2012	http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us	Computer Networks		2012
Indiana Webclicks	53.5B Web clicks of 100K users in Indiana Univ.	http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/	Computer Networks		NA
Criteo click-through	Criteo click-through data	http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/	Computer Networks		NA
ICWSM 2009	ICWSM Data Challenge (since 2009)	http://icwsm.cs.umbc.edu/	Data Challenges		2009
KDD Cup	KDD Cup by Tencent 2012	http://www.kddcup2012.org/	Data Challenges		2012
Localytics Data	Localytics Data Visualization Challenge	https://github.com/localytics/data-viz-challenge	Data Challenges	GitHub	NA
Yelp Dataset	Yelp Dataset Challenge	http://www.yelp.com/dataset_challenge	Data Challenges		NA
Bruteforce Database	Bruteforce Database	https://github.com/duyetdev/bruteforce-database	Data Challenges	GitHub	NA
Countries	List of all countries in all languages	https://github.com/umpirsky/country-list	GIS	GitHub	NA
TwoFishes	TwoFishes - Foursquare's coarse geocoder	https://github.com/foursquare/twofishes	GIS	GitHub	NA
World countries	World countries in multiple formats	https://github.com/mledoze/countries	GIS	GitHub	NA
Cities and countries	A list of cities and countries contributed by community	https://github.com/caesar0301/awesome-public-datasets/blob/master/Government.rst	Government	GitHub	NA
Ebola cases	Number of Ebola Cases and Deaths in Affected Countries (2014)	https://data.hdx.rwlabs.org/dataset/ebola-cases-2014	Healthcare		2014
eBay Online	eBay Online Auctions (2012)	http://www.modelingonlineauctions.com/datasets	Machine Learning		2012
New Yorker Captions	New Yorker caption contest ratings	https://github.com/nextml/caption-contest-data	Machine Learning	GitHub	NA
Cooper-Hewitt's Collection	Cooper-Hewitt's Collection Database	https://github.com/cooperhewitt/collection	Museums	GitHub	NA
Minneapolis Institute	Minneapolis Institute of Arts metadata	https://github.com/artsmia/collection	Museums	GitHub	NA
Tate Collection	Tate Collection metadata	https://github.com/tategallery/collection	Museums	GitHub	NA
Google 5gram	Google Web 5gram (1TB, 2006)	https://catalog.ldc.upenn.edu/LDC2006T13	Natural Language		2006
Arabic, 30K articles	SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)	https://github.com/ParallelMazen/SaudiNewsNet	Natural Language	GitHub	NA
USENET postings	USENET postings corpus of 2005~2011	http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html	Natural Language		2005
Datahub.io	Datahub.io	https://datahub.io/dataset	Search Engines		NA
Twitter Scrape	CIKM Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape	https://archive.org/details/twitter_cikm_2010	Social Networks		2009
Facebook Data	Facebook Data Scrape (2005)	https://archive.org/details/oxford-2005-facebook-matrix	Social Networks		2005
LAW graphs	Facebook Social Networks from LAW (since 2007)	http://law.di.unimi.it/datasets.php	Social Networks		2007
Foursquare from	Foursquare from UMN/Sarwat (2013)	https://archive.org/details/201309_foursquare_dataset_umn	Social Networks		2013
Skytrax' Air	Skytrax' Air Travel Reviews Dataset	https://github.com/quankiquanki/skytrax-reviews-dataset	Social Networks	GitHub	NA
Twitter Scrape	Twitter Scrape Calufa May 2011	http://archive.org/details/2011-05-calufa-twitter-sql	Social Networks		2011
Youtube Video	Youtube Video Social Graph in 2007,2008	http://netsg.cs.sfu.ca/youtubedata/	Social Networks		2007
FBI Hate Crime 2013	FBI Hate Crime 2013 - aggregated data	https://github.com/emorisse/FBI-Hate-Crime-Statistics/tree/master/2013	Social Sciences	GitHub	2013
GSS	General Social Survey (GSS) since 1972	http://gss.norc.org	Social Sciences		1972
Texas Inmates	Texas Inmates Executed Since 1984	http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html	Social Sciences		1984
Formula 1	Ergast Formula 1, from 1950 up to date (API)	http://ergast.com/mrd/db	Sports		1950
Pinhooker: Thoroughbred	Pinhooker: Thoroughbred Bloodstock Sale Data	https://github.com/phillc73/pinhooker	Sports	GitHub	NA
Airlines OD	Airlines OD Data 1987-2008	http://stat-computing.org/dataexpo/2009/the-data.html	Transportation		2008
BSS	Bike Share Systems (BSS) collection	https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems	Transportation	GitHub	NA
NYC Taxi	NYC Taxi Trip Data 2009-	http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml	Transportation		2009
FOIA/FOILed	NYC Taxi Trip Data 2013 (FOIA/FOILed)	https://archive.org/details/nycTaxiTripData2013	Transportation		2013
NYC Uber	NYC Uber trip data April 2014 to September 2014	https://github.com/fivethirtyeight/uber-tlc-foil-response	Transportation	GitHub	2014
Open Traffic	Open Traffic collection	https://github.com/graphhopper/open-traffic-collection	Transportation	GitHub	NA
Plane Crash	Plane Crash Database, since 1920	http://www.planecrashinfo.com/database.htm	Transportation		1920
U.S. Domestic	U.S. Domestic Flights 1990 to 2009	http://academictorrents.com/details/a2ccf94bbb4af222bf8e69dad60a68a29f310d9a	Transportation		2009
U.S. Freight	U.S. Freight Analysis Framework since 2007	http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm	Transportation		2007
Data Packaged	Data Packaged Core Datasets	https://github.com/datasets/	Complementary Collections	GitHub	NA
USDA PLANTS	U.S. Department of Agriculture's PLANTS Database	http://www.plants.usda.gov/dl_all.html	Agriculture		NA
ClueWeb09	ClueWeb09 - 1B web pages	http://lemurproject.org/clueweb09/	Computer Networks		2009
ClueWeb12	ClueWeb12 - 733M web pages	http://lemurproject.org/clueweb12/	Computer Networks		2012
DEFRA Projects	DEFRA Science and Research Projects data	http://randd.defra.gov.uk/	Energy		NA
UK-DALE	UK Domestic Appliance-Level Electricity (UK-DALE) dataset	http://www.doc.ic.ac.uk/~dk3810/data/	Energy		2016
Landsat 8	Landsat 8 on AWS	https://aws.amazon.com/public-data-sets/landsat/	GIS	Amazon	NA
Reverse Geocode	Simple but fast reverse geocoding up to city granularity level	https://github.com/kno10/reversegeocode	GIS	GitHub	NA
Faces Database	10k US Adult Faces Database	http://wilmabainbridge.com/facememorability2.html	Image Processing		NA
ClueWeb09 FACC	ClueWeb09 FACC	http://lemurproject.org/clueweb09/FACC1/	Natural Language		2009
ClueWeb12 FACC	ClueWeb12 FACC	http://lemurproject.org/clueweb12/FACC1/	Natural Language		2012
Google Ngrams	Google Books Ngrams (2.2TB)	https://aws.amazon.com/datasets/google-books-ngrams/	Natural Language	Amazon	NA
EDRM Enron	EDRM Enron EMail of 151 users, hosted on S3	https://aws.amazon.com/datasets/enron-email-data/	Social Networks	Amazon	NA
GetGlue	GetGlue - users rating TV shows	http://getglue-data.s3.amazonaws.com/getglue_sample.tar.gz	Social Networks		NA
Twitter RepLab	Twitter Data for Online Reputation Management	http://nlp.uned.es/replab2013/	Social Networks		2013