Created June 11, 2017 15:03
datasetName | about | link | categoryName | cloud | vintage | |
---|---|---|---|---|---|---|
Microbiome Project | American Gut (Microbiome Project) | https://github.com/biocore/American-Gut | Biology | GitHub | NA | |
GloBI | Global Biotic Interactions (GloBI) | https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data | Biology | GitHub | NA | |
Global Climate | Global Climate Data Since 1929 | http://en.tutiempo.net/climate | Climate/Weather | | 1929 | |
CommonCrawl 2012 | 3.5B Web Pages from CommonCrawl 2012 | http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us | Computer Networks | | 2012 | |
Indiana Webclicks | 53.5B Web clicks of 100K users in Indiana Univ. | http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/ | Computer Networks | | NA | |
Criteo click-through | Criteo click-through data | http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/ | Computer Networks | | NA | |
ICWSM 2009 | ICWSM Data Challenge (since 2009) | http://icwsm.cs.umbc.edu/ | Data Challenges | | 2009 | |
KDD Cup | KDD Cup by Tencent 2012 | http://www.kddcup2012.org/ | Data Challenges | | 2012 | |
Localytics Data | Localytics Data Visualization Challenge | https://github.com/localytics/data-viz-challenge | Data Challenges | GitHub | NA | |
Yelp Dataset | Yelp Dataset Challenge | http://www.yelp.com/dataset_challenge | Data Challenges | | NA | |
Bruteforce Database | Bruteforce Database | https://github.com/duyetdev/bruteforce-database | Data Challenges | GitHub | NA | |
Countries | List of all countries in all languages | https://github.com/umpirsky/country-list | GIS | GitHub | NA | |
TwoFishes | TwoFishes - Foursquare's coarse geocoder | https://github.com/foursquare/twofishes | GIS | GitHub | NA | |
World countries | World countries in multiple formats | https://github.com/mledoze/countries | GIS | GitHub | NA | |
Cities and countries | A list of cities and countries contributed by community | https://github.com/caesar0301/awesome-public-datasets/blob/master/Government.rst | Government | GitHub | NA | |
Ebola cases | Number of Ebola Cases and Deaths in Affected Countries (2014) | https://data.hdx.rwlabs.org/dataset/ebola-cases-2014 | Healthcare | | 2014 | |
eBay Online | eBay Online Auctions (2012) | http://www.modelingonlineauctions.com/datasets | Machine Learning | | 2012 | |
New Yorker Captions | New Yorker caption contest ratings | https://github.com/nextml/caption-contest-data | Machine Learning | GitHub | NA | |
Cooper-Hewitt's Collection | Cooper-Hewitt's Collection Database | https://github.com/cooperhewitt/collection | Museums | GitHub | NA | |
Minneapolis Institute | Minneapolis Institute of Arts metadata | https://github.com/artsmia/collection | Museums | GitHub | NA | |
Tate Collection | Tate Collection metadata | https://github.com/tategallery/collection | Museums | GitHub | NA | |
Google 5gram | Google Web 5gram (1TB, 2006) | https://catalog.ldc.upenn.edu/LDC2006T13 | Natural Language | | 2006 | |
Arabic, 30K articles | SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles) | https://github.com/ParallelMazen/SaudiNewsNet | Natural Language | GitHub | NA | |
USENET postings | USENET postings corpus of 2005~2011 | http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html | Natural Language | | 2005 | |
Datahub.io | Datahub.io | https://datahub.io/dataset | Search Engines | | NA | |
Twitter Scrape CIKM | Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape | https://archive.org/details/twitter_cikm_2010 | Social Networks | | 2009 | |
Facebook Data | Facebook Data Scrape (2005) | https://archive.org/details/oxford-2005-facebook-matrix | Social Networks | | 2005 | |
LAW graphs | Facebook Social Networks from LAW (since 2007) | http://law.di.unimi.it/datasets.php | Social Networks | | 2007 | |
Foursquare from | Foursquare from UMN/Sarwat (2013) | https://archive.org/details/201309_foursquare_dataset_umn | Social Networks | | 2013 | |
Skytrax' Air | Skytrax' Air Travel Reviews Dataset | https://github.com/quankiquanki/skytrax-reviews-dataset | Social Networks | GitHub | NA | |
Twitter Scrape | Twitter Scrape Calufa May 2011 | http://archive.org/details/2011-05-calufa-twitter-sql | Social Networks | | 2011 | |
Youtube Video | Youtube Video Social Graph in 2007,2008 | http://netsg.cs.sfu.ca/youtubedata/ | Social Networks | | 2007 | |
FBI Hate Crime 2013 | FBI Hate Crime 2013 - aggregated data | https://github.com/emorisse/FBI-Hate-Crime-Statistics/tree/master/2013 | Social Sciences | GitHub | 2013 | |
GSS | General Social Survey (GSS) since 1972 | http://gss.norc.org | Social Sciences | | 1972 | |
Texas Inmates | Texas Inmates Executed Since 1984 | http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html | Social Sciences | | 1984 | |
Formula 1 | Ergast Formula 1, from 1950 up to date (API) | http://ergast.com/mrd/db | Sports | | 1950 | |
Pinhooker: Thoroughbred | Pinhooker: Thoroughbred Bloodstock Sale Data | https://github.com/phillc73/pinhooker | Sports | GitHub | NA | |
Airlines OD | Airlines OD Data 1987-2008 | http://stat-computing.org/dataexpo/2009/the-data.html | Transportation | | 2008 | |
BSS | Bike Share Systems (BSS) collection | https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems | Transportation | GitHub | NA | |
NYC Taxi | NYC Taxi Trip Data 2009- | http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml | Transportation | | 2009 | |
FOIA/FOILed | NYC Taxi Trip Data 2013 (FOIA/FOILed) | https://archive.org/details/nycTaxiTripData2013 | Transportation | | 2013 | |
NYC Uber | NYC Uber trip data April 2014 to September 2014 | https://github.com/fivethirtyeight/uber-tlc-foil-response | Transportation | GitHub | 2014 | |
Open Traffic | Open Traffic collection | https://github.com/graphhopper/open-traffic-collection | Transportation | GitHub | NA | |
Plane Crash | Plane Crash Database, since 1920 | http://www.planecrashinfo.com/database.htm | Transportation | | 1920 | |
U.S. Domestic | U.S. Domestic Flights 1990 to 2009 | http://academictorrents.com/details/a2ccf94bbb4af222bf8e69dad60a68a29f310d9a | Transportation | | 2009 | |
U.S. Freight | U.S. Freight Analysis Framework since 2007 | http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm | Transportation | | 2007 | |
Data Packaged | Data Packaged Core Datasets | https://github.com/datasets/ | Complementary Collections | GitHub | NA | |
USDA PLANTS | U.S. Department of Agriculture's PLANTS Database | http://www.plants.usda.gov/dl_all.html | Agriculture | | NA | |
ClueWeb09 | ClueWeb09 - 1B web pages | http://lemurproject.org/clueweb09/ | Computer Networks | | 2009 | |
ClueWeb12 | ClueWeb12 - 733M web pages | http://lemurproject.org/clueweb12/ | Computer Networks | | 2012 | |
DEFRA Projects | DEFRA Science and Research Projects data | http://randd.defra.gov.uk/ | Energy | | NA | |
UK-DALE | UK Domestic Appliance-Level Electricity (UK-DALE) dataset | http://www.doc.ic.ac.uk/~dk3810/data/ | Energy | | 2016 | |
Landsat 8 | Landsat 8 on AWS | https://aws.amazon.com/public-data-sets/landsat/ | GIS | Amazon | NA | |
Reverse Geocode | Simple but fast reverse geocoding up to city granularity level | https://github.com/kno10/reversegeocode | GIS | GitHub | NA | |
Faces Database | 10k US Adult Faces Database | http://wilmabainbridge.com/facememorability2.html | Image Processing | | NA | |
ClueWeb09 FACC | ClueWeb09 FACC | http://lemurproject.org/clueweb09/FACC1/ | Natural Language | | 2009 | |
ClueWeb12 FACC | ClueWeb12 FACC | http://lemurproject.org/clueweb12/FACC1/ | Natural Language | | 2012 | |
Google Ngrams | Google Books Ngrams (2.2TB) | https://aws.amazon.com/datasets/google-books-ngrams/ | Natural Language | Amazon | NA | |
EDRM Enron | EDRM Enron EMail of 151 users, hosted on S3 | https://aws.amazon.com/datasets/enron-email-data/ | Social Networks | Amazon | NA | |
GetGlue | GetGlue - users rating TV shows | http://getglue-data.s3.amazonaws.com/getglue_sample.tar.gz | Social Networks | | NA | |
Twitter RepLab | Twitter Data for Online Reputation Management | http://nlp.uned.es/replab2013/ | Social Networks | | 2013 | |