
@r-sal
Last active February 18, 2023 16:57
List of various types of datasets

Dataset Bookmarks


Data, Data, Data: Thousands of Public Data Sources
Machine Learning Datasets - Sources for specific Machine Learning datasets
UCI Machine Learning Repository: Data Sets

Datasets

Misc

USA 2011 Car crash data - A collection of data about people involved in fatal car accidents, including their final injuries, alcohol/drug test results, and other relevant data about the accident and the person. Source: Fatality Analysis Reporting System (FARS) Encyclopedia

Environmental

Environmental Hazard Rank - The EDR Environmental Hazard Ranking System depicts the relative environmental health of any U.S. ZIP code based on an analysis of its environmental issues. It uses a geographic information system to parse data from NEDIS™, EDR's proprietary master database, which contains more than 3.1 billion records of potential and real environmental hazards culled from over 1,400 continually updated databases. The system assigns points to environmental records based on their hazard level and approximate cleanup cost. The results are then aggregated by ZIP code to produce a rank, so you can see how the ZIP code you're interested in compares with others.
Worldwide Historical Weather Data

Open Companies data sources

Machine Learning Challenges

Various machine learning challenges publish the data used in their competitions. Often this data remains available even after the challenge is closed.

Marketplaces and data hubs

Commercial marketplaces and non-commercial data hubs
http://bitly.com/bundles/bigmlcom/3
Infochimps
Windows Azure Marketplace (Free Datasets)

Sort


[Everything You Wanted to Know About Machine Learning, But Were Too Afraid To Ask (Part One)](http://blog.bigml.com/2013/02/15/everything-you-wanted-to-know-about-machine-learning-but-were-too-afraid-to-ask-part-one/)
[research-quality data sets](https://bitly.com/bundles/hmason/1)
[get.theinfo](https://groups.google.com/forum/?fromgroups#!forum/get-theinfo)
[Google Ngram Datasets](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
[Facebook100 Data Set](http://masonporter.blogspot.com/2011/02/facebook100-data-set.html) - Includes the complete set of people and friendships from the Facebook networks of 100 different colleges and universities, taken from a single snapshot in September 2005.

http://theinfo.org/
http://infochimps.org/datasets
http://ckan.org [Comprehensive Knowledge Archive Network]
http://www.datawrangling.com/some-datasets-available-on-the-web.html
http://del.icio.us/pskomoroch/dataset
http://www.reddit.com/r/datasets/
http://news.ycombinator.com/item?id=1242029
http://www.reddit.com/r/opendata
http://www.trustlet.org/wiki/Repositories_of_datasets
http://www.daniel-lemire.com/blog/data-for-data-mining/
http://www.quantlet.org/mdbase/
http://datamob.org/
http://freebase.com/
http://infochimp.info/ics/data/ripd/www-personal.umich.edu/~mejn/netdata/
http://www.archive-it.org/public/all_collections

Large:
http://www.ckan.net/tag/read/size-large
http://www.diggingintodata.org/Repositories/tabid/167/Default.aspx

Web as corpus:
Good instructions: http://corpus.leeds.ac.uk/internet.html#description
http://sslmit.unibo.it/~baroni/bootcat.html
http://www.drni.de/wac-tk/index.php/Documentation

http://radar.oreilly.com/2010/03/open-data-pointers.html
http://www.datawrangling.com/some-datasets-available-on-the-...
http://del.icio.us/pskomoroch/dataset
http://infochimps.com/collections/datamob (and the other collections on the site)

http://data.stackexchange.com/
http://www.data.gov/

Open Data sources


Data from International Bodies like the UN, IMF etc.
