With the open data and reproducible research movements, it’s becoming more and more common for researchers and analysts to make datasets public. But just as putting your code on GitHub as is doesn’t make it a good open source project, putting your zipped CSV files on a website doesn’t make it a good open dataset. For example, it’s not uncommon to have to spend half the length of a project just cleaning a dataset.
This talk is about pitfalls commonly encountered when working with unfamiliar datasets, and how to help your audience avoid those pitfalls when you publish your own datasets. This is a “best practices” talk, but along with strategies for dealing with the issues, it will also mention relevant Python libraries, tools, and techniques that might help tackle each problem.
-
I work at the Center for International Development - we study developing countries and have economic theories about why some develop better than others. We do a lot of projects with country governments: they give us datasets, we analyze them, and we give back an interactive data visualization website.
-
We quickly found that “having the data” means nothing - is it any good? We spent months cleaning this social security dataset because the form looked like this: (picture of a free-entry text box instead of a dropdown / pic of a 4-year-old classification system still being used today)
-
Often scientists generate the dataset for something - so why not make it a pleasure to use?
-
Golden rule: Do unto others ...
- Consider your audience, put yourself into their shoes.
- Experiment: Eat your own dogfood. Switch computers and try using your dataset. What does it take?
- You don’t know how your dataset is going to be used. Data has a way of being used for different purposes - for example, cellphone call records being used to predict population density.
- Bad data often comes from a mismatch between the user’s expectations and the creator’s purpose.
-
Have a schema (codebook / data dictionary)
- Numbers and data mean nothing by themselves.
- Don’t assume that something will be obvious to the reader from the context
- Open standards are cool but not necessary, as long as the user can use it: http://data.okfn.org/standards or RDF etc
- Runnable schemas are always better!
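- A minimal sketch of what a runnable schema could look like - the codebook expressed as code the reader can execute against the data. The file name, columns, and types are hypothetical:

```python
# A runnable schema: the codebook expressed as code the reader can execute.
# File name, columns, and types below are hypothetical.
import pandas as pd

SCHEMA = {
    "country_code": str,   # e.g. ISO 3166-1 alpha-3 like "TUR"
    "year": int,
    "gdp_usd": float,
}

df = pd.read_csv("countries.csv", dtype={"country_code": str})

# Every documented column exists, and nothing undocumented sneaks in.
assert set(df.columns) == set(SCHEMA), "columns don't match the codebook"

# Each column can actually be read as its documented type.
for column, expected_type in SCHEMA.items():
    df[column].astype(expected_type)  # raises if a value doesn't fit
```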
-
Have proper types (or you will shoot yourself in the foot)
- Things that are not numbers:
- Social security “numbers”, Classification systems, Phone numbers, Location codes
- Numbers have mathematical properties; these don’t. Does it make sense to add two of them together? Does it make sense to add a zero onto the left? (See the sketch after this list.)
- Define what “nothing” means and be consistent.
- No data available (N/A) identifier
- Empty field
- Zero
- If you have a hierarchy of classifications, make sure levels can’t be confused - maybe have an overarching scheme. Making readers derive the parent level by cutting digits off the code is not smart.
- Format things nicely
- Machine readable
- Don’t include units or markings like decimal commas
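- A quick sketch of the classic way identifier-like columns get mangled when treated as numbers, and how to avoid it with pandas (file and column names are hypothetical):

```python
# Why identifier-like columns must be read as text, not numbers.
# File and column names are hypothetical.
import pandas as pd

# Read naively: a ZIP code like "02138" becomes the integer 2138,
# silently losing its leading zero.
naive = pd.read_csv("people.csv")

# Read with explicit types: identifiers stay exactly as written.
typed = pd.read_csv("people.csv", dtype={"zip_code": str, "ssn": str})
```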
-
Be consistent
- Consistent formats - dates, numbers, units
- Consistent null values
- Consistent naming schemes
- Consistent layout
- Anything that is "bad" about a dataset is tolerable, as long as I have to fix it only once.
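- A small sketch of what inconsistency costs the reader - every variant null marker or number format (all hypothetical here) is one more special case someone has to discover before they can fix it “only once”:

```python
# Every variant null marker or number format is one more special case
# the reader has to discover and handle. Values are hypothetical.
import numpy as np
import pandas as pd

raw = pd.DataFrame({"population": ["1200", "N/A", "", "1,500", "unknown"]})

# Normalize all the null spellings, strip the thousands separator, convert -
# this only works once you know every variant that actually occurs.
cleaned = (raw["population"]
           .replace({"N/A": np.nan, "": np.nan, "unknown": np.nan})
           .str.replace(",", "", regex=False))
raw["population_clean"] = pd.to_numeric(cleaned)
```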
-
Double check your dataset
- Does your data match your schema?
- Are your fields all unique?
- Do any of your fields mix values of different types?
- Do your uniqueness constraints hold?
- Clean is not the same as clean-looking! Clean-looking data is not more reliable. Do NOT drop data to make things look prettier - you could lose crucial data and lead people to wrong conclusions.
- Acid test: could you test a hypothesis on your dataset /today/, or would you first have to spend time cleaning it?
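- A sketch of a few of these double-checks automated with pandas - the file, columns, and constraints are assumptions for illustration:

```python
# A few automated double-checks against a hypothetical codebook.
# File, column names, and constraints are assumptions for illustration.
import pandas as pd

df = pd.read_csv("trade.csv", dtype={"country_code": str})

# Does the data match the schema?
assert set(df.columns) == {"country_code", "year", "export_value"}

# Do the uniqueness constraints hold?
assert not df.duplicated(subset=["country_code", "year"]).any()

# Do any fields mix values of different types?
assert df["export_value"].map(type).nunique() == 1, "export_value mixes types"
```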
-
Errors propagate through steps of calculation!
-
Have a clear process and methodology
- Provenance!
- How was the data collected? Through what means? Was it a survey? Was it scraped off of websites? Was it OCR’d off paper? Was it through image recognition? Was it through some device? What could have gone wrong during that process?
- Writing methodology in a paper is not as concrete as runnable code! You can accidentally BS your way through.
- Keep original data intact, and have a runnable set of transformations - you never know at which step you’re losing data
- Provide abstraction, but don’t hide the lower layers
- Make the process reproducible - put it in a Vagrant image too, to reduce reproduction costs!
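- One possible shape for “keep the original intact, make the transformations runnable”: raw files are never edited in place, each cleaning step is a function, and the clean output can be regenerated from scratch at any time. Paths, columns, and steps are hypothetical:

```python
# Raw files are never edited in place; every cleaning step is a small
# function; the clean output can be regenerated from scratch at any time.
# Paths, columns, and steps are hypothetical.
import pandas as pd

def load_raw():
    # The raw file is treated as read-only.
    return pd.read_csv("raw/social_security.csv", dtype=str)

def drop_test_records(df):
    return df[df["ssn"] != "000000000"]

def normalize_nulls(df):
    return df.replace({"N/A": pd.NA, "": pd.NA})

def build():
    df = load_raw()
    for step in (drop_test_records, normalize_nulls):
        df = step(df)
    df.to_csv("clean/social_security.csv", index=False)

if __name__ == "__main__":
    build()
```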
-
Make the data easy to access
- Use non-proprietary formats
- Use future-proof formats
- Use widely supported formats
- Build an API! (pandas / ggplot examples - see the sketch below) (cool one: http://www.bls.gov/cew/doc/access/data_access_examples.htm)
- If all else fails, define the format exhaustively
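- Even without a full API, publishing a widely supported format at a stable URL already gets users most of the way there - a sketch with a placeholder URL:

```python
# When the data is published in a widely supported format at a stable URL,
# loading it can be a one-liner. The URL below is a placeholder.
import pandas as pd

url = "https://example.org/data/exports_by_country.csv"
df = pd.read_csv(url)   # straight from the published file to a DataFrame
print(df.head())
```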
-
Classification systems
-
Make datasets easily mergeable
- Standard ISO codes for countries would be cool!
- Or numbers
- Strings are often worse than identifiers because they make it harder for OTHER people to merge with your data.
- Probabilistic (fuzzy) matching sucks to do
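- A sketch of why shared identifiers beat free-text names when merging (the numbers are made up):

```python
# Shared identifiers make merging one unambiguous line; free-text names
# force fuzzy matching and manual review. Values are made up.
import pandas as pd

gdp = pd.DataFrame({"iso3": ["TUR", "USA"], "gdp_billion_usd": [720, 16700]})
pop = pd.DataFrame({"iso3": ["TUR", "USA"], "population_million": [76, 316]})

merged = gdp.merge(pop, on="iso3")

# With names like "Turkey" / "Türkiye" / "Republic of Turkey" instead of
# codes, every near-miss becomes a manual decision.
```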
-
Having datasets normalized is helpful sometimes
- Keeping “Country Name” in a separate lookup table instead of joining it into every row might be a good idea!
-
Having datasets disaggregated is helpful sometimes
- GDP and population are better than GDP per capita
- Municipality level is better than metropolitan area level
- Descriptive statistics can hide data - Anscombe’s quartet
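- A small sketch of why: disaggregated columns let the reader derive the aggregate, but not the other way around (numbers are made up):

```python
# Disaggregated columns let the reader derive the aggregate, but not the
# other way around. Numbers are made up.
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "B"],
    "gdp_usd": [500_000_000, 500_000_000],
    "population": [1_000_000, 50_000_000],
})

# Per-capita figures can always be recomputed from the parts...
df["gdp_per_capita"] = df["gdp_usd"] / df["population"]

# ...but if only gdp_per_capita were published, GDP and population could
# not be recovered, and re-aggregating to regions would be impossible.
```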
-
Bad data examples
- http://okfnlabs.org/bad-data/
- My collected examples from Google Docs
-
General advice
- Fix data issues before you have them.
- Fix data BEFORE you need it fixed. Sanitizing input is always better: it catches problems while the data still has context and the expert is still around.
- If you’re aggregating different datasets of the same thing, fix column names and standardize formats BEFORE processing. For example, define a target format, convert every year’s file to that format, then process all of the standardized data in one go. Contain complexity at the right level rather than duplicating cleanup code. (See the sketch below.)
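A sketch of the “standardize first, process once” pattern - each source file gets a small adapter into one shared format, and everything downstream sees only that format (file names and column fixes are hypothetical):

```python
# "Standardize first, process once": each source file gets its own small
# adapter into one shared format; all downstream code sees only that format.
# File names and column fixes are hypothetical.
import pandas as pd

COLUMNS = ["year", "country_code", "export_value"]

def standardize_2013(path):
    df = pd.read_csv(path, dtype={"ctry": str})
    df = df.rename(columns={"ctry": "country_code", "exports": "export_value"})
    return df[COLUMNS]

def standardize_2014(path):
    # 2014 already uses the target column names.
    return pd.read_csv(path, dtype={"country_code": str})[COLUMNS]

frames = [standardize_2013("exports_2013.csv"),
          standardize_2014("exports_2014.csv")]
combined = pd.concat(frames, ignore_index=True)
```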
Fellow at Harvard CID, originally from Istanbul, Turkey, worked at a bunch of places from startups to KAYAK.
More: http://akmanalp.com/pages/about.html Resume: http://www.linkedin.com/in/maliakmanalp Twitter: https://twitter.com/makmanalp
- PyCon 2015 Montreal
- Actually polished and comprehensive version of the DataCon talk below
- Slides: http://akmanalp.com/other-peoples-data/
- Video: https://www.youtube.com/watch?v=_eQ_8U5kruQ&feature=youtu.be
- https://github.com/open-contracting/standard
- Harvard (work)
- Gave an hour long workshop on how websites work: https://www.youtube.com/watch?v=SFtQUxksxZE and http://akmanalp.com/how-websites-work/
- Boston Datacon
- On practical methods for data cleaning - done on short notice: https://github.com/makmanalp/datacon-talk-2014
- Boston Python (less relevant subject matter)
- On a distributed ID generator: http://www.youtube.com/watch?v=SCQPBGi_QRk and http://akmanalp.com/simpleflake_presentation/
- On a flask boilerplate: http://www.youtube.com/watch?v=L4Mfq985png and http://akmanalp.com/chassis_presentation/