With the open data and reproducible research movements, it’s becoming more and more common for researchers and analysts to make datasets public. But just as putting your code on GitHub as-is doesn’t make it a good open source project, putting your zipped CSV files on a website doesn’t make them a good open dataset. For example, it’s not uncommon to spend half the length of a project just cleaning a dataset.
This talk is about pitfalls commonly encountered when working with unfamiliar datasets, and how to help your audience avoid those pitfalls when you publish your own datasets. This is a “best practices” talk, but alongside strategies for dealing with each issue, it will mention relevant Python libraries, tools, and techniques that might help tackle the problem.