This document tracks progress, ideas and source code for dark data extraction systems. These systems use statistical inference to extract, integrate and clean data from unstructured ("dark") sources such as forum posts and web pages. Data programming is the predominant paradigm for dark data extraction: noisy, conflicting user-defined labelling functions are supplied to a generative model, which can recover the parameters of the labelling process. Wherever possible, my projects are based on Snorkel/DeepDive.
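For reference, the data programming loop looks roughly like the sketch below. This is a minimal sketch assuming the Snorkel 0.9-style labeling API; the keyword heuristics and the toy forum sentences are purely illustrative, not part of any project.

```python
# Minimal data-programming sketch (assumes Snorkel >= 0.9; the heuristics
# and the toy forum sentences are illustrative only).
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NO_INTERACTION, INTERACTION = -1, 0, 1

@labeling_function()
def lf_interacts_keyword(x):
    # Noisy heuristic: "interacts with" suggests an interaction mention.
    return INTERACTION if "interacts with" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_negation(x):
    # Conflicting heuristic: explicit negation suggests no interaction.
    return NO_INTERACTION if "no interaction" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Metformin interacts with my sertraline, I had to change the dose.",
    "There was no interaction between the two as far as I could tell.",
    "Ibuprofen interacts with lisinopril according to several posters.",
    "Just started a new med, nothing to report yet.",
]})

# Apply the labelling functions to obtain a label matrix L (n_examples x n_LFs).
applier = PandasLFApplier([lf_interacts_keyword, lf_negation])
L_train = applier.apply(df_train)

# The generative label model estimates the accuracies of the noisy, conflicting
# labelling functions and outputs probabilistic training labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)
print(label_model.predict_proba(L_train))
```

The point is that no labelling function is trusted individually; the generative model weighs them by their estimated accuracies and emits probabilistic labels for downstream training.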
Ideas (Extensions for the system):
- There doesn't appear to be any work on domain-specific primitives (DSPs) for audio data. Pre-trained audio models (e.g. VGGish) can serve as feature extractors for high-level concepts: emotion, accent and personality in speech (the WaveNet paper mentions these are possible), musical genre (Sander Dieleman's Spotify CNN blog post), etc. See the sketch below for how such a primitive could plug into labelling functions.
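A sketch of what an audio DSP could look like, under heavy assumptions: `load_pretrained_embedder` below is a hypothetical stand-in for a real pre-trained model such as VGGish (which returns 128-dimensional embeddings), and the "agitated speech" concept, the linear probe and the 0.8 threshold are all illustrative.

```python
# Sketch of an audio domain-specific primitive (DSP). A real system would
# replace `load_pretrained_embedder` with an actual pre-trained model (e.g. a
# VGGish checkpoint); here a random projection stands in so the sketch runs.
import numpy as np

def load_pretrained_embedder(dim=128, clip_len=16000):
    """Hypothetical loader: returns a callable mapping a 1 s / 16 kHz clip
    (a float array of length 16000) to a `dim`-dimensional embedding."""
    rng = np.random.default_rng(0)
    W = rng.normal(size=(clip_len, dim))
    return lambda waveform: np.asarray(waveform)[:clip_len] @ W / clip_len

embed = load_pretrained_embedder()

class EmotionPrimitive:
    """A DSP: a small linear probe over embeddings, trained on a tiny labeled
    set, exposed as a reusable score for labelling functions."""
    def __init__(self, probe_weights):
        self.probe_weights = probe_weights

    def score(self, waveform):
        # Higher score = more likely to be (say) agitated speech.
        logit = embed(waveform) @ self.probe_weights
        return float(1.0 / (1.0 + np.exp(-logit)))

ABSTAIN, POSITIVE = -1, 1

def lf_agitated_speech(waveform, primitive):
    # Labelling functions threshold the primitive's score, not the raw audio.
    return POSITIVE if primitive.score(waveform) > 0.8 else ABSTAIN

# Usage with illustrative data:
primitive = EmotionPrimitive(np.random.default_rng(1).normal(size=128))
clip = np.random.default_rng(2).normal(size=16000)
print(lf_agitated_speech(clip, primitive))
```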
Ideas (Applications):
- Ecological/environmental monitoring: use audio DSPs to build models of animal migration and to detect logging/poaching, etc.
- Digital humanities: understudied history and archaeology archives. Concrete problem: discover trading
- Drug repurposing: build a database of serendipitous drug interactions from mentions on internet discussion forums (see the candidate-extraction sketch after this list).
- Macro-economic indicators (like the Michigan PhD thesis on labor market flows from Twitter data).
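To make the drug-repurposing idea concrete, the sketch below generates candidate drug-pair mentions from forum sentences; the tiny drug lexicon, the example posts and the regex-based sentence splitting are illustrative assumptions (a real pipeline would use a full drug lexicon and a proper parser).

```python
# Sketch: candidate generation for drug-interaction mentions in forum posts.
# The lexicon and example posts are illustrative; candidates produced here
# would then be labeled by labelling functions like the ones sketched above.
import re
from itertools import combinations

DRUG_LEXICON = {"metformin", "ibuprofen", "sertraline", "lisinopril"}

posts = [
    "Started metformin last month and my sertraline seems to work better.",
    "Ibuprofen alone, no other meds, nothing interesting to report.",
]

def sentences(text):
    # Crude sentence splitter; good enough for a sketch.
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def drug_pair_candidates(post):
    """Yield (drug_a, drug_b, sentence) for sentences mentioning >= 2 drugs."""
    for sent in sentences(post):
        tokens = re.findall(r"[a-z]+", sent.lower())
        mentioned = sorted(set(tokens) & DRUG_LEXICON)
        for a, b in combinations(mentioned, 2):
            yield (a, b, sent)

for post in posts:
    for candidate in drug_pair_candidates(post):
        print(candidate)
```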