This document tracks progress, ideas and source code for dark data extraction systems. These systems use statistical inference to extract, integrate and clean data from unstructured ("dark") sources such as forum posts and web pages. Data programming is the predominant paradigm for dark data extraction: noisy, conflicting user-defined labelling functions are supplied to a generative model, which can recover the parameters of the labelling process. Wherever possible, my projects are based on Snorkel/DeepDive.
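For reference, the data programming loop looks roughly like the sketch below. This is a minimal sketch assuming the Snorkel 0.9-style labeling API; the keyword heuristics and the toy forum sentences are purely illustrative, not part of any project.

```python
# Minimal data-programming sketch (assumes Snorkel >= 0.9; the heuristics
# and the toy forum sentences are illustrative only).
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NO_INTERACTION, INTERACTION = -1, 0, 1

@labeling_function()
def lf_interacts_keyword(x):
    # Noisy heuristic: "interacts with" suggests an interaction mention.
    return INTERACTION if "interacts with" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_negation(x):
    # Conflicting heuristic: explicit negation suggests no interaction.
    return NO_INTERACTION if "no interaction" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Metformin interacts with my sertraline, I had to change the dose.",
    "There was no interaction between the two as far as I could tell.",
    "Ibuprofen interacts with lisinopril according to several posters.",
    "Just started a new med, nothing to report yet.",
]})

# Apply the labelling functions to obtain a label matrix L (n_examples x n_LFs).
applier = PandasLFApplier([lf_interacts_keyword, lf_negation])
L_train = applier.apply(df_train)

# The generative label model estimates the accuracies of the noisy, conflicting
# labelling functions and outputs probabilistic training labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)
print(label_model.predict_proba(L_train))
```

The point is that no labelling function is trusted individually; the generative model weighs them by their estimated accuracies and emits probabilistic labels for downstream training.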
Ideas (Extensions for the system):
- There doesn't appear to be any work on domain-specific primitives (DSPs) for audio data. Pre-trained audio models (e.g. VGGish) can serve as feature extractors for high-level concepts: emotion, accent and personality in speech (the WaveNet paper mentions these are possible), musical genre (Sander Dieleman's Spotify CNN blog post), etc. See the sketch below for how such a primitive could plug into labelling functions.
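A sketch of what an audio DSP could look like, under heavy assumptions: `load_pretrained_embedder` below is a hypothetical stand-in for a real pre-trained model such as VGGish (which returns 128-dimensional embeddings), and the "agitated speech" concept, the linear probe and the 0.8 threshold are all illustrative.

```python
# Sketch of an audio domain-specific primitive (DSP). A real system would
# replace `load_pretrained_embedder` with an actual pre-trained model (e.g. a
# VGGish checkpoint); here a random projection stands in so the sketch runs.
import numpy as np

def load_pretrained_embedder(dim=128, clip_len=16000):
    """Hypothetical loader: returns a callable mapping a 1 s / 16 kHz clip
    (a float array of length 16000) to a `dim`-dimensional embedding."""
    rng = np.random.default_rng(0)
    W = rng.normal(size=(clip_len, dim))
    return lambda waveform: np.asarray(waveform)[:clip_len] @ W / clip_len

embed = load_pretrained_embedder()

class EmotionPrimitive:
    """A DSP: a small linear probe over embeddings, trained on a tiny labeled
    set, exposed as a reusable score for labelling functions."""
    def __init__(self, probe_weights):
        self.probe_weights = probe_weights

    def score(self, waveform):
        # Higher score = more likely to be (say) agitated speech.
        logit = embed(waveform) @ self.probe_weights
        return float(1.0 / (1.0 + np.exp(-logit)))

ABSTAIN, POSITIVE = -1, 1

def lf_agitated_speech(waveform, primitive):
    # Labelling functions threshold the primitive's score, not the raw audio.
    return POSITIVE if primitive.score(waveform) > 0.8 else ABSTAIN

# Usage with illustrative data:
primitive = EmotionPrimitive(np.random.default_rng(1).normal(size=128))
clip = np.random.default_rng(2).normal(size=16000)
print(lf_agitated_speech(clip, primitive))
```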
Ideas (Applications):
- Ecological/environmental monitoring: use audio DSPs to build models of animal migration and to detect logging/poaching, etc.
- Digital humanities: understudied history and archaeology archives. Concrete problem: discover trading
- Drug repurposing: build a database of serendipitous drug interactions from mentions on internet discussion forums (see the candidate-extraction sketch after this list).
- Macro-economic indicators (like the Michigan PhD thesis on labor market flows from Twitter data).
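To make the drug-repurposing idea concrete, the sketch below generates candidate drug-pair mentions from forum sentences; the tiny drug lexicon, the example posts and the regex-based sentence splitting are illustrative assumptions (a real pipeline would use a full drug lexicon and a proper parser).

```python
# Sketch: candidate generation for drug-interaction mentions in forum posts.
# The lexicon and example posts are illustrative; candidates produced here
# would then be labeled by labelling functions like the ones sketched above.
import re
from itertools import combinations

DRUG_LEXICON = {"metformin", "ibuprofen", "sertraline", "lisinopril"}

posts = [
    "Started metformin last month and my sertraline seems to work better.",
    "Ibuprofen alone, no other meds, nothing interesting to report.",
]

def sentences(text):
    # Crude sentence splitter; good enough for a sketch.
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def drug_pair_candidates(post):
    """Yield (drug_a, drug_b, sentence) for sentences mentioning >= 2 drugs."""
    for sent in sentences(post):
        tokens = re.findall(r"[a-z]+", sent.lower())
        mentioned = sorted(set(tokens) & DRUG_LEXICON)
        for a, b in combinations(mentioned, 2):
            yield (a, b, sent)

for post in posts:
    for candidate in drug_pair_candidates(post):
        print(candidate)
```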