Ingestion web interface

Interface for our team to upload vendor lists from e.g. Catalist, L2, etc.

User-facing features

User must specify source -- Can use pre-existing sources, or can make their own which becomes part of the choices for other users
Users can see status of their uploaded list -- i.e., has it been ETL'd yet and loaded into the main voter list database for querying?

Backend

To be continued ...

Meta-information

Identity of user who uploaded the vendor list
Time of upload
(maybe) Original date/time of file?
Additional fields for more info about the voter list (for example, description, vendor SQL query, probably a lot more stuff I can't think of)

ETL

Data normalization

Harmonize different fields of interest (e.g. first name, last name, age, race, gender, political affiliation, precinct, districts, ID, phone numbers, probably a lot more too...)
Creation of new fields? (Like source, probably)

Deduplication

Updating records for someone who gets uploaded again (Will need to think about this issue more.)

Once data has been deduplicated, will need to ensure that there is a relationship stored between pre- and post- deduplicated rows. See ways to do this below under tables.

Application logic

For 'standard' sources (like Catalist, L2, etc.) the ETL will be easier to achieve because of the consistency of fields present in those voter lists. However, for client-supplied data, we obviously cannot expect any consistency. ETL'ing those files will be more difficult.

Possible solutions:

Completely automated approach -- use heuristics to try to find the appropriate columns (like age, race, etc.). Disadvantage is this will require complicated programming.
User is forced to map certain columns from the client list to e.g. age, race, phone, address, etc -- All of the fields we want them to have. (And probably optional fields as well.) This can also include our own validation to ensure, for example, that what the user says is a phone field actually looks like a phone field. If it doesn't, we can flag the upload for review with optional notification to us on the tech team.
Our consensus is to prefer the second option.

Database

Ultimately, the ETL'd vendor lists should go into our Redshift database. Putting it in Redshift gives us the advatange being able to query these data in conjunction with the Spoke data within the spokewarehouse schema. (It also of course has the advantage of using infrastructure that already exists.)

Schema name: voterlists

Tables

A table for raw data, like Catalist or L2.

Another table with normalized, ETL'd data.

A table mapping relationships between raw and normalized rows, OR a column in the raw data tables storing the foreign id key of the associated row in the ETL'd data table. This is a many-to-one relationship (many raw rows will correspond to a single deduplicated, ETL'd row).

austinweisgrau/voterlist.md