Interface for our team to upload vendor lists from e.g. Catalist, L2, etc.
-
User must specify source -- Can use pre-existing sources, or can make their own which becomes part of the choices for other users
-
Users can see status of their uploaded list -- i.e., has it been ETL'd yet and loaded into the main voter list database for querying?
To be continued ...
- Identity of user who uploaded the vendor list
- Time of upload
- (maybe) Original date/time of file?
- Additional fields for more info about the voter list (for example, description, vendor SQL query, probably a lot more stuff I can't think of)
- Harmonize different fields of interest (e.g. first name, last name, age, race, gender, political affiliation, precinct, districts, ID, phone numbers, probably a lot more too...)
- Creation of new fields? (Like source, probably)
Updating records for someone who gets uploaded again (Will need to think about this issue more.)
Once data has been deduplicated, will need to ensure that there is a relationship stored between pre- and post- deduplicated rows. See ways to do this below under tables.
For 'standard' sources (like Catalist, L2, etc.) the ETL will be easier to achieve because of the consistency of fields present in those voter lists. However, for client-supplied data, we obviously cannot expect any consistency. ETL'ing those files will be more difficult.
Possible solutions:
- Completely automated approach -- use heuristics to try to find the appropriate columns (like age, race, etc.). Disadvantage is this will require complicated programming.
- User is forced to map certain columns from the client list to e.g. age, race, phone, address, etc -- All of the fields we want them to have. (And probably optional fields as well.) This can also include our own validation to ensure, for example, that what the user says is a phone field actually looks like a phone field. If it doesn't, we can flag the upload for review with optional notification to us on the tech team.
- Our consensus is to prefer the second option.
Ultimately, the ETL'd vendor lists should go into our Redshift
database. Putting it in Redshift gives us the advatange being
able to query these data in conjunction with the Spoke data
within the spokewarehouse
schema. (It also of course has the
advantage of using infrastructure that already exists.)
Schema name: voterlists
A table for raw data, like Catalist or L2.
Another table with normalized, ETL'd data.
A table mapping relationships between raw and normalized rows, OR a column in the raw data tables storing the foreign id key of the associated row in the ETL'd data table. This is a many-to-one relationship (many raw rows will correspond to a single deduplicated, ETL'd row).
Illustration of various potential data flows: https://docs.google.com/presentation/d/1IblvOsnBxiop_E1OnaZCw7yQmavsh-KgLLthXWgIWZw/edit#slide=id.p