- File gets uploaded
- Add data to vendor list metadata table
- Clean all of the columns (list may not be complete)
a. DOB
b. Address
c. Names
d. Age
e. Race
f. Gender
g. Party affiliation
h. Voter registration status
- Match each column individually against the database's tables
i. Data found: Extract ID
j. Data not found: Insert record, assign ID & extract ID
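The match-or-insert step above can be sketched as a get-or-create helper. This is a minimal sketch: the `phones` table and column names are hypothetical, and sqlite3 stands in for the real database.

```python
import sqlite3

def get_or_create_id(conn, table, column, value):
    """Look up value in table; if absent, insert it and return the new ID.
    table/column must be trusted names, not user input (f-string SQL)."""
    row = conn.execute(
        f"SELECT id FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()
    if row:
        return row[0]  # data found: extract existing ID
    cur = conn.execute(f"INSERT INTO {table} ({column}) VALUES (?)", (value,))
    return cur.lastrowid  # data not found: insert record and extract assigned ID

# Usage with an in-memory database and a hypothetical phones table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phones (id INTEGER PRIMARY KEY, phone_number TEXT)")
first = get_or_create_id(conn, "phones", "phone_number", "5551234567")
second = get_or_create_id(conn, "phones", "phone_number", "5551234567")
```

The same helper would be repeated per column (address, name, etc.), collecting the extracted IDs for the storage table.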
- Deduping
k. Construct a unique identifier temp table
i. Last name, first name, calculated age (within a year tolerance), phone number (optional?), address
ii. Last name & first name matches, plus phone number or (matching age +/- 1 year and matching full address) → Is a match (maybe change this? Recall Austin's comments)
iii. E.g. last name & first name matches, but no matching phone number or age + full address → New record
l. Search temp table against existing database
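The dedup rule above can be sketched as a predicate over two candidate rows. Field names (`last_name`, `phone`, etc.) are illustrative; the real temp table may differ.

```python
def is_same_person(a, b):
    """Dedup rule sketch: last & first name must match, plus either a matching
    phone number or (age within +/- 1 year and matching full address)."""
    if a["last_name"].lower() != b["last_name"].lower():
        return False
    if a["first_name"].lower() != b["first_name"].lower():
        return False
    if a.get("phone") and a.get("phone") == b.get("phone"):
        return True  # name + phone match
    age_close = (
        a.get("age") is not None
        and b.get("age") is not None
        and abs(a["age"] - b["age"]) <= 1
    )
    return age_close and a.get("address") == b.get("address")

# Usage: same name, no phone on record b, but age within a year and same address
a = {"last_name": "Smith", "first_name": "Jane", "phone": "5551234567",
     "age": 40, "address": "1 Main St"}
b = {"last_name": "Smith", "first_name": "Jane", "phone": None,
     "age": 41, "address": "1 Main St"}
```

Per Austin's comments, the exact rule may change; the point is that it stays a single pure function that's easy to swap out.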
- Construct storage table
m. If a new record is found:
iv. Person ID (create)
v. Bio info (names, age, suffix, DOB, race, gender)
vi. Phone ID
vii. Address ID
viii. Voter registration ID
ix. Voter history ID
x. File upload ID
- Update records as needed (e.g. new vote history data)
n. Heuristics for constructing "canonical" records -- The most basic one is to assume more recent data is accurate
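The "more recent data wins" heuristic can be sketched as a field-by-field merge ordered by upload time. The `upload_time` key and row shape are assumptions for illustration.

```python
def canonical_record(rows):
    """Merge all rows for one person, assuming more recent data is accurate:
    later uploads overwrite earlier non-null values, field by field."""
    merged = {}
    for row in sorted(rows, key=lambda r: r["upload_time"]):
        for field, value in row.items():
            if value is not None:  # never overwrite data with a null
                merged[field] = value
    return merged

# Usage: a newer upload corrects the name but has no age; the older age survives
rows = [
    {"upload_time": 1, "first_name": "Jon", "age": 40},
    {"upload_time": 2, "first_name": "John", "age": None},
]
merged = canonical_record(rows)
```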
This is no longer step-by-step:
- Phones -- Keep vendor confidence score & vendor name
- Store metadata of where each field is from (e.g. age is from targetsmart_whatever.csv) -- but of course use the foreign key
https://docs.google.com/presentation/d/1a_t2s3yG5yTPeiui0Dir_lQzCYFY2Vw6xfIGdG_CVu0/edit#slide=id.p
Interface for our team to upload vendor lists from e.g. Catalist, L2, etc.
- User must specify source -- Can use pre-existing sources, or can make their own, which becomes part of the choices for other users
- User must specify the following about the campaign the list was for: organization and goal type.
- User can optionally add a SQL query of the data (i.e., Catalist or TS SQL query)
- Optional catch-all field for notes
- Users can see the status of their uploaded list -- i.e., has it been ETL'd yet and loaded into the main voter list database for querying?
Store information about every uploaded file in a database.
- Original filename
- S3 bucket/key of compressed data (zip)
- Success / running / failure status of ETL
- Identity of user who uploaded the vendor list
- Time of upload
- (maybe) Original date/time of file?
- Additional fields for more info about the voter list (for example, description, vendor SQL query, campaign it was uploaded for)
Stuff that the front-end should initially send to the API: numbers 1, 4, 5, 6, and 7 above
- Harmonize different fields of interest (e.g. first name, last name, age, race, gender, political affiliation, precinct, districts, ID, phone numbers, probably a lot more too...)
- Creation of new fields? (Like source, probably)
- Discern vendor source
- Column renames (Might just make new columns though)
- Harmonizing data across vendors (like race, gender, etc.)
- race: Asian, Black, Hispanic, Native American, Other, White, Unknown
- sex/gender: female, male, unknown
- party: Democratic, Republican, No Party, Unknown, Other (or maybe keeping all parties)
- voter registration: registered, unregistered
- voter status: active, inactive, unregistered
- (revisit) voting history: election type/year and yes/no
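The harmonization above amounts to per-vendor lookup tables into our canonical values. A minimal sketch, where the raw vendor codes shown are hypothetical (real vendor encodings vary and would need their own maps):

```python
# Hypothetical vendor codes -> our canonical values from the lists above
RACE_MAP = {"A": "Asian", "B": "Black", "H": "Hispanic", "N": "Native American",
            "O": "Other", "W": "White"}
GENDER_MAP = {"F": "female", "M": "male"}
PARTY_MAP = {"DEM": "Democratic", "REP": "Republican", "NPA": "No Party"}

def harmonize(raw, mapping, unknown="Unknown"):
    """Map a vendor's raw code to our canonical value, defaulting to unknown."""
    return mapping.get((raw or "").strip().upper(), unknown)

# Usage
race = harmonize("w", RACE_MAP)                       # "White"
gender = harmonize("", GENDER_MAP, unknown="unknown") # "unknown"
party = harmonize("dem", PARTY_MAP)
```

One map per vendor per field keeps the harmonization data-driven rather than hard-coded into the ETL logic.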
Updating records for someone who gets uploaded again (Will need to think about this issue more.)
Once data has been deduplicated, will need to ensure that there is a relationship stored between pre- and post- deduplicated rows. See ways to do this below under tables.
For 'standard' sources (like Catalist, L2, etc.) the ETL will be easier to achieve because of the consistency of fields present in those voter lists. However, for client-supplied data, we obviously cannot expect any consistency. ETL'ing those files will be more difficult.
Possible solutions:
- Completely automated approach -- use heuristics to try to find the appropriate columns (like age, race, etc.). Disadvantage is this will require complicated programming.
- User is forced to map certain columns from the client list to e.g. age, race, phone, address, etc -- All of the fields we want them to have. (And probably optional fields as well.) This can also include our own validation to ensure, for example, that what the user says is a phone field actually looks like a phone field. If it doesn't, we can flag the upload for review with optional notification to us on the tech team.
- Our consensus is to prefer the second option.
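The validation piece of the second option can be sketched as a heuristic check that a user-mapped column actually looks like what they claim. The function name and 80% threshold are assumptions; a failed check would flag the upload for review.

```python
import re

def looks_like_phone(values, threshold=0.8):
    """Return True if most non-empty values in the user-mapped phone column
    look like 10-digit US phone numbers (optionally with a leading 1)."""
    digits = [re.sub(r"\D", "", str(v)) for v in values if v]
    ok = sum(1 for d in digits
             if len(d) == 10 or (len(d) == 11 and d.startswith("1")))
    return bool(digits) and ok / len(digits) >= threshold

# Usage
good = looks_like_phone(["555-123-4567", "(555) 987-6543"])
bad = looks_like_phone(["Jane", "John", "Mary"])  # a name column mapped by mistake
```

Similar validators (age is numeric, state is a 2-letter code, etc.) would cover the other required fields.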
Ultimately, the ETL'd vendor lists should go into our Redshift database. Putting it in Redshift gives us the advantage of being able to query these data in conjunction with the Spoke data within the spokewarehouse schema. (It also of course has the advantage of using infrastructure that already exists.)
Schema name: voters
A table for raw data, like Catalist or L2.
Another table with normalized, ETL'd data.
A table mapping relationships between raw and normalized rows. This is a many-to-one relationship (many raw rows will correspond to a single deduplicated, ETL'd row).
- Break up uploaded raw table into smaller data chunks. Add each section individually
- Ex: take phone - look up phone on existing phone table. If phone number is already on phone table, extract phone_id
- If phone number is new, add phone number to phone table and extract assigned phone_id
- Repeat process for normalized address
- Do name last. Names are somewhat unreliable, particularly first names.
- Consider a record matched when last_name, birthdate, and phone number match. An address should be considered matched when concat(address_1, address_2) & (zip5 | city & state_code) match
- Large numbers of people use their middle name as their first name, so preferred_name should be the primary name used for matching
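The match rules above, sketched directly. The column names (`last_name`, `zip5`, `state_code`, etc.) come from the tables below; the function names are illustrative.

```python
def records_match(a, b):
    """Record rule: last_name, birthdate, and phone number all match."""
    return (
        a["last_name"].lower() == b["last_name"].lower()
        and a["birthdate"] == b["birthdate"]
        and a["phone"] == b["phone"]
    )

def addresses_match(a, b):
    """Address rule: concat(address_1, address_2) must match, plus either
    zip5 or (city & state_code)."""
    if (a["address_1"] + a["address_2"]).lower() != \
       (b["address_1"] + b["address_2"]).lower():
        return False
    return a["zip5"] == b["zip5"] or (
        a["city"].lower() == b["city"].lower()
        and a["state_code"] == b["state_code"]
    )

# Usage: zip differs (bad vendor data) but city & state_code still match
x = {"address_1": "1 Main St", "address_2": "", "zip5": "30303",
     "city": "Atlanta", "state_code": "GA"}
y = {"address_1": "1 MAIN ST", "address_2": "", "zip5": "00000",
     "city": "atlanta", "state_code": "GA"}
```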
Name: uploads
- upload_id: Int (unix timestamp)
- Source: SmallInt
- Organization: SmallInt
- Goal type: varchar
- acquisition_date: Date (optional)
- SQL query of data: Varchar
- Notes: Varchar
- Status: Varchar
- S3 url (raw data): Varchar
- source_id
- source_name
- organization_id
- organization_name
- upload_id
- upload_person_id
- preferred_name
- last_name
- middle_name
- first_name
- suffix [??]
- birthdate
- address_id
- phone_id
Name: address
- address_id (varchar)
- street_address_1 (varchar)
- street_address_2 (varchar)
- city (varchar)
- zip_5 (varchar 5)
- zip_4 (varchar 4)
- county (varchar)
- state_code (varchar 2)
- upload_id
- phone_upload_id
- phone_number
- source_id
- quality_score
- phone_upload_id
- phone_id
- phone_id
- phone_number (varchar)
- phone_id
- carrier_id
- carrier_id (1-6)
- carrier_name (varchar)
- person_id
- phone_id
- upload_person_id
- person_id
- preferred_name varchar
- first_name varchar
- middle_name varchar
- last_name varchar
- birthdate date
- age [use age + unix_date]
Date is preferred over a Unix timestamp because many data sources normalize birthdates to month/year and insert a dummy number for the day. Extracting month/year will provide stronger deduplication.
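The month/year point above suggests deduplicating on a truncated birthdate key rather than the full date. A minimal sketch:

```python
from datetime import date

def birthdate_key(d):
    """Dedup on (year, month) only, since many vendors insert a dummy day."""
    return (d.year, d.month)

# Two records whose vendors used different dummy days still collide on the key
k1 = birthdate_key(date(1980, 6, 1))
k2 = birthdate_key(date(1980, 6, 15))
```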
Name: person_phones
- ml_id
- phone_id
--
Name: voters
(Maybe change this)
Columns:
- ID: Integer (INT4) -- Should have unique constraint
- First name: Varchar
- Last name: Varchar
- Cell phone: Bigint
- Street address 1: Varchar
- Street address 2: Varchar
- City: Varchar
- County: Varchar
- State: Char (length of 2)
- House district: Varchar (Or maybe integer, if we can normalize)
- Senate district: Varchar (Same as above)
- Age: Smallint
- Date of birth: Bigint
- Gender: varchar
- Race: Varchar
- Party: Varchar
- Voter registered: Bool
- Voter active if registered: Bool
Name: history
Columns:
- voter_id: Maps to "ID" in the voters table
- year
- month
- day
- type: Type of election -- e.g. "primary", "general", "local" -- We should define an exhaustive list of these types
- voted: boolean
- method: Manner of voting (in-person, mail, etc.)
Name: spoke_id
Columns:
- id: Bigint (our own internal ID for the person)
- campaign_contact_id: Bigint
Name: vendor_id
Columns:
- id: Bigint (our own internal ID for the person)
- vendor_id: Varchar
If IDs can possibly be the same across vendors, will need another column with the vendor the ID is from.
-- SQL for creating database tables
CREATE TABLE voters (
id int,
first_name varchar,
middle_name varchar,
last_name varchar,
cell bigint,
age smallint,
gender varchar,
race varchar,
voter_status varchar,
vendor_id int,
registration_address_line_1 varchar,
registration_address_line_2 varchar,
registration_city varchar,
registration_state varchar,
registration_zipcode int,
mailing_street_address_1 varchar,
mailing_street_address_2 varchar,
mailing_city varchar,
mailing_state varchar,
mailing_zipcode int,
county varchar,
precinct varchar,
congressional_district varchar,
state_house_district varchar,
state_senate_district varchar,
party varchar,
date_of_birth date
);
CREATE TABLE spoke_id (
id int,
campaign_contact_id int
);
CREATE TABLE vendor_id (
id int,
vendor_id varchar
);
- There should be a table storing references to all of our uploaded vendor lists (which will be stored in S3)
- How to make these vendor lists queryable? NoSQL, AWS Athena (transform CSVs to Parquet)?
Identifying positive responses
- NLP to try to find most positive responses?
- Marking responses in Spoke itself? Or some other step in the campaign building process?
- Subcategories of positive responses
Various suggestions:
- Spit out cleaned data for Spoke upload?
- Stop uploading so many custom fields to Spoke? Just store them in this new database? (Our current consensus of what is best -- 12/3/20).
Type of data available: Various demographic and personal data across vendors (like birth day, phone number, name, address, etc.)
Spoke data: Wrong number tags, custom fields
Unresolved questions and additional ideas:
- Dealing with custom fields from the voter file: Should they 1) all be uploaded to Spoke? 2) A curated subset uploaded to Spoke? 3) Not uploaded to Spoke, but just sent to the data warehouse?
- How to label responses? Integrate with the reporting app? Labeling responses in Spoke itself?
- List cleaning -- Should the list upload program also clean the list for the user? (Almost certainly so, since this is easy to implement and seems to have no downsides.)
- In our Spoke data, we haven't labeled addresses. Heuristics based on the goal_type field:
Reg Address:
- Candidate ID
- Voter Edu
- Voter Reg
- Voter Persuasion
- GOTV
- VBM
- Take Action
- Candidate Recruit
- Fundraising (this one could go either way, but we normally get these lists from clients, so let’s assume reg address)
Mailing Address:
- Event Invite
- Vol Recruit
- Share Resources
- Which address to use as default if we only have one (like from a client)? (Probably should use registration.)
- Additional fields to keep: (discussed on 01-14-2021)
- cell phone confidence scores (assignment & connectivity)
- landline number
- city council district
- partisan score
- email address
- vote history (this one is annoying, but we probably should deal with this)
- Confirm a cell phone as existing if it worked in a Spoke campaign. If a first name was used, the number is also more likely to be correctly assigned.
- Use errors from the Spoke database to identify landlines and undeliverables.
Marking uploads as do not use.
How to handle wrong number tags?
Filter out wrong numbers and opt-outs from front-end for getting data from database.
Accept POST requests with JSON in the following format:
{
"headers": [
"some",
"list",
"of",
"CSV",
"headers"
],
"map_cols": {
"ID": {
"required": false
},
"First name": {
"required": false
},
"Last name": {
"required": false
},
"Cell phone": {
"required": true
},
"Street address 1": {
"required": false
},
"Street address 2": {
"required": false
},
"Combined address": {
"required": false
},
"City": {
"required": false
},
"County": {
"required": false
},
"State": {
"required": false
},
"House district": {
"required": false
},
"Senate district": {
"required": false
},
"Age": {
"required": false
},
"Date of birth": {
"required": false
},
"Gender": {
"required": false
},
"Race": {
"required": false
},
"Party": {
"required": false
},
"Voter registration": {
"required": false
},
"Voter activity (if registered)": {
"required": false
}
}
}
When the user clicks "Submit", it will then POST the results to an endpoint (TBD, something like localhost:5000/api/colmap).
The POSTED results should be in this format:
{
"ID": "some_col_from_the_csv",
"First name": "some_other_col"
}
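Server-side, the POSTed column map can be validated against the `headers` and `map_cols` from the earlier request. A sketch, with a hypothetical function name:

```python
def validate_colmap(colmap, headers, map_cols):
    """Check a POSTed column map: every required field must be mapped,
    and every mapped value must be a real CSV header."""
    errors = []
    for field, spec in map_cols.items():
        if spec.get("required") and field not in colmap:
            errors.append(f"missing required field: {field}")
    for field, col in colmap.items():
        if col not in headers:
            errors.append(f"unknown CSV column: {col}")
    return errors

# Usage: "Cell phone" is the only required field per the JSON above
headers = ["fname", "lname", "cell"]
map_cols = {"First name": {"required": False}, "Cell phone": {"required": True}}
ok = validate_colmap({"Cell phone": "cell"}, headers, map_cols)
errs = validate_colmap({"First name": "nickname"}, headers, map_cols)
```

An empty error list means the ETL can proceed; otherwise the response would echo the errors back to the front-end.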
From Austin:
Illustration of various potential data flows: https://docs.google.com/presentation/d/1IblvOsnBxiop_E1OnaZCw7yQmavsh-KgLLthXWgIWZw/edit#slide=id.p