- File gets uploaded
- Add data to vendor list metadata table
- Clean all of the columns (list may not be complete)
a. DOB
b. Address
c. Names
d. Age
e. Race
f. Gender
g. Party affiliation
h. Voter registration status
- Match each column individually against the database's tables
i. Data found: Extract ID
j. Data not found: Insert record, assign ID & extract ID
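The match-or-insert step above can be sketched as a get-or-create helper. This is a minimal sketch: the `phones` table and column names are hypothetical, and sqlite3 stands in for the real database.

```python
import sqlite3

def get_or_create_id(conn, table, column, value):
    """Look up value in table; if absent, insert it and return the new ID.
    table/column must be trusted names, not user input (f-string SQL)."""
    row = conn.execute(
        f"SELECT id FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()
    if row:
        return row[0]  # data found: extract existing ID
    cur = conn.execute(f"INSERT INTO {table} ({column}) VALUES (?)", (value,))
    return cur.lastrowid  # data not found: insert record and extract assigned ID

# Usage with an in-memory database and a hypothetical phones table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phones (id INTEGER PRIMARY KEY, phone_number TEXT)")
first = get_or_create_id(conn, "phones", "phone_number", "5551234567")
second = get_or_create_id(conn, "phones", "phone_number", "5551234567")
```

The same helper would be repeated per column (address, name, etc.), collecting the extracted IDs for the storage table.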
- Deduping
k. Construct a unique identifier temp table
i. Last name, first name, calculated age (within a year tolerance), phone number (optional?), address
ii. Last name & first name matches, plus phone number or (matching age +/- 1 year and matching full address) → Is a match (maybe change this? Recall Austin's comments)
iii. E.g. last name & first name matches, but no matching phone number or age + full address → New record
l. Search temp table against existing database
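The dedup rule above can be sketched as a predicate over two candidate rows. Field names (`last_name`, `phone`, etc.) are illustrative; the real temp table may differ.

```python
def is_same_person(a, b):
    """Dedup rule sketch: last & first name must match, plus either a matching
    phone number or (age within +/- 1 year and matching full address)."""
    if a["last_name"].lower() != b["last_name"].lower():
        return False
    if a["first_name"].lower() != b["first_name"].lower():
        return False
    if a.get("phone") and a.get("phone") == b.get("phone"):
        return True  # name + phone match
    age_close = (
        a.get("age") is not None
        and b.get("age") is not None
        and abs(a["age"] - b["age"]) <= 1
    )
    return age_close and a.get("address") == b.get("address")

# Usage: same name, no phone on record b, but age within a year and same address
a = {"last_name": "Smith", "first_name": "Jane", "phone": "5551234567",
     "age": 40, "address": "1 Main St"}
b = {"last_name": "Smith", "first_name": "Jane", "phone": None,
     "age": 41, "address": "1 Main St"}
```

Per Austin's comments, the exact rule may change; the point is that it stays a single pure function that's easy to swap out.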
- Construct storage table
m. If a new record is found:
iv. Person ID (create)
v. Bio info (names, age, suffix, DOB, race, gender)
vi. Phone ID
vii. Address ID
viii. Voter registration ID
ix. Voter history ID
x. File upload ID
- Update records as needed (e.g. new vote history data)
n. Heuristics for constructing "canonical" records -- The most basic one is to assume more recent data is accurate
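The "more recent data wins" heuristic can be sketched as a field-by-field merge ordered by upload time. The `upload_time` key and row shape are assumptions for illustration.

```python
def canonical_record(rows):
    """Merge all rows for one person, assuming more recent data is accurate:
    later uploads overwrite earlier non-null values, field by field."""
    merged = {}
    for row in sorted(rows, key=lambda r: r["upload_time"]):
        for field, value in row.items():
            if value is not None:  # never overwrite data with a null
                merged[field] = value
    return merged

# Usage: a newer upload corrects the name but has no age; the older age survives
rows = [
    {"upload_time": 1, "first_name": "Jon", "age": 40},
    {"upload_time": 2, "first_name": "John", "age": None},
]
merged = canonical_record(rows)
```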
This is no longer step-by-step:
- Phones -- Keep vendor confidence score & vendor name
- Store metadata of where each field is from (e.g. age is from targetsmart_whatever.csv) -- but of course use the foreign key
https://docs.google.com/presentation/d/1a_t2s3yG5yTPeiui0Dir_lQzCYFY2Vw6xfIGdG_CVu0/edit#slide=id.p
Interface for our team to upload vendor lists from e.g. Catalist, L2, etc.
- User must specify source -- Can use pre-existing sources, or can make their own, which becomes part of the choices for other users
- User must specify the following about the campaign the list was for: organization and goal type.
- User can optionally add a SQL query of the data (i.e., Catalist or TS SQL query)
- Optional catch-all field for notes
- Users can see the status of their uploaded list -- i.e., has it been ETL'd yet and loaded into the main voter list database for querying?
Store information about every uploaded file in a database.
- Original filename
- S3 bucket/key of compressed data (zip)
- Success / running / failure status of ETL
- Identity of user who uploaded the vendor list
- Time of upload
- (maybe) Original date/time of file?
- Additional fields for more info about the voter list (for example, description, vendor SQL query, campaign it was uploaded for)
Stuff that the front-end should initially send to the API: numbers 1, 4, 5, 6, and 7 above
- Harmonize different fields of interest (e.g. first name, last name, age, race, gender, political affiliation, precinct, districts, ID, phone numbers, probably a lot more too...)
- Creation of new fields? (Like source, probably)
- Discern vendor source
- Column renames (Might just make new columns though)
- Harmonizing data across vendors (like race, gender, etc.)
- race: Asian, Black, Hispanic, Native American, Other, White, Unknown
- sex/gender: female, male, unknown
- party: Democratic, Republican, No Party, Unknown, Other (or maybe keeping all parties)
- voter registration: registered, unregistered
- voter status: active, inactive, unregistered
- (revisit) voting history: election type/year and yes/no
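The harmonization above amounts to per-vendor lookup tables into our canonical values. A minimal sketch, where the raw vendor codes shown are hypothetical (real vendor encodings vary and would need their own maps):

```python
# Hypothetical vendor codes -> our canonical values from the lists above
RACE_MAP = {"A": "Asian", "B": "Black", "H": "Hispanic", "N": "Native American",
            "O": "Other", "W": "White"}
GENDER_MAP = {"F": "female", "M": "male"}
PARTY_MAP = {"DEM": "Democratic", "REP": "Republican", "NPA": "No Party"}

def harmonize(raw, mapping, unknown="Unknown"):
    """Map a vendor's raw code to our canonical value, defaulting to unknown."""
    return mapping.get((raw or "").strip().upper(), unknown)

# Usage
race = harmonize("w", RACE_MAP)                       # "White"
gender = harmonize("", GENDER_MAP, unknown="unknown") # "unknown"
party = harmonize("dem", PARTY_MAP)
```

One map per vendor per field keeps the harmonization data-driven rather than hard-coded into the ETL logic.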
Updating records for someone who gets uploaded again (Will need to think about this issue more.)
Once data has been deduplicated, will need to ensure that there is a relationship stored between pre- and post- deduplicated rows. See ways to do this below under tables.
For 'standard' sources (like Catalist, L2, etc.) the ETL will be easier to achieve because of the consistency of fields present in those voter lists. However, for client-supplied data, we obviously cannot expect any consistency. ETL'ing those files will be more difficult.
Possible solutions:
- Completely automated approach -- use heuristics to try to find the appropriate columns (like age, race, etc.). Disadvantage is this will require complicated programming.
- User is forced to map certain columns from the client list to e.g. age, race, phone, address, etc -- All of the fields we want them to have. (And probably optional fields as well.) This can also include our own validation to ensure, for example, that what the user says is a phone field actually looks like a phone field. If it doesn't, we can flag the upload for review with optional notification to us on the tech team.
- Our consensus is to prefer the second option.
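The validation piece of the second option can be sketched as a heuristic check that a user-mapped column actually looks like what they claim. The function name and 80% threshold are assumptions; a failed check would flag the upload for review.

```python
import re

def looks_like_phone(values, threshold=0.8):
    """Return True if most non-empty values in the user-mapped phone column
    look like 10-digit US phone numbers (optionally with a leading 1)."""
    digits = [re.sub(r"\D", "", str(v)) for v in values if v]
    ok = sum(1 for d in digits
             if len(d) == 10 or (len(d) == 11 and d.startswith("1")))
    return bool(digits) and ok / len(digits) >= threshold

# Usage
good = looks_like_phone(["555-123-4567", "(555) 987-6543"])
bad = looks_like_phone(["Jane", "John", "Mary"])  # a name column mapped by mistake
```

Similar validators (age is numeric, state is a 2-letter code, etc.) would cover the other required fields.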
Ultimately, the ETL'd vendor lists should go into our Redshift database. Putting it in Redshift gives us the advantage of being able to query these data in conjunction with the Spoke data within the spokewarehouse schema. (It also of course has the advantage of using infrastructure that already exists.)
Schema name: voters
A table for raw data, like Catalist or L2.
Another table with normalized, ETL'd data.
A table mapping relationships between raw and normalized rows. This is a many-to-one relationship (many raw rows will correspond to a single deduplicated, ETL'd row).
- Break up uploaded raw table into smaller data chunks. Add each section individually
- Ex: take phone - look up phone on existing phone table. If phone number is already on phone table, extract phone_id
- If phone number is new, add phone number to phone table and extract assigned phone_id
- Repeat process for normalized address
- Do name last. Names are somewhat unreliable, particularly first names.
- Consider a record matched when last_name, birthdate, and phone number match. An address should be considered matched when concat(address_1, address_2) & (zip5 | city & state_code) match
- Large numbers of people use their middle name as their first name, so preferred_name should be the primary name used for matching
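The match rules above, sketched directly. The column names (`last_name`, `zip5`, `state_code`, etc.) come from the tables below; the function names are illustrative.

```python
def records_match(a, b):
    """Record rule: last_name, birthdate, and phone number all match."""
    return (
        a["last_name"].lower() == b["last_name"].lower()
        and a["birthdate"] == b["birthdate"]
        and a["phone"] == b["phone"]
    )

def addresses_match(a, b):
    """Address rule: concat(address_1, address_2) must match, plus either
    zip5 or (city & state_code)."""
    if (a["address_1"] + a["address_2"]).lower() != \
       (b["address_1"] + b["address_2"]).lower():
        return False
    return a["zip5"] == b["zip5"] or (
        a["city"].lower() == b["city"].lower()
        and a["state_code"] == b["state_code"]
    )

# Usage: zip differs (bad vendor data) but city & state_code still match
x = {"address_1": "1 Main St", "address_2": "", "zip5": "30303",
     "city": "Atlanta", "state_code": "GA"}
y = {"address_1": "1 MAIN ST", "address_2": "", "zip5": "00000",
     "city": "atlanta", "state_code": "GA"}
```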
Name: uploads
- upload_id: Int (unix timestamp)
- Source: SmallInt
- Organization: SmallInt
- Goal type: varchar
- acquisition_date: Date (optional)
- SQL query of data: Varchar
- Notes: Varchar
- Status: Varchar
- S3 url (raw data): Varchar
- source_id
- source_name
- organization_id
- organization_name
- upload_id
- upload_person_id
- preferred_name
- last_name
- middle_name
- first_name
- suffix [??]
- birthdate
- address_id
- phone_id
Name: address
- address_id (varchar)
- street_address_1 (varchar)
- street_address_2 (varchar)
- city (varchar)
- zip_5 (varchar 5)
- zip_4 (varchar 4)
- county (varchar)
- state_code (varchar 2)
- upload_id
- phone_upload_id
- phone_number
- source_id
- quality_score
- phone_upload_id
- phone_id
- phone_id
- phone_number (varchar)
- phone_id
- carrier_id
- carrier_id (1-6)
- carrier_name (varchar)
- person_id
- phone_id
- upload_person_id
- person_id
- preferred_name varchar
- first_name varchar
- middle_name varchar
- last_name varchar
- birthdate date
- age [use age + unix_date]
Date is preferred over a Unix timestamp because many data sources normalize birthdates to month/year and insert a dummy number for the day. Extracting month/year will provide stronger deduplication.
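The month/year point above suggests deduplicating on a truncated birthdate key rather than the full date. A minimal sketch:

```python
from datetime import date

def birthdate_key(d):
    """Dedup on (year, month) only, since many vendors insert a dummy day."""
    return (d.year, d.month)

# Two records whose vendors used different dummy days still collide on the key
k1 = birthdate_key(date(1980, 6, 1))
k2 = birthdate_key(date(1980, 6, 15))
```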
Name: person_phones
- ml_id
- phone_id
--
Name: voters
(Maybe change this)
Columns:
- ID: Integer (INT4) -- Should have unique constraint
- First name: Varchar
- Last name: Varchar
- Cell phone: Bigint
- Street address 1: Varchar
- Street address 2: Varchar
- City: Varchar
- County: Varchar
- State: Char (length of 2)
- House district: Varchar (Or maybe integer, if we can normalize)
- Senate district: Varchar (Same as above)
- Age: Smallint
- Date of birth: Bigint
- Gender: varchar
- Race: Varchar
- Party: Varchar
- Voter registered: Bool
- Voter active if registered: Bool
Name: history
Columns:
- voter_id: Maps to "ID" in the voters table
- year
- month
- day
- type: Type of election -- e.g. "primary", "general", "local" -- We should define an exhaustive list of these types
- voted: boolean
- method: Manner of voting (in-person, mail, etc.)
Name: spoke_id
Columns:
- id: Bigint (our own internal ID for the person)
- campaign_contact_id: Bigint
Name: vendor_id
Columns:
- id: Bigint (our own internal ID for the person)
- vendor_id: Varchar
If IDs can possibly be the same across vendors, will need another column with the vendor the ID is from.
-- SQL for creating database tables
CREATE TABLE voters (
id int,
first_name varchar,
middle_name varchar,
last_name varchar,
cell bigint,
age smallint,
gender varchar,
race varchar,
voter_status varchar,
vendor_id int,
registration_address_line_1 varchar,
registration_address_line_2 varchar,
registration_city varchar,
registration_state varchar,
registration_zipcode int,
mailing_street_address_1 varchar,
mailing_street_address_2 varchar,
mailing_city varchar,
mailing_state varchar,
mailing_zipcode int,
county varchar,
precinct varchar,
congressional_district varchar,
state_house_district varchar,
state_senate_district varchar,
party varchar,
date_of_birth date
);
CREATE TABLE spoke_id (
id int,
campaign_contact_id int
);
CREATE TABLE vendor_id (
id int,
vendor_id varchar
);
- There should be a table storing references to all of our uploaded vendor lists (which will be stored in S3)
- How to make these vendor lists queryable? NoSQL, AWS Athena (transform CSVs to Parquet)?
Identifying positive responses
- NLP to try to find most positive responses?
- Marking responses in Spoke itself? Or some other step in the campaign building process?
- Subcategories of positive responses
Various suggestions:
- Spit out cleaned data for Spoke upload?
- Stop uploading so many custom fields to Spoke? Just store them in this new database? (Our current consensus of what is best -- 12/3/20).
Type of data available: Various demographic and personal data across vendors (like birth day, phone number, name, address, etc.)
Spoke data: Wrong number tags, custom fields
Unresolved questions and additional ideas:
- Dealing with custom fields from the voter file: Should they 1) all be uploaded to Spoke? 2) A curated subset uploaded to Spoke? 3) Not uploaded to Spoke, but just sent to the data warehouse?
- How to label responses? Integrate with the reporting app? Labeling responses in Spoke itself?
- List cleaning -- Should the list upload program also clean the list for the user? (Almost certainly so, since this is easy to implement and seems to have no downsides.)
- In our Spoke data, we haven't labeled addresses. Heuristics based on the goal_type field:
Reg Address:
- Candidate ID
- Voter Edu
- Voter Reg
- Voter Persuasion
- GOTV
- VBM
- Take Action
- Candidate Recruit
- Fundraising (this one could go either way, but we normally get these lists from clients, so let’s assume reg address)
Mailing Address:
- Event Invite
- Vol Recruit
- Share Resources
- Which address to use as default if we only have one (like from a client)? (Probably should use registration.)
- Additional fields to keep: (discussed on 01-14-2021)
- cell phone confidence scores (assignment & connectivity)
- landline number
- city council district
- partisan score
- email address
- vote history (this one is annoying, but we probably should deal with this)
- Confirm a cell phone as existing if it worked in a Spoke campaign. If a first name was used, the number is also more likely to be correctly assigned.
- Use errors from the Spoke database to identify landlines and undeliverables.
Marking uploads as do not use.
How to handle wrong number tags?
Filter out wrong numbers and opt-outs from front-end for getting data from database.
Accept POST requests with JSON in the following format:
{
"headers": [
"some",
"list",
"of",
"CSV",
"headers"
],
"map_cols": {
"ID": {
"required": false
},
"First name": {
"required": false
},
"Last name": {
"required": false
},
"Cell phone": {
"required": true
},
"Street address 1": {
"required": false
},
"Street address 2": {
"required": false
},
"Combined address": {
"required": false
},
"City": {
"required": false
},
"County": {
"required": false
},
"State": {
"required": false
},
"House district": {
"required": false
},
"Senate district": {
"required": false
},
"Age": {
"required": false
},
"Date of birth": {
"required": false
},
"Gender": {
"required": false
},
"Race": {
"required": false
},
"Party": {
"required": false
},
"Voter registration": {
"required": false
},
"Voter activity (if registered)": {
"required": false
}
}
}
When the user clicks "Submit", it will then POST the results to an endpoint (TBD, something like localhost:5000/api/colmap).
The POSTED results should be in this format:
{
"ID": "some_col_from_the_csv",
"First name": "some_other_col"
}
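Server-side, the POSTed column map can be validated against the `headers` and `map_cols` from the earlier request. A sketch, with a hypothetical function name:

```python
def validate_colmap(colmap, headers, map_cols):
    """Check a POSTed column map: every required field must be mapped,
    and every mapped value must be a real CSV header."""
    errors = []
    for field, spec in map_cols.items():
        if spec.get("required") and field not in colmap:
            errors.append(f"missing required field: {field}")
    for field, col in colmap.items():
        if col not in headers:
            errors.append(f"unknown CSV column: {col}")
    return errors

# Usage: "Cell phone" is the only required field per the JSON above
headers = ["fname", "lname", "cell"]
map_cols = {"First name": {"required": False}, "Cell phone": {"required": True}}
ok = validate_colmap({"Cell phone": "cell"}, headers, map_cols)
errs = validate_colmap({"First name": "nickname"}, headers, map_cols)
```

An empty error list means the ETL can proceed; otherwise the response would echo the errors back to the front-end.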
From Austin:
Illustration of various potential data flows: https://docs.google.com/presentation/d/1IblvOsnBxiop_E1OnaZCw7yQmavsh-KgLLthXWgIWZw/edit#slide=id.p