Voter list management

Ingestion web interface

Interface for our team to upload vendor lists from sources such as Catalist, L2, etc.

User-facing features

  1. User must specify a source -- they can use a pre-existing source or create their own, which then becomes part of the choices available to other users

  2. User must specify the following about the campaign the list was for: Organization and goal type.

  3. User can optionally add the SQL query behind the data (e.g., a Catalist or TS SQL query)

  4. Optional catch-all field for notes

  5. Users can see status of their uploaded list -- i.e., has it been ETL'd yet and loaded into the main voter list database for querying?

Backend

Store information about every uploaded file in a database.

  1. Original filename
  2. S3 bucket/key of compressed data (zip)
  3. Success / running / failure status of ETL
  4. Identity of user who uploaded the vendor list
  5. Time of upload
  6. (maybe) Original date/time of file?
  7. Additional fields for more info about the voter list (for example, description, vendor SQL query, campaign it was uploaded for)

Fields the front-end should initially send to the API:

Items 1, 4, 5, 6, and 7 above

ETL

Data normalization

  • Harmonize different fields of interest (e.g. first name, last name, age, race, gender, political affiliation, precinct, districts, ID, phone numbers, probably a lot more too...)
  • Creation of new fields? (Like source, probably)

Data normalization tasks

  1. Discern vendor source
  2. Column renames (Might just make new columns though)
  3. Harmonizing data across vendors (like race, gender, etc.) -- see the sketch after this list
  • race: Asian, Black, Hispanic, Native American, Other, White, Unknown
  • sex/gender: female, male, unknown
  • party: Democratic, Republican, No Party, Unknown, Other (or maybe keeping all parties)
  • voter registration: registered, unregistered
  • voter status: active, inactive, unregistered
  • (revisit) voting history: election type/year and yes/no
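
A minimal sketch of the harmonization step for one field (party), assuming a hypothetical staging table raw_upload holding the vendor's original values in a vendor_party column; the target categories are the ones listed above:

-- Map assorted vendor party codes onto our harmonized categories.
-- raw_upload and vendor_party are hypothetical staging names.
SELECT
  upload_person_id,
  CASE
    WHEN upper(vendor_party) IN ('DEM', 'D', 'DEMOCRATIC') THEN 'Democratic'
    WHEN upper(vendor_party) IN ('REP', 'R', 'REPUBLICAN') THEN 'Republican'
    WHEN upper(vendor_party) IN ('NP', 'NPA', 'NO PARTY', 'UNAFFILIATED') THEN 'No Party'
    WHEN vendor_party IS NULL OR vendor_party = '' THEN 'Unknown'
    ELSE 'Other'
  END AS party
FROM raw_upload;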

Record Linkage

Updating records for someone who gets uploaded again (Will need to think about this issue more.)

Once data has been deduplicated, will need to ensure that there is a relationship stored between pre- and post- deduplicated rows. See ways to do this below under tables.

Application logic

For 'standard' sources (like Catalist, L2, etc.) the ETL will be easier to achieve because of the consistency of fields present in those voter lists. However, for client-supplied data, we obviously cannot expect any consistency. ETL'ing those files will be more difficult.

Possible solutions:

  • Completely automated approach -- use heuristics to try to find the appropriate columns (like age, race, etc.). Disadvantage is this will require complicated programming.
  • User is forced to map certain columns from the client list to e.g. age, race, phone, address, etc. -- all of the fields we want them to have (and probably optional fields as well). This can also include our own validation to ensure, for example, that what the user says is a phone field actually looks like a phone field. If it doesn't, we can flag the upload for review, with an optional notification to us on the tech team (see the sketch after this list).
  • Our consensus is to prefer the second option.
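
One rough way to do that validation in SQL, assuming the client file has been staged into a hypothetical client_upload_staging table and the user mapped a column named mapped_phone; rows that don't reduce to a 10-digit US number get counted so the upload can be flagged for review:

-- Count rows whose mapped phone column does not look like a 10-digit US number
-- after stripping non-digits; table and column names here are hypothetical.
SELECT count(*) AS suspect_phone_rows
FROM client_upload_staging
WHERE regexp_replace(mapped_phone, '[^0-9]', '') !~ '^1?[0-9]{10}$';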

Database

Ultimately, the ETL'd vendor lists should go into our Redshift database. Putting them in Redshift gives us the advantage of being able to query these data in conjunction with the Spoke data within the spokewarehouse schema. (It also of course has the advantage of using infrastructure that already exists.)

Schema name: voters

Tables

A table for raw data, like Catalist or L2.

Another table with normalized, ETL'd data.

A table mapping relationships between raw and normalized rows. This is a many-to-one relationship (many raw rows will correspond to a single deduplicated, ETL'd row).

Matching process

  • Break up the uploaded raw table into smaller data chunks. Add each section individually
  • Ex: take phone -- look up the phone number in the existing phone table. If the number is already in the phone table, extract its phone_id (see the SQL sketch after this list)
  • If the phone number is new, add it to the phone table and extract the assigned phone_id
  • Repeat the process for the normalized address
  • Do name last. Names are somewhat unreliable, particularly first names.
  • Consider a record matched when last_name, birthdate, and phone number match. An address should be considered matched when concat(address_1, address_2) and either zip5, or city and state_code, match
  • Large numbers of people use their middle name as their first name, so preferred_name should be the primary name used
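
A sketch of the phone step, using the phone_uploads, phone, and phone_uploads_phone_id tables described below, and assuming phone_id is an IDENTITY column so the database assigns it; the upload_id value is just a placeholder:

-- 1. Insert phone numbers we have never seen into the permanent phone table.
INSERT INTO phone (phone_number)
SELECT DISTINCT pu.phone_number
FROM phone_uploads pu
LEFT JOIN phone p ON p.phone_number = pu.phone_number
WHERE p.phone_id IS NULL
AND pu.upload_id = 1234;   -- placeholder for the upload being processed

-- 2. Bridge every uploaded row to its (new or existing) phone_id.
INSERT INTO phone_uploads_phone_id (phone_upload_id, phone_id)
SELECT pu.phone_upload_id, p.phone_id
FROM phone_uploads pu
JOIN phone p ON p.phone_number = pu.phone_number
WHERE pu.upload_id = 1234;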

Vendor list metadata table

Name: uploads

  • upload_id: Int (unix timestamp)
  • Source: SmallInt
  • Organization: SmallInt
  • Goal type: varchar
  • acquisition_date: Date (optional)
  • SQL query of data: Varchar
  • Notes: Varchar
  • Status: Varchar
  • S3 url (raw data): Varchar
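
Rough DDL for this table, in the same style as the "SQL for building tables" section further down; exact types and names are open to change:

CREATE TABLE uploads (
upload_id int,           -- unix timestamp of the upload
source smallint,
organization smallint,
goal_type varchar,
acquisition_date date,   -- optional
sql_query varchar,       -- vendor SQL query of the data, if supplied
notes varchar,
status varchar,          -- ETL success / running / failure
s3_url varchar           -- raw data location in S3
);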

Source Table

  • source_id
  • source_name


Organization Table

  • organization_id
  • organization_name

Temp Table for Matching

  • upload_id
  • upload_person_id
  • preferred_name
  • last_name
  • middle_name
  • first_name
  • suffix [??]
  • birthdate
  • address_id
  • phone_id

Address Table

Name: address

  • address_id (varchar)
  • street_address_1 (varchar)
  • street_address_2 (varchar)
  • city (varchar)
  • zip_5 (varchar 5)
  • zip_4 (varchar 4)
  • county (varchar)
  • state_code (varchar 2)

phone_uploads table (used for upload matching only)

  • upload_id
  • phone_upload_id
  • phone_number
  • source_id
  • quality_score

phone_uploads_phone_id table (bridges phone_uploads table to phone_id)

  • phone_upload_id
  • phone_id

Phone table (permanent table)

  • phone_id
  • phone_number (varchar)

Phone Carrier table

  • phone_id
  • carrier_id

Carrier table

  • carrier_id (1-6)
  • carrier_name (varchar)

person_phones Table

  • person_id
  • phone_id
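
The phone-related tables above, sketched as DDL in the same style as the section further down; column types are guesses, and phone_id is assumed to be database-assigned (an IDENTITY column in Redshift):

CREATE TABLE phone_uploads (
upload_id int,
phone_upload_id bigint,
phone_number varchar,
source_id smallint,
quality_score smallint
);

CREATE TABLE phone_uploads_phone_id (
phone_upload_id bigint,
phone_id bigint
);

CREATE TABLE phone (
phone_id bigint identity(1,1),
phone_number varchar
);

CREATE TABLE person_phones (
person_id bigint,
phone_id bigint
);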

Person Table

  • upload_person_id
  • person_id
  • preferred_name varchar
  • first_name varchar
  • middle_name varchar
  • last_name varchar
  • birthdate date
  • age [use age + unix_date]

A date type is preferred over a unix date because many data sources normalize birthdates to month/year and insert a dummy number for the day. Extracting the month/year will provide stronger deduplication.
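
For example, matching on the birth month and year rather than the full date sidesteps the dummy-day problem; a sketch joining the person table to the temp matching table above (temp_matching is a hypothetical name for it):

-- Match on birth month/year instead of the full date, since the day is often a dummy value.
SELECT p.person_id, t.upload_person_id
FROM person p
JOIN temp_matching t
ON p.last_name = t.last_name
AND EXTRACT(year FROM p.birthdate) = EXTRACT(year FROM t.birthdate)
AND EXTRACT(month FROM p.birthdate) = EXTRACT(month FROM t.birthdate);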

person_phones table

Name: person_phones

  • ml_id
  • phone_id

address_table

--

Normalized view schema

Name: voters (Maybe change this)

Columns:

  • ID: Integer (INT4) -- Should have unique constraint
  • First name: Varchar
  • Last name: Varchar
  • Cell phone: Bigint
  • Street address 1: Varchar
  • Street address 2: Varchar
  • City: Varchar
  • County: Varchar
  • State: Char (length of 2)
  • House district: Varchar (Or maybe integer, if we can normalize)
  • Senate district: Varchar (Same as above)
  • Age: Smallint
  • Date of birth: Bigint
  • Gender: varchar
  • Race: Varchar
  • Party: Varchar
  • Voter registered: Bool
  • Voter active if registered: Bool

Name: history

Columns:

  • voter_id: Maps with "ID" in voters table
  • year
  • month
  • day
  • type: Type of election -- e.g. "primary", "general", "local" -- We should define an exhaustive list of these types
  • voted: boolean
  • method: Manner of voting (in-person, mail, etc.)
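
The "SQL for building tables" section further down covers voters but not history; a possible DDL following the column list above:

CREATE TABLE history (
voter_id int,      -- maps to "ID" in the voters table
year smallint,
month smallint,
day smallint,
type varchar,      -- e.g. "primary", "general", "local"
voted boolean,
method varchar     -- in-person, mail, etc.
);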

ID mapping table

Name: spoke_id

Columns:

  • id: Bigint (our own internal ID for the person)
  • campaign_contact_id: Bigint

Name: vendor_id

Columns:

  • id: Bigint (our own internal ID for the person)
  • vendor_id: Varchar

If IDs can possibly be the same across vendors, will need another column with the vendor the ID is from.

SQL for building tables

-- SQL for creating database tables
CREATE TABLE voters (
id int,
first_name varchar,
middle_name varchar,
last_name varchar,
cell bigint,
age smallint,
gender varchar,
race varchar,
voter_status varchar,
vendor_id int,
registration_address_line_1 varchar,
registration_address_line_2 varchar,
registration_city varchar,
registration_state varchar,
registration_zipcode int,
mailing_street_address_1 varchar,
mailing_street_address_2 varchar,
mailing_city varchar,
mailing_state varchar,
mailing_zipcode int,
county varchar,
precinct varchar,
congressional_district varchar,
state_house_district varchar,
state_senate_district varchar,
party varchar,
date_of_birth date
);

CREATE TABLE spoke_id (
id int,
campaign_contact_id int
);

CREATE TABLE vendor_id (
id int,
vendor_id varchar
);

Storing vendor lists

  • There should be a table storing references to all of our uploaded vendor lists (which will be stored in S3)
  • How to make these vendor lists queryable? NoSQL? AWS Athena (transform CSVs to Parquet)? A rough Athena sketch follows below.
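
If we go the Athena route, one option is converting each uploaded CSV to Parquet in S3 and defining an external table over the resulting prefix; a rough sketch in Athena DDL (bucket, prefix, and columns are placeholders):

-- Athena external table over Parquet files in S3; names and columns are placeholders.
CREATE EXTERNAL TABLE vendor_list_raw (
original_row_id string,
first_name string,
last_name string,
phone string
)
STORED AS PARQUET
LOCATION 's3://our-bucket/vendor-lists/parquet/';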

Misc. notes

Identifying positive responses

  • NLP to try to find most positive responses?
  • Marking responses in Spoke itself? Or some other step in the campaign building process?
  • Subcategories of positive responses

Various suggestions:

  • Spit out cleaned data for Spoke upload?
  • Stop uploading so many custom fields to Spoke? Just store them in this new database? (Our current consensus of what is best -- 12/3/20).

Record linkage

Type of data available: Various demographic and personal data across vendors (like birth day, phone number, name, address, etc.)

Spoke data: Wrong number tags, custom fields

Notes (12/4/20)

Unresolved questions and additional ideas:

  1. Dealing with custom fields from voter file: Should they 1) All be uploaded to Spoke? 2) A curated subset uploaded to Spoke? 3) Not uploaded to Spoke but just sent to data warehouse?

  2. How to label responses? Integrate with reporting app? Labeling responses in Spoke itself?

  3. List cleaning -- Should the list upload program also clean the list for the user? (Almost certainly so, since this is easy to implement and seems to have no downsides.)

  4. In our Spoke data, we haven't labeled addresses. Heuristics based on goal_type field:

    Reg Address:

    • Candidate ID
    • Voter Edu
    • Voter Reg
    • Voter Persuasion
    • GOTV
    • VBM
    • Take Action
    • Candidate Recruit
    • Fundraising (this one could go either way, but we normally get these lists from clients, so let’s assume reg address)

    Mailing Address:

    • Event Invite
    • Vol Recruit
    • Share Resources
  5. Which address to use as default if we only have one (like from a client)? (Probably should use registration.)

  6. Additional fields to keep: (discussed on 01-14-2021)

    • cell phone confidence scores (assignment & connectivity)
    • landline number
    • city council district
    • partisan score
    • email address
    • vote history (this one is annoying, but we probably should deal with this)
  7. Confirm a cell phone as valid if it worked in a Spoke campaign. If the contact used their first name, the number is also more likely to be correctly assigned.

  8. Using errors from Spoke database to identify landlines, undeliverables.

NDA

Marking uploads as do not use.

Wrong number

How to handle wrong number tags?

Notes (1/6/21)

Filter out wrong numbers and opt-outs in the front-end when getting data from the database.

CSV Column Mapping

Accept POST requests with JSON in the following format:

{
  "headers": [
    "some",
    "list",
    "of",
    "CSV",
    "headers"
  ],
  "map_cols": {
    "ID": {
      "required": false
    },
    "First name": {
      "required": false
    },
    "Last name": {
      "required": false
    },
    "Cell phone": {
      "required": true
    },
    "Street address 1": {
      "required": false
    },
    "Street address 2": {
      "required": false
    },
    "Combined address": {
      "required": false
    },
    "City": {
      "required": false
    },
    "County": {
      "required": false
    },
    "State": {
      "required": false
    },
    "House district": {
      "required": false
    },
    "Senate district": {
      "required": false
    },
    "Age": {
      "required": false
    },
    "Date of birth": {
      "required": false
    },
    "Gender": {
      "required": false
    },
    "Race": {
      "required": false
    },
    "Party": {
      "required": false
    },
    "Voter registration": {
      "required": false
    },
    "Voter activity (if registered)": {
      "required": false
    }
  }
}

When the user clicks "Submit", it will then POST the results to an endpoint (TBD, something like localhost:5000/api/colmap)

The POSTed results should be in this format:

{
  "ID": "some_col_from_the_csv",
  "First name": "some_other_col"
}