Skip to content

Instantly share code, notes, and snippets.

@rajinwonderland
Created July 24, 2019 03:29
Show Gist options
  • Save rajinwonderland/5a5e56d8112eda9bd59e3136aabbb6a0 to your computer and use it in GitHub Desktop.
Save rajinwonderland/5a5e56d8112eda9bd59e3136aabbb6a0 to your computer and use it in GitHub Desktop.
Problem Solving Exercise – WIP

Problem Solving 101

Background

We have an uncleansed raw dataset containing EPA Water System Violations. These violations are associated with Water Districts or Water Systems who supply our everyday drinking water.

The Data

In a real world situation, you'd be responsible for cleansing and wrangling the dataset into a readable datamodel. However, for this exercise, you will be delivered the sampled set in a pre-cleansed SQL database environment

Our dataset contains a lot of data with the following schema:

Field name Type Mode Description
SystemID STRING NULLABLE
WaterSystemName STRING NULLABLE
WaterSystemType STRING NULLABLE
County STRING NULLABLE
City STRING NULLABLE
TotalViolations INTEGER NULLABLE
state_name STRING NULLABLE
State_Abb STRING NULLABLE

This data contains 141,411 rows of data that sizes up to around 12.8mb

The Challenge

Here's what the client is asking us

Thanks Raj. As discussed, can we see how many violations are followups to the same initial violation - can we dedupe using a violation ID of some sort?

With the following dataset, come up with a strategy for approaching the deduplication of this data

Step 1:

Gather an aggregated set of violations that are associated to a specific Water Syster or PWS_Name

Example Query:

 SELECT
  PWS_Name,
  COUNT(PWS_Name) AS violations
FROM
  `massive-pen-243023.water.violations`
GROUP BY
  PWS_Name
ORDER BY
  violations DESC

Result:

Row PWS_Name violations
1 PIERPOINT WATER 32560
2 OTOOLE WATER 22269
3 ORO VALLEY WATER UTILITY 11645
4 METROPOLITAN DWID 8453
5 SANTIAM MEADOWS COMMUNITY WS 7510
6 ARCTIC VILLAGE WATER SYSTEM 7432

Step 2:

From Step 1's results, we can use ARTIC VILLAGE WATER SYSTEM as our guinea pig.

Now we will run another query to retrieve violations which are associated to ARTIC VILLAGE WATER SYSTEMS

Example Query:

SELECT
  *
FROM
  water.violations
WHERE
  PWS_Name = "ARCTIC VILLAGE WATER SYSTEM"

The result set is pretty large, so we'll use a json response for brevity

Result:

[
  {
    "PWS_ID": "AK2300222",
    "PWS_Name": "ARCTIC VILLAGE WATER SYSTEM",
    "EPA_Region": "Region 10",
    "Primacy_Agency": "Alaska",
    "PWS_Type": "Community water system",
    "Primacy_Type": "State",
    "Primary_Source": "Surface water",
    "Activity_Status": "Active",
    "Deactivation_Date": "-",
    "Population_Served_Count": "175",
    "Rule_Name": "Long Term 1 Enhanced Surface Water Treatment Rule",
    "Violation_Code": "38",
    "Violation_Type": "Monitoring, Turbidity (Enhanced SWTR)",
    "Violation_Category_Code": "MR",
    "Is_Health_Based": "false",
    "Contaminant_Name": "Interim Enhanced Surfsce Water Treatment Rule",
    "Compliance_Period_Begin_Date": "2016-02-01T00:00:00",
    "Compliance_Period_End_Date": "29-FEB-2016",
    "Compliance_Status": "Returned to Compliance",
    "RTC_Date": "08-APR-2016",
    "Enforcement_Action_Type_Code": "SOX",
    "Enforcement_Action_Description": "State Compliance achieved",
    "Is_Major_Violation": "Y",
    "Severity_Indicator_Count": "-",
    "Public_Notification_Tier": "3",
    "Violation_First_Reported_Date": "2016-05-24T00:00:00"
  },
  {
    "PWS_ID": "AK2300222",
    "PWS_Name": "ARCTIC VILLAGE WATER SYSTEM",
    "EPA_Region": "Region 10",
    "Primacy_Agency": "Alaska",
    "PWS_Type": "Community water system",
    "Primacy_Type": "State",
    "Primary_Source": "Surface water",
    "Activity_Status": "Active",
    "Deactivation_Date": "-",
    "Population_Served_Count": "175",
    "Rule_Name": "Long Term 1 Enhanced Surface Water Treatment Rule",
    "Violation_Code": "38",
    "Violation_Type": "Monitoring, Turbidity (Enhanced SWTR)",
    "Violation_Category_Code": "MR",
    "Is_Health_Based": "false",
    "Contaminant_Name": "Interim Enhanced Surfsce Water Treatment Rule",
    "Compliance_Period_Begin_Date": "2017-01-01T00:00:00",
    "Compliance_Period_End_Date": "31-JAN-2017",
    "Compliance_Status": "Returned to Compliance",
    "RTC_Date": "21-APR-2017",
    "Enforcement_Action_Type_Code": "SIB",
    "Enforcement_Action_Description": "State Compliance Meeting conducted",
    "Is_Major_Violation": "Y",
    "Severity_Indicator_Count": "-",
    "Public_Notification_Tier": "3",
    "Violation_First_Reported_Date": "2017-05-08T00:00:00"
  }
  // ...7432 more rows
]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment