We have an uncleansed raw dataset containing EPA Water System Violations. These violations are associated with Water Districts or Water Systems who supply our everyday drinking water.
In a real world situation, you'd be responsible for cleansing and wrangling the dataset into a readable datamodel. However, for this exercise, you will be delivered the sampled set in a pre-cleansed SQL database environment
Our dataset contains a lot of data with the following schema:
Field name | Type | Mode | Description |
---|---|---|---|
SystemID | STRING | NULLABLE | |
WaterSystemName | STRING | NULLABLE | |
WaterSystemType | STRING | NULLABLE | |
County | STRING | NULLABLE | |
City | STRING | NULLABLE | |
TotalViolations | INTEGER | NULLABLE | |
state_name | STRING | NULLABLE | |
State_Abb | STRING | NULLABLE |
This data contains 141,411 rows of data that sizes up to around 12.8mb
Here's what the client is asking us
Thanks Raj. As discussed, can we see how many violations are followups to the same initial violation - can we dedupe using a violation ID of some sort?
With the following dataset, come up with a strategy for approaching the deduplication of this data
Gather an aggregated set of violations that are associated to a specific Water Syster or PWS_Name
Example Query:
SELECT
PWS_Name,
COUNT(PWS_Name) AS violations
FROM
`massive-pen-243023.water.violations`
GROUP BY
PWS_Name
ORDER BY
violations DESC
Result:
Row | PWS_Name | violations |
---|---|---|
1 | PIERPOINT WATER | 32560 |
2 | OTOOLE WATER | 22269 |
3 | ORO VALLEY WATER UTILITY | 11645 |
4 | METROPOLITAN DWID | 8453 |
5 | SANTIAM MEADOWS COMMUNITY WS | 7510 |
6 | ARCTIC VILLAGE WATER SYSTEM | 7432 |
From Step 1's results, we can use ARTIC VILLAGE WATER SYSTEM
as our guinea pig.
Now we will run another query to retrieve violations which are associated to ARTIC VILLAGE WATER SYSTEMS
Example Query:
SELECT
*
FROM
water.violations
WHERE
PWS_Name = "ARCTIC VILLAGE WATER SYSTEM"
The result set is pretty large, so we'll use a json response for brevity
Result:
[
{
"PWS_ID": "AK2300222",
"PWS_Name": "ARCTIC VILLAGE WATER SYSTEM",
"EPA_Region": "Region 10",
"Primacy_Agency": "Alaska",
"PWS_Type": "Community water system",
"Primacy_Type": "State",
"Primary_Source": "Surface water",
"Activity_Status": "Active",
"Deactivation_Date": "-",
"Population_Served_Count": "175",
"Rule_Name": "Long Term 1 Enhanced Surface Water Treatment Rule",
"Violation_Code": "38",
"Violation_Type": "Monitoring, Turbidity (Enhanced SWTR)",
"Violation_Category_Code": "MR",
"Is_Health_Based": "false",
"Contaminant_Name": "Interim Enhanced Surfsce Water Treatment Rule",
"Compliance_Period_Begin_Date": "2016-02-01T00:00:00",
"Compliance_Period_End_Date": "29-FEB-2016",
"Compliance_Status": "Returned to Compliance",
"RTC_Date": "08-APR-2016",
"Enforcement_Action_Type_Code": "SOX",
"Enforcement_Action_Description": "State Compliance achieved",
"Is_Major_Violation": "Y",
"Severity_Indicator_Count": "-",
"Public_Notification_Tier": "3",
"Violation_First_Reported_Date": "2016-05-24T00:00:00"
},
{
"PWS_ID": "AK2300222",
"PWS_Name": "ARCTIC VILLAGE WATER SYSTEM",
"EPA_Region": "Region 10",
"Primacy_Agency": "Alaska",
"PWS_Type": "Community water system",
"Primacy_Type": "State",
"Primary_Source": "Surface water",
"Activity_Status": "Active",
"Deactivation_Date": "-",
"Population_Served_Count": "175",
"Rule_Name": "Long Term 1 Enhanced Surface Water Treatment Rule",
"Violation_Code": "38",
"Violation_Type": "Monitoring, Turbidity (Enhanced SWTR)",
"Violation_Category_Code": "MR",
"Is_Health_Based": "false",
"Contaminant_Name": "Interim Enhanced Surfsce Water Treatment Rule",
"Compliance_Period_Begin_Date": "2017-01-01T00:00:00",
"Compliance_Period_End_Date": "31-JAN-2017",
"Compliance_Status": "Returned to Compliance",
"RTC_Date": "21-APR-2017",
"Enforcement_Action_Type_Code": "SIB",
"Enforcement_Action_Description": "State Compliance Meeting conducted",
"Is_Major_Violation": "Y",
"Severity_Indicator_Count": "-",
"Public_Notification_Tier": "3",
"Violation_First_Reported_Date": "2017-05-08T00:00:00"
}
// ...7432 more rows
]