@RocketD0g
Last active September 29, 2020 23:52
Description of Methods for the 2019 OpenAQ PM2.5 City Scan

How annual AVG PM2.5 values were calculated for the OpenAQ 2019 PM2.5 City Scan

Important: The PM2.5 annual average calculations in the OpenAQ 2019 City Scan were performed only with data accessed on OpenAQ. This may not represent the full set of relevant data available in the city, or even from the sources sharing data on OpenAQ. There is also considerable variability in the percentage of data coverage over the year for many cities. Therefore, these annual averages should not necessarily be taken as representative of a city or as 'official' results from an underlying source.

General

  • City-level averages were generated by averaging data from 1 Jan to 31 Dec 2019.
  • Averages were computed per station and then averaged across all stations in a city.
  • The % of time data reported to OpenAQ was calculated by dividing the number of data points reported by the total possible for the year (e.g. for hourly data: 24 * 365).
  • A sample of the SQL code used in Athena is below.
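As a sketch of the aggregation described above, the station-then-city averaging and the percent-reporting calculation could look like the following (the station names and values here are hypothetical, for illustration only):

```python
# Sketch: average each station over the year, then average stations
# to get the city value. Station data below is made up.
from statistics import mean

HOURS_PER_YEAR = 24 * 365  # possible hourly data points in a year

# station -> hourly PM2.5 values reported to OpenAQ (µg/m³), hypothetical
stations = {
    "Station A": [10.0, 12.0, 11.0],
    "Station B": [20.0, 22.0],
}

# per-station annual average, then city average across stations
station_averages = {name: mean(vals) for name, vals in stations.items()}
city_average = mean(station_averages.values())

# % of the year each station reported, for hourly data
percent_reporting = {
    name: round(len(vals) / HOURS_PER_YEAR * 100, 2)
    for name, vals in stations.items()
}
```

Averaging per station first (rather than pooling all raw values) keeps a frequently reporting station from dominating the city average.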

Cleaning

The 'cleaning' techniques used were those deemed universally or near-universally applicable to PM2.5 data collected by a variety of methods, using different instrumentation, in countries across the globe. The following was done:

  • All values below 0 were excluded.
  • Several values typically flagged by suppliers were excluded. (See SQL code below).
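The same two cleaning steps can be sketched in Python (the flag values mirror the WHERE clauses in the SQL below; the function itself is just an illustration):

```python
# Values that suppliers typically report as flags rather than measurements,
# matching the exclusions in the SQL code below.
SUPPLIER_FLAGS = {985, 999, 1985, 915, 515}

def clean(values):
    """Keep only PM2.5 readings that are non-negative and not flag values."""
    return [v for v in values if v >= 0 and v not in SUPPLIER_FLAGS]
```

For example, `clean([12.0, -999, 985, 7.5])` keeps only `[12.0, 7.5]`.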

Peculiarities of averaging data accessed on OpenAQ

  • Many of the US Embassy locations show up (almost) twice in OpenAQ: once in the AirNow source data and again through StateAir. There are commented-out lines in the SQL code below that allow selecting a specific station, to retrieve only one set of the data. Otherwise, the average is skewed (since the data does not show up exactly twice) and the percent reporting is incorrect (e.g. percent reporting values will exceed 100%).
  • City and location (aka station) names for data from the Chinese government source appear in Mandarin and do not show up correctly when viewed in a downloadable CSV file from Athena. The cumbersome workaround I used was to look at a list of Mandarin<-->English names for Chinese cities and copy and paste the Mandarin name for the city of interest into the SQL code below (see the line that specifies a city).
  • The SQL code below lets the user specify hourly or 15-minute data (or any other raw time-average of the data, though these are the two most common ingested into the OpenAQ Platform). There is likely a more automated way to handle this, but I combined the hourly and 15-minute-averaged station data by hand for particular cities, like those in India where there is a mix of both hourly and fifteen-minute data.
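For cities with a mix of cadences, the percent-reporting denominator depends on each station's raw averaging period. A minimal sketch of that lookup (the cadence labels here are assumptions, not OpenAQ field values):

```python
HOURS_PER_YEAR = 24 * 365

# Expected number of data points per year by raw averaging period,
# matching the two percent_reporting variants in the SQL code below.
EXPECTED_POINTS = {
    "hourly": HOURS_PER_YEAR,        # 8,760
    "15min": 4 * HOURS_PER_YEAR,     # 35,040
}

def percent_reporting(num_measurements, cadence):
    """Percent of the year a station reported, given its raw cadence."""
    return round(num_measurements / EXPECTED_POINTS[cadence] * 100)
```

Using the hourly denominator for a 15-minute station would overstate its reporting by a factor of four, which is why the cadence must be chosen per station.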

Sample SQL code used in Athena:

SELECT *
FROM
(
SELECT country,
array_agg(distinct(city)) AS city,
array_agg(distinct(location)) AS locations,
date_format(date_trunc('year', from_iso8601_timestamp(date.local)), '%Y') as year,
avg(value) AS average,
round(((count(value) / (24.0 * 365.0)) * 100.0)) as percent_reporting, -- calculates % reporting for hourly-averaged data
-- round(((count(value) / (4 * 24.0 * 365.0)) * 100.0)) as percent_reporting, -- calculates % reporting for 15-minute-averaged data
count(value) AS num_measurements,
round(coordinates.longitude,5) AS longitude,
round(coordinates.latitude, 5) AS latitude,
array_agg(distinct(cast(attribution as JSON))) as source
FROM fetches_realtime
WHERE parameter = 'pm25'
AND city = '广州市'
-- AND country = 'IN'
-- AND location = 'Colaba'
-- AND location LIKE 'Colaba%'
AND unit = 'µg/m³'
AND date.local >= '2019-01-01'
AND date.local < '2020-01-01' -- string comparison on ISO timestamps; an upper bound of '2019-12-31' would drop most of 31 Dec
AND value >= 0 -- Other suggestion: remove data below 4 µg/m³, a typical detection limit of PM monitors
AND value <> 985 -- Removes a supplier-flagged value
AND value <> 999 -- Removes a supplier-flagged value
AND value <> 1985 -- Removes a supplier-flagged value
AND value <> 915 -- Removes a supplier-flagged value
AND value <> 515 -- Removes a supplier-flagged value
AND averagingperiod.value = 1 -- In hours
AND coordinates.longitude BETWEEN -180 AND 180
AND coordinates.latitude BETWEEN -90 AND 90
AND coordinates.latitude <> 0.0
AND coordinates.longitude <> 0.0
GROUP BY round(coordinates.longitude,5), round(coordinates.latitude,5), country, date_format(date_trunc('year', from_iso8601_timestamp(date.local)), '%Y')
ORDER BY country, city, year
)

WHERE
percent_reporting >= 0 -- Can be adjusted to put in a station-level % of time reporting threshold