How annual average PM2.5 values were calculated for the OpenAQ 2019 PM2.5 City Scan
Important: The PM2.5 annual average calculations in the OpenAQ 2019 City Scan were performed only with data accessed via OpenAQ. This may not represent the full set of relevant data available in a city, or even from the sources sharing data on OpenAQ. There is also considerable variability in the percentage of data coverage over the year for many cities. Therefore, these annual averages should not necessarily be taken as representative of a city or as 'official' results from an underlying source.
- City-level averages were generated by averaging data from 1 Jan to 31 Dec 2019.
- Averages were created by station and then averaged across all stations in a city.
- The percent of time data was reported to OpenAQ was calculated by dividing the number of data points reported by the total possible for the year (e.g. for hourly data: 24 * 365).
- A sample of the SQL code used in Athena is below.
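The averaging and percent-reporting steps above can be sketched as follows. This is a minimal illustration in Python; the station names and readings are made-up examples, not real OpenAQ measurements.

```python
# Sketch of the steps above: average each station over the year, then
# average the station means to get the city value. Percent reporting is
# the number of reported points divided by the total possible for the year.

# Hypothetical hourly PM2.5 readings per station (µg/m³) -- illustrative only
station_values = {
    "Station A": [10.0, 12.0, 11.0],  # 3 of 8760 possible hourly points
    "Station B": [20.0, 22.0],        # 2 of 8760 possible hourly points
}

POSSIBLE_HOURLY = 24 * 365  # total possible hourly data points in 2019
# For 15-minute data the divisor would be 4 * 24 * 365 instead.

station_means = {s: sum(v) / len(v) for s, v in station_values.items()}
city_average = sum(station_means.values()) / len(station_means)

percent_reporting = {
    s: 100.0 * len(v) / POSSIBLE_HOURLY for s, v in station_values.items()
}

print(station_means)  # {'Station A': 11.0, 'Station B': 21.0}
print(city_average)   # 16.0
```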
The 'cleaning' techniques used were those deemed universally or near-universally applicable to PM2.5 data collected by a variety of methods and instruments in countries across the globe. The following was done:
- All values below 0 were excluded.
- Several values typically flagged by suppliers were excluded. (See SQL code below).
- Many of the US Embassy locations show up (almost) twice in OpenAQ: once in the AirNow source data and again through StateAir. There are commented-out lines in the SQL code below that allow selection of a specific station to retrieve only one set of the data. Otherwise, the average is skewed (since the data does not show up exactly twice) and the percent reporting is incorrect (e.g. percent reporting values will exceed 100%).
- City and location (aka station) names for data from the Chinese government source appear in Mandarin and do not display correctly when viewed in a CSV file downloaded from Athena. The cumbersome workaround I used was to consult a list of Mandarin<-->English names for Chinese cities, then copy and paste the Mandarin name for the city of interest into the SQL code below (see the city filter in the WHERE clause).
- The SQL code below lets the user specify hourly or 15-minute data (or any other raw time-average, though these are the two most common ingested into the OpenAQ platform). A savvier user could automate this, but I went by hand and combined the hourly and 15-minute averaged station data for particular cities, such as those in India where there is a mix of both.
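The cleaning rules above (drop negatives, drop supplier-flagged sentinel values, drop invalid or zero coordinates) can also be applied outside of SQL. A minimal Python sketch mirroring the WHERE-clause filters in the query below, with made-up records:

```python
# Mirrors the cleaning filters in the SQL below: values must be non-negative,
# must not equal a known supplier-flagged sentinel, and coordinates must be
# in range and non-zero. The records here are made-up examples.

FLAGGED_VALUES = {985, 999, 1985, 915, 515}  # supplier-flagged sentinels

def is_clean(record):
    value = record["value"]
    lon, lat = record["longitude"], record["latitude"]
    return (
        value >= 0
        and value not in FLAGGED_VALUES
        and -180 <= lon <= 180
        and -90 <= lat <= 90
        and lon != 0.0
        and lat != 0.0
    )

records = [
    {"value": 12.5, "longitude": 113.26, "latitude": 23.13},  # kept
    {"value": -1.0, "longitude": 113.26, "latitude": 23.13},  # negative
    {"value": 985,  "longitude": 113.26, "latitude": 23.13},  # flagged
    {"value": 8.0,  "longitude": 0.0,    "latitude": 0.0},    # bad coordinates
]

clean = [r for r in records if is_clean(r)]
print(len(clean))  # 1
```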
SELECT *
FROM
(
SELECT country,
array_agg(distinct(city)) AS city,
array_agg(distinct(location)) AS locations,
date_format(date_trunc('year', from_iso8601_timestamp(date.local)), '%Y') as year,
avg(value) AS average,
round(((count(value) / (24.0 * 365.0)) * 100.0)) as percent_reporting, -- calculates % reporting for hourly-averaged data
-- round(((count(value) / (4*24.0 * 365.0)) * 100.0)) as percent_reporting, -- calculates % reporting for 15 min -averaged data
count(value) AS num_measurements,
round(coordinates.longitude,5) AS longitude,
round(coordinates.latitude, 5) AS latitude,
array_agg(distinct(cast(attribution as JSON))) as source
FROM fetches_realtime
WHERE parameter = 'pm25'
AND city = '广州市' -- Guangzhou; city names from the Chinese government source appear in Mandarin
-- AND country = 'IN'
-- AND location = 'Colaba'
-- AND location LIKE 'Colaba%'
AND unit = 'µg/m³'
AND date.local
BETWEEN '2019-01-01'
AND '2020-01-01' -- date.local includes a time component, so an upper bound of '2019-12-31' would drop 31 Dec data
AND value >= 0 -- Excludes negative values; another option is to remove data below ~4 µg/m³, a typical detection limit of PM monitors
AND value <> 985 -- Removes supplier-flagged value
AND value <> 999 -- Removes supplier-flagged value
AND value <> 1985 -- Removes supplier-flagged value
AND value <> 915 -- Removes supplier-flagged value
AND value <> 515 -- Removes supplier-flagged value
AND averagingperiod.value = 1 -- In hours
AND coordinates.longitude BETWEEN -180 AND 180
AND coordinates.latitude BETWEEN -90 AND 90
AND coordinates.latitude <> 0.0
AND coordinates.longitude <> 0.0
GROUP BY round(coordinates.longitude,5), round(coordinates.latitude,5), country, date_format(date_trunc('year', from_iso8601_timestamp(date.local)), '%Y')
ORDER BY country, city, year
)
WHERE
percent_reporting >= 0 -- Can be adjusted to put in a station-level % of time reporting threshold