jflasher/details.md

Currently, only data within the last 90 days is available via the OpenAQ API. However, there is much more data available on OpenAQ and a variety of different access mechanisms. Note also that there is work under way to bring back to the API a mechanism to access the data older than 90 days, details here.

If you're looking to query across all the data or even easily export the data (or a subset of it), the easiest way to do that currently is using a service like Amazon Athena. I'll provide some directions on how to do that below, but at a high level, this will let you make any query of the entire dataset that you'd like (written in SQL). I'll also provide some sample queries so you can see what's possible.

On to the directions!

You will need to create an AWS account if you don't currently have one, you can start this process at https://aws.amazon.com/free/. Note that while an AWS account does not cost any money, running a query using Athena will. At currernt OpenAQ volume sizes, querying the whole dataset costs ~$.01.
Log in to your AWS account and go to the Athena console at https://console.aws.amazon.com/athena/home?region=us-east-1. Please make sure you're in the US East (N. Virginia) region as this will make your queries faster and save OpenAQ money on data transfer. You can confirm this by looking at the region name in the top right of the console.
There is a lot more documentation on Athena here, but you should now be looking at the main console where you can enter any query. But first, you'll need to create a table on top of the OpenAQ data. To do this, you can copy and paste the table definition in the openaq.ddl file that's also in the gist. Copy and paste that and click Run Query. In the Results pane at the bottom of the window, you should see Query Successful.
That's it, now you're ready to make any query you'd like! I'll provide some examples below to get you started.

Example queries

A simple query to just get all the data for a particular location (or city or country)

SELECT *
FROM openaq
WHERE location='US Diplomatic Post: Hanoi'

The average of all pm25 measurements for Baruun 4 zam location in Mongolia

SELECT avg(value) AS "avg_pm25"
FROM openaq
WHERE parameter='pm25'
        AND location='Baruun 4 zam';

The number of measurements over 35 µg/m³ in India

SELECT count(*) AS "num_over"
FROM openaq
WHERE parameter='pm25'
        AND value > 35
        AND country = 'IN'

Top 10 highest pm25 measurements sorted by highest and then date

SELECT *
FROM openaq
WHERE parameter='pm25'
ORDER BY  value desc, date.utc DESC limit 10

Athena also has geospatial support which means you can do things like ask for all the measurements around a certain latitude and longitude

SELECT *
FROM openaq
WHERE ST_WITHIN(ST_POINT(coordinates.longitude, coordinates.latitude), ST_BUFFER(ST_POINT(4.773,52.428), 0.1));

Notes

The great thing about this mechanism is that you don't need to duplicate any data, you simply query using Athena and it runs on the latest data that OpenAQ maintains.
Athena has the full power of SQL behind it, so you have a ton of power to write any query you want, you're not limited by an API that OpenAQ creates. More details on SQL in general here.
If you're looking to use the data from these queries outside of visual inspection in Athena, you can click the 'Download CSV' button on the top of the Results pane. The CSV results will also be stored in an Amazon S3 bucket on your account which you can programatically access if you'd like.
Note that the underlying data you pointed to may change its structure as we figure out the most performant way to make the data avaible. We'll announce any changes and if that does happen, you'd just need to copy/paste another openaq.ddl type file that'd update your table.

RocketD0g · 2020-08-11T17:19:32Z

An update: Data are now accessible from the OpenAQ API going back >2 years instead of 90 days.

adjanip · 2020-10-05T08:04:38Z

@jflasher I am trying to combine the advice you gave to TomasDob and scand363, but I am not getting the right thing.
I would like to get the average daily value (for all 7 variables) for all individual monitor locations in the Netherlands between January 16, 2020 and May 16, 2020.

I ran:

SELECT date_format(date_trunc('day', from_iso8601_timestamp(date.local)), '%Y-%m-%d') as day, parameter, avg(value) as average, count(value) as num_measurements
FROM openaq
WHERE country = 'NL' and date.local between '2020-01-16' and '2020-05-17'
GROUP BY date_trunc('day', from_iso8601_timestamp(date.local)), parameter
ORDER BY day asc, parameter

Which gives me correctly

all 7 variables
the correct dates

but incorrectly

only one value for each variable (I assume the national average of that variable for the whole country)
All the daily values are negative for all of the dates

maelle · 2020-10-12T09:22:58Z

@RocketD0g @jflasher where is the new data availability documented btw? https://docs.openaq.org/#api-Measurements-GetV1Measurements still mention "Currently the API only supports a rolling 90 day archive of the data."? I'd like to update ropenaq (a bit late, sorry).

RocketD0g · 2020-10-12T10:25:20Z

Hey @maelle! I've made a PR for the changes, but I don't they haven't been added in yet. Here's some documentation though: https://medium.com/@openaq/openaq-extends-api-and-download-tool-access-from-90-days-to-2-years-3697540c85a3

maelle · 2020-10-12T10:32:53Z

awesome, thank you!

jflasher · 2020-10-12T13:52:02Z

@adjanip I believe the below will get you what you're looking for. I am filtering out values less than 0 (often error codes) which can erroneously affect the results. I am also grouping by location.

SELECT date_format(date_trunc('day', from_iso8601_timestamp(date.local)), '%Y-%m-%d') as day, location, parameter, avg(value) as average, count(value) as num_measurements
FROM openaq
WHERE country = 'NL' and date.local between '2020-01-16' and '2020-05-17' and value >= 0
GROUP BY date_trunc('day', from_iso8601_timestamp(date.local)), parameter, location
ORDER BY day asc, parameter

Ester452 · 2020-10-13T08:07:29Z

Hi! I get this error attached below when copying and pasting the openaq.ddl.

I need for my master thesis the daily emission of the main air pollutant (pm25,pm20,NO2,CO) of the last 10 or 15 years for some italian's regions. I completely new to coding and I really have no idea how to formulate that! If anyone could help I'll be really really glad.
Thak you!

jflasher · 2020-10-15T15:37:33Z

Hi @Ester452, it looks like you need to give your output bucket a more unique name so that it can be created. You may want to ask any questions over on the OpenAQ Slack channels (https://openaq-slackin.herokuapp.com/), this gist is not seen by most of the community.

sruti · 2020-10-16T13:54:58Z

@Ester452 that slack link doesn't work anymore, use this one instead: https://join.slack.com/t/openaq/shared_invite/zt-gq14aew7-CVdp131g7TR7o9iiXIVDLw

alfredocambera · 2020-12-09T17:23:43Z

You can also use AWS Glue with the S3 bucket and it will create the DDL for the table.

MaxPowerWasTaken · 2021-04-19T06:48:24Z

Hi Jeff,

I landed here from the recommendation at this blog post to use your ddl file to query OpenAQ data in Athena.

However, I notice this ddl query ends with:

TBLPROPERTIES (
  'transient_lastDdlTime'='1518373755')

...which is about three years ago, so I'm guessing I don't want to use that unchanged.

Would it be better to delete that TBLPROPERTIES statement altogether, or to keep it and sub in some more recent timestamp?

Thanks!

jflasher · 2021-04-19T23:11:09Z

Hi @MaxPowerWasTaken you should be good to delete it or leave as is. It's just internal metadata, do not believe it matters in this case.

MaxPowerWasTaken · 2021-04-21T14:08:06Z

Hi thanks @jflasher.

When I ran this ddl query (without the TBLPROPERTIES statement) I got the following error:
FAILED: ParseException line 5:0 cannot recognize input near 'ROW' 'FORMAT' 'SERDE'

I am using the new Athena engine v2, so maybe that has something to do with it. In any case, here's how I got around that error, in case it's helpful to anyone else:

I cataloged the s3 data by using a glue crawler to register it as a hive table, just as the ddl statement would. Once glue cataloged the table, I was also able to use Athena to show the ddl statement that would create the table as cataloged by glue.

Running (in Athena) SHOW CREATE TABLE openaq_gz_realtime_gzipped; yields:

CREATE EXTERNAL TABLE `openaq_gz_realtime_gzipped`(
  `date` struct<utc:string,local:string> COMMENT 'from deserializer', 
  `parameter` string COMMENT 'from deserializer', 
  `location` string COMMENT 'from deserializer', 
  `value` double COMMENT 'from deserializer', 
  `unit` string COMMENT 'from deserializer', 
  `city` string COMMENT 'from deserializer', 
  `attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer', 
  `averagingperiod` struct<value:double,unit:string> COMMENT 'from deserializer', 
  `coordinates` struct<latitude:double,longitude:double> COMMENT 'from deserializer', 
  `country` string COMMENT 'from deserializer', 
  `sourcename` string COMMENT 'from deserializer', 
  `sourcetype` string COMMENT 'from deserializer', 
  `mobile` boolean COMMENT 'from deserializer')
PARTITIONED BY ( 
  `partition_0` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='attribution,averagingPeriod,city,coordinates,country,date,location,mobile,parameter,sourceName,sourceType,unit,value') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://openaq-fetches/realtime-gzipped/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='OpenAQ-realtime-gz-crawler', 
  'averageRecordSize'='832', 
  'classification'='json', 
  'compressionType'='gzip', 
  'objectCount'='150401', 
  'recordCount'='26197990', 
  'sizeKey'='20582896694', 
  'typeOfData'='file')

ShamamaAfnan · 2021-11-01T18:50:51Z

Hi, openaq ddl ran successfully. I got this error while running the example queries:

cszc · 2021-11-04T14:05:20Z

Hi All. There were two things that I had trouble figuring out, so thought I'd share here in case it's helpful to anyone.

First, I wanted to be able to partition the table to speed up queries and limit gbs scanned. That wasn't super straightforward because the partitions aren't in a hive compatible format, i.e. day=yyyy-MM-dd. Instead, I had to use a partition projection with a custom partition template. You can see the projection definition below in the TBLPROPERTIES. This dramatically improved performance.
Secondly, I kept running into JSON errors similar to this one. By adding with serdeproperties ( 'ignore.malformed.json'='true' ), I was able to ignore the errant json so that my query would finish. The final ddl that I used is below. edit: @ShamamaAfnan Looks like the same problem you ran into

CREATE EXTERNAL TABLE IF NOT EXISTS `openaq` (
  `date` struct<utc:string,local:string> COMMENT 'from deserializer', 
  `parameter` string COMMENT 'from deserializer', 
  `location` string COMMENT 'from deserializer', 
  `value` float COMMENT 'from deserializer', 
  `unit` string COMMENT 'from deserializer', 
  `city` string COMMENT 'from deserializer', 
  `attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer', 
  `averagingperiod` struct<unit:string,value:float> COMMENT 'from deserializer', 
  `coordinates` struct<latitude:float,longitude:float> COMMENT 'from deserializer', 
  `country` string COMMENT 'from deserializer', 
  `sourcename` string COMMENT 'from deserializer', 
  `sourcetype` string COMMENT 'from deserializer', 
  `mobile` string COMMENT 'from deserializer')
PARTITIONED BY (day string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
with serdeproperties ( 'ignore.malformed.json'='true' )
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://openaq-fetches/realtime-gzipped'
TBLPROPERTIES (
'transient_lastDdlTime'='1518373755',
'projection.enabled' = 'true',
'projection.day.type' = 'date',
'projection.day.range' = '2017-01-01,2021-11-01',
'projection.day.format' = 'yyyy-MM-dd',
'storage.location.template' = 's3://openaq-fetches/realtime-gzipped/${day}/'
)

kirazcloud · 2022-07-13T12:33:05Z

Hi,
If you get any error row json object invalid
Row is not a valid JSON Object - JSONException: Unterminated string at 201 [character 202 line 1]
Try to add this one 'ignore.malformed.json'='true'
https://stackoverflow.com/questions/71102999/ignore-malformated-json-from-aws-athena-query

jflasher · 2022-07-13T12:50:57Z

cc/ @russbiggs so he knows this exists.

russbiggs · 2023-03-23T20:29:05Z

OpenAQ will be deprecating this bucket and access method in the near future. An updated bucket and explanation of how to access the data archive via AWS Athena and the AWS CLI is available at docs.openaq.org/docs/accessing-openaq-archive-data

jflasher/details.md

Example queries

Notes

RocketD0g commented Aug 11, 2020

adjanip commented Oct 5, 2020 •

edited

Loading

maelle commented Oct 12, 2020

RocketD0g commented Oct 12, 2020

maelle commented Oct 12, 2020

jflasher commented Oct 12, 2020

Ester452 commented Oct 13, 2020

jflasher commented Oct 15, 2020

sruti commented Oct 16, 2020

alfredocambera commented Dec 9, 2020

MaxPowerWasTaken commented Apr 19, 2021 •

edited

Loading

jflasher commented Apr 19, 2021

MaxPowerWasTaken commented Apr 21, 2021

ShamamaAfnan commented Nov 1, 2021

cszc commented Nov 4, 2021

kirazcloud commented Jul 13, 2022

jflasher commented Jul 13, 2022

russbiggs commented Mar 23, 2023

	CREATE EXTERNAL TABLE `openaq`(
	`date` struct<utc:string,local:string> COMMENT 'from deserializer',
	`parameter` string COMMENT 'from deserializer',
	`location` string COMMENT 'from deserializer',
	`value` float COMMENT 'from deserializer',
	`unit` string COMMENT 'from deserializer',
	`city` string COMMENT 'from deserializer',
	`attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer',
	`averagingperiod` struct<unit:string,value:float> COMMENT 'from deserializer',
	`coordinates` struct<latitude:float,longitude:float> COMMENT 'from deserializer',
	`country` string COMMENT 'from deserializer',
	`sourcename` string COMMENT 'from deserializer',
	`sourcetype` string COMMENT 'from deserializer',
	`mobile` string COMMENT 'from deserializer')
	ROW FORMAT SERDE
	'org.openx.data.jsonserde.JsonSerDe'
	STORED AS INPUTFORMAT
	'org.apache.hadoop.mapred.TextInputFormat'
	OUTPUTFORMAT
	'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
	LOCATION
	's3://openaq-fetches/realtime-gzipped'
	TBLPROPERTIES (
	'transient_lastDdlTime'='1518373755')

jflasher/details.md

Example queries

Notes

RocketD0g commented Aug 11, 2020

adjanip commented Oct 5, 2020 • edited Loading

I ran:

Which gives me correctly

but incorrectly

maelle commented Oct 12, 2020

RocketD0g commented Oct 12, 2020

maelle commented Oct 12, 2020

jflasher commented Oct 12, 2020

Ester452 commented Oct 13, 2020

jflasher commented Oct 15, 2020

sruti commented Oct 16, 2020

alfredocambera commented Dec 9, 2020

MaxPowerWasTaken commented Apr 19, 2021 • edited Loading

jflasher commented Apr 19, 2021

MaxPowerWasTaken commented Apr 21, 2021

ShamamaAfnan commented Nov 1, 2021

cszc commented Nov 4, 2021

kirazcloud commented Jul 13, 2022

jflasher commented Jul 13, 2022

russbiggs commented Mar 23, 2023

adjanip commented Oct 5, 2020 •

edited

Loading

MaxPowerWasTaken commented Apr 19, 2021 •

edited

Loading