jenningsanderson/daylight_osm_distribution_parquet.md

Last active November 13, 2024 18:28

Star (11) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Select an option

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/jenningsanderson/3e42a99dcb8f760038ad8aa47ea38ce8.js"></script>
Save jenningsanderson/3e42a99dcb8f760038ad8aa47ea38ce8 to your computer and use it in GitHub Desktop.

Download ZIP

"Analysis Ready" Daylight OSM Distribution Available on AWS

Raw

daylight_osm_distribution_parquet.md

What are the Daylight OpenStreetMap Parquet Files?

Listed on the registry of Open Data on AWS, the Daylight OpenStreetMap Parquet files contain the latest Daylight Map Distribution of OpenStreetMap in an analysis-ready format. This dataset is optimized for cloud-based queries with Amazon Athena, meaning anyone can access the entire dataset with SQL queries in the browser, without the need to download or access the files directly.

The Daylight Map Distribution of OpenStreetMap is always openly available for download in the standard OSM PBF format (find it at daylightmap.org). The parquet files, however, were first made available alongside Daylight release v1.9. They contain fully resolved geometries and additional metadata including areas, lengths, and quadkeys, not present in the PBF.

In total, the OSM features files contain nearly 1B features including 178M+ nodes, 816M+ ways, and 5M+ relations. These are all renderable features, meaning they have geometries that can be rendered on a map. The OSM features files therefore do not include untagged nodes or relations without basic geometries (such as turn restrictions). Untagged objects can also be found in this analysis-ready format in the OSM elements files, described in the comments.

Setting Up The `daylight_osm_features` Table

The following CREATE TABLE command will add the daylight_osm_features table to your AWS data (i.e. Glue) catalog. For best performance, make sure your Athena console is running in the same region as the s3://daylight-openstreetmap bucket: us-west-2.

CREATE EXTERNAL TABLE `daylight_osm_features`(
  `id` bigint, 
  `version` int, 
  `changeset` bigint, 
  `created_at` timestamp, 
  `tags` map<string,string>, 
  `wkt` string, 
  `min_lon` double, 
  `max_lon` double, 
  `min_lat` double, 
  `max_lat` double, 
  `quadkey` string, 
  `linear_meters` double,
  `square_meters` double)
PARTITIONED BY ( 
  `release` string,
  `type` string)
STORED AS PARQUET
LOCATION
  's3://daylight-openstreetmap/parquet/osm_features/'

Next, load the partitions by running: MSCK REPAIR TABLE daylight_osm_features.

Remember to use WHERE release = to query only one dataset (such as v1.10 or v1.9). These are the release numbers of the Daylight Map Distribution. Omitting this WHERE clause will result in querying duplicate features.

Basic Examples

1. Total kilometers of different highway features in OSM?

SELECT tags [ 'highway' ] AS highway_tag,
    count(id) AS osm_features,
    count(distinct(changeset)) AS total_changesets,
	sum(linear_meters) / 1000 AS total_km
FROM daylight_osm_features
WHERE linear_meters > 0 AND tags['highway'] IS NOT NULL
  AND release='v1.9'
GROUP BY tags [ 'highway' ]
ORDER BY total_km DESC

2. Extract point geometries for all hospital features in OSM?

SELECT 
    tags['name'] as name,
    ST_CENTROID(ST_GEOMETRYFROMTEXT(wkt)) as center,
    tags['amenity'] as amenity
FROM daylight_osm_features
WHERE tags['amenity'] LIKE 'hospital'
  AND release='v1.9'

Author

jenningsanderson commented Dec 23, 2021 •

edited

Loading

Extrapolating on the Hospital Features query, here is a comaprison of all amenity=hospital (red) and amenity=ice_cream (blue) features in OpenStreetMap. You can see that ice cream shops are far more prevalent on coasts around the world (shown almost outlining Western Europe), with high density of both features (shown in white) in major cities globally.

Data (c) OpenStreetMap Contributors

Albeit a silly example, this query takes ~25 seconds and generated 16MB of data, demonstrating both the power and simplicity (fewer than 5 lines) of this approach to OSM data analysis.

Author

jenningsanderson commented Feb 16, 2022

For those familiar with the OSM data model, The OSM Elements files include all of the untagged nodes as well as the refs and roles for relation members and nodes within ways. The OSM Elements table contains all of the information in the Daylight OSM PBF, in addition to resolved geometries (where possible), bounding box, length, area, and quadkey.

The OSM Elements Table can be created with this query:

CREATE EXTERNAL TABLE `daylight_osm_elements`(
  `id` bigint, 
  `version` int, 
  `changeset` bigint, 
  `created_at` timestamp, 
  `tags` map<string,string>, 
  `wkt` string, 
  `min_lon` double, 
  `max_lon` double, 
  `min_lat` double, 
  `max_lat` double, 
  `quadkey` string, 
  `linear_meters` double, 
  `square_meters` double, 
  `node_refs` array<bigint>, 
  `node_roles` array<string>, 
  `way_refs` array<bigint>, 
  `way_roles` array<string>, 
  `rel_refs` array<bigint>, 
  `rel_roles` array<string>)
PARTITIONED BY ( 
  `release` string, 
  `type` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://daylight-openstreetmap/parquet/osm_elements'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'parquet.compression'='GZIP')

jonah-dawg commented Apr 21, 2023 •

edited

Loading

Is there an IAM role or policy needed in addition to any of this?

I’m encountering a “user denied” error when running the MSCK command to load the partitions. In the past we’ve created tables over external OSM data so I doubt that’s it

Author

jenningsanderson commented Apr 21, 2023

@jonah-dawg - interesting, maybe double check your Athena results destination / permissions your user has over these types of queries. My guess is that the error is coming from athena trying to write the results of the partition load query.

Yes, if you’re using an IAM role for this, there are a number of glue and Athena permissions on your account required to run queries / affect the data catalog, typically the error returned however lists exactly the permission required.

Author

jenningsanderson commented Apr 21, 2023

Additionally, it might not be necessary to load all partitions. You can use the “alter table, add partition” syntax for release=v1.25 if you only want the latest data. Maybe see if that returns a more descriptive error with regard to permissions?

jonah-dawg commented Apr 21, 2023

@jenningsanderson thanks for the recommendation, it did reveal some more explicit errors

jenningsanderson/daylight_osm_distribution_parquet.md

Select an option

No results found

Select an option

No results found

What are the Daylight OpenStreetMap Parquet Files?

Setting Up The `daylight_osm_features` Table

Basic Examples

1. Total kilometers of different highway features in OSM?

2. Extract point geometries for all hospital features in OSM?

jenningsanderson commented Dec 23, 2021 •

edited

Loading

Uh oh!

jenningsanderson commented Feb 16, 2022

Uh oh!

jonah-dawg commented Apr 21, 2023 •

edited

Loading

Uh oh!

jenningsanderson commented Apr 21, 2023

Uh oh!

jenningsanderson commented Apr 21, 2023

Uh oh!

jonah-dawg commented Apr 21, 2023

Uh oh!

jenningsanderson/daylight_osm_distribution_parquet.md

What are the Daylight OpenStreetMap Parquet Files?

Setting Up The daylight_osm_features Table

Basic Examples

1. Total kilometers of different highway features in OSM?

2. Extract point geometries for all hospital features in OSM?

jenningsanderson commented Dec 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jenningsanderson commented Feb 16, 2022

Uh oh!

jonah-dawg commented Apr 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jenningsanderson commented Apr 21, 2023

Uh oh!

jenningsanderson commented Apr 21, 2023

Uh oh!

jonah-dawg commented Apr 21, 2023

Uh oh!

Setting Up The `daylight_osm_features` Table

jenningsanderson commented Dec 23, 2021 •

edited

Loading

jonah-dawg commented Apr 21, 2023 •

edited

Loading