Created
August 19, 2020 13:28
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Representation of geometries in Apache Arrow memory layout" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Illustration for the discussions at https://github.com/geopandas/geo-arrow-spec/issues/4/ and https://github.com/geopandas/geo-arrow-spec/issues/3/ about alternative ways (other than WKB) to store geometries in Arrow / Parquet." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import pyarrow as pa" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Example array of geometries\n", | |
| "\n", | |
| "I am taking a few example geometries from the Wikipedia page on GeoJSON (https://en.wikipedia.org/wiki/GeoJSON):\n", | |
| "\n", | |
| "\n", | |
| "\n", | |
| "```python\n", | |
| "{\n", | |
| " \"type\": \"MultiPolygon\", \n", | |
| " \"coordinates\": [\n", | |
| " [\n", | |
| " [[40, 40], [20, 45], [45, 30], [40, 40]]\n", | |
| " ], \n", | |
| " [\n", | |
| " [[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]], \n", | |
| " [[30, 20], [20, 15], [20, 25], [30, 20]]\n", | |
| " ]\n", | |
| " ]\n", | |
| "}\n", | |
| "```\n", | |
| "\n", | |
| "\n", | |
| "\n", | |
| "```python\n", | |
| "{\n", | |
| " \"type\": \"Polygon\", \n", | |
| " \"coordinates\": [\n", | |
| " [[35, 10], [45, 45], [15, 40], [10, 20], [35, 10]], \n", | |
| " [[20, 30], [35, 35], [30, 20], [20, 30]]\n", | |
| " ]\n", | |
| "}\n", | |
| "```\n", | |
| "\n", | |
| "\n", | |
| "\n", | |
| "```python\n", | |
| "{\n", | |
| " \"type\": \"MultiPolygon\", \n", | |
| " \"coordinates\": [\n", | |
| " [\n", | |
| " [[30, 20], [45, 40], [10, 40], [30, 20]]\n", | |
| " ], \n", | |
| " [\n", | |
| " [[15, 5], [40, 10], [10, 20], [5, 10], [15, 5]]\n", | |
| " ]\n", | |
| " ]\n", | |
| "}\n", | |
| "```\n", | |
| "\n", | |
| "These three geometries (two MultiPolygons and one Polygon) will be stored in an array, as if you had a GeoDataFrame with 3 rows:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 2, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import geopandas\n", | |
| "\n", | |
| "gdf = geopandas.GeoDataFrame.from_features({'type': 'FeatureCollection',\n", | |
| " 'features': [{'type': 'Feature',\n", | |
| " 'geometry': {'type': 'MultiPolygon',\n", | |
| " 'coordinates': [[[[40, 40], [20, 45], [45, 30], [40, 40]]],\n", | |
| " [[[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]],\n", | |
| " [[30, 20], [20, 15], [20, 25], [30, 20]]]]},\n", | |
| " 'properties': {'attribute': 1}},\n", | |
| " {'type': 'Feature',\n", | |
| " 'geometry': {'type': 'Polygon',\n", | |
| " 'coordinates': [[[35, 10], [45, 45], [15, 40], [10, 20], [35, 10]],\n", | |
| " [[20, 30], [35, 35], [30, 20], [20, 30]]]},\n", | |
| " 'properties': {'attribute': 2}},\n", | |
| " {'type': 'Feature',\n", | |
| " 'geometry': {'type': 'MultiPolygon',\n", | |
| " 'coordinates': [[[[30, 20], [45, 40], [10, 40], [30, 20]]],\n", | |
| " [[[15, 5], [40, 10], [10, 20], [5, 10], [15, 5]]]]},\n", | |
| " 'properties': {'attribute': 3}}]})[['attribute', 'geometry']]" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<div>\n", | |
| "<style scoped>\n", | |
| " .dataframe tbody tr th:only-of-type {\n", | |
| " vertical-align: middle;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe tbody tr th {\n", | |
| " vertical-align: top;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe thead th {\n", | |
| " text-align: right;\n", | |
| " }\n", | |
| "</style>\n", | |
| "<table border=\"1\" class=\"dataframe\">\n", | |
| " <thead>\n", | |
| " <tr style=\"text-align: right;\">\n", | |
| " <th></th>\n", | |
| " <th>attribute</th>\n", | |
| " <th>geometry</th>\n", | |
| " </tr>\n", | |
| " </thead>\n", | |
| " <tbody>\n", | |
| " <tr>\n", | |
| " <th>0</th>\n", | |
| " <td>1</td>\n", | |
| " <td>MULTIPOLYGON (((40.00000 40.00000, 20.00000 45...</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1</th>\n", | |
| " <td>2</td>\n", | |
| " <td>POLYGON ((35.00000 10.00000, 45.00000 45.00000...</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>2</th>\n", | |
| " <td>3</td>\n", | |
| " <td>MULTIPOLYGON (((30.00000 20.00000, 45.00000 40...</td>\n", | |
| " </tr>\n", | |
| " </tbody>\n", | |
| "</table>\n", | |
| "</div>" | |
| ], | |
| "text/plain": [ | |
| " attribute geometry\n", | |
| "0 1 MULTIPOLYGON (((40.00000 40.00000, 20.00000 45...\n", | |
| "1 2 POLYGON ((35.00000 10.00000, 45.00000 45.00000...\n", | |
| "2 3 MULTIPOLYGON (((30.00000 20.00000, 45.00000 40..." | |
| ] | |
| }, | |
| "execution_count": 3, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "gdf" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "So we will look into how this \"geometry\" column with 3 geometries can be stored in an Arrow array." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Constructing an Arrow array with MultiPolygon geometries" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The Arrow type being proposed in https://github.com/geopandas/geo-arrow-spec/issues/4/ is a \"nested list array\" using the Arrow `ListArray` type.\n", | |
| "\n", | |
| "Note: for the example here, we construct the arrays from simple Python lists and dicts. This is of course not how you would construct them in a performance-sensitive application (`pyarrow` offers other methods to construct those arrays more efficiently)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "For MultiPolygons, we are going to use the [variable-size list layout](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout) with multiple levels of nesting. Constructing the type explicitly:\n", | |
| "\n", | |
| "* each element in the array is a list of \"parts\" (the sub-polygons of the MultiPolygon) ..\n", | |
| "* .. where each part consists of a list of \"rings\" (the exterior + potential interior rings) ..\n", | |
| "* .. where each ring consists of a list of (x, y) \"vertices\", each stored as a fixed-size list of two doubles\n", | |
| "\n", | |
| "Only the outer level (the actual geometries) can contain nulls; the inner levels (parts, rings, vertices) cannot be null.\n", | |
| "\n", | |
| "Putting this logic in an Arrow type:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "multipolygon_type = pa.list_(\n", | |
| " pa.field(\"parts\", pa.list_(\n", | |
| " pa.field(\"rings\", pa.list_(\n", | |
| " pa.field(\"vertices\", pa.list_(\n", | |
| " pa.field(\"xy\", pa.float64(), nullable=False), 2),\n", | |
| " nullable=False)), nullable=False))))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 5, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "list<parts: list<rings: list<vertices: fixed_size_list<xy: double not null>[2] not null> not null>>\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "print(multipolygon_type)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "We can now convert our toy geometries to an Arrow array using this type:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 6, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "multi_polygon_1 = [\n", | |
| " [\n", | |
| " [[40., 40], [20, 45], [45, 30], [40, 40]]\n", | |
| " ], \n", | |
| " [\n", | |
| " [[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]], \n", | |
| " [[30, 20], [20, 15], [20, 25], [30, 20]]\n", | |
| " ]\n", | |
| "]\n", | |
| "\n", | |
| "# a single Polygon (here without interior rings), wrapped in an additional level of nesting\n", | |
| "# to turn it into a MultiPolygon with one part\n", | |
| "multi_polygon_2 = [\n", | |
| " [\n", | |
| " [[30, 10], [40, 40], [20, 40], [10, 20], [30, 10]]\n", | |
| " ]\n", | |
| "]\n", | |
| "\n", | |
| "multi_polygon_3 = [\n", | |
| " [\n", | |
| " [[30, 20], [45, 40], [10, 40], [30, 20]]\n", | |
| " ], \n", | |
| " [\n", | |
| " [[15, 5], [40, 10], [10, 20], [5, 10], [15, 5]]\n", | |
| " ]\n", | |
| "]" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "arr = pa.array([multi_polygon_1, multi_polygon_2, multi_polygon_3], type=multipolygon_type)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## How is this \"nested lists\" Arrow array stored in memory?" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### Deconstructing the nested ListArray\n", | |
| "\n", | |
| "Under the hood, the nested Arrow `ListArray` is stored as several arrays of offsets (one per nesting level) and one flat array of all coordinates. Those arrays can be accessed with zero-copy in Python using the pyarrow API:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 8, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
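| "import numpy as np\n", | |
| "\n", | |
| "offsets1 = np.asarray(arr.offsets)  # geometries -> parts\n", | |
| "offsets2 = np.asarray(arr.values.offsets)  # parts -> rings\n", | |
| "offsets3 = np.asarray(arr.values.values.offsets)  # rings -> vertices\n", | |
| "coordinates = np.asarray(arr.values.values.values.values)  # flat x, y values" | |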
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The flat array of all coordinates of the 3 geometries combined:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "array([40., 40., 20., 45., 45., 30., 40., 40., 20., 35., 10., 30., 10.,\n", | |
| " 10., 30., 5., 45., 20., 20., 35., 30., 20., 20., 15., 20., 25.,\n", | |
| " 30., 20., 30., 10., 40., 40., 20., 40., 10., 20., 30., 10., 30.,\n", | |
| " 20., 45., 40., 10., 40., 30., 20., 15., 5., 40., 10., 10., 20.,\n", | |
| " 5., 10., 15., 5.])" | |
| ] | |
| }, | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "coordinates" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "And a set of \"offsets\" or \"indices\" that allow us to interpret the flat array of coordinates as a set of MultiPolygons:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 36, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "GeometryIndices (index into PolygonIndices for start of new MultiPolygon): [0 2 3 5]\n", | |
| "PolygonIndices (index into RingIndices for start of new polygons): [0 1 3 4 5 6]\n", | |
| "RingIndices (index into coordinates for start of rings): [ 0 8 20 28 38 46 56]\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "print(\"GeometryIndices (index into PolygonIndices for start of new MultiPolygon): \", offsets1)\n", | |
| "print(\"PolygonIndices (index into RingIndices for start of new polygons): \", offsets2)\n", | |
| "print(\"RingIndices (index into coordinates for start of rings): \", offsets3 * 2)" | |
| ] | |
| }, | |
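To make the role of the three offsets arrays concrete, here is a minimal sketch in plain NumPy that walks them to recover the rings of one MultiPolygon. The offsets and coordinates are copied from the outputs above; `get_multipolygon` is a hypothetical helper name (not part of pyarrow), and the ring offsets are kept in vertex units, that is, before the `* 2` applied in the print statement.

```python
import numpy as np

geom_offsets = np.array([0, 2, 3, 5])               # geometry -> parts
part_offsets = np.array([0, 1, 3, 4, 5, 6])         # part -> rings
ring_offsets = np.array([0, 4, 10, 14, 19, 23, 28]) # ring -> vertices

# The flat coordinates, viewed as one (x, y) row per vertex.
coords = np.array([
    40, 40, 20, 45, 45, 30, 40, 40,
    20, 35, 10, 30, 10, 10, 30, 5, 45, 20, 20, 35,
    30, 20, 20, 15, 20, 25, 30, 20,
    30, 10, 40, 40, 20, 40, 10, 20, 30, 10,
    30, 20, 45, 40, 10, 40, 30, 20,
    15, 5, 40, 10, 10, 20, 5, 10, 15, 5,
], dtype="float64").reshape(-1, 2)


def get_multipolygon(i):
    """Return geometry i as a list of parts, each a list of (n, 2) ring arrays."""
    parts = []
    for p in range(geom_offsets[i], geom_offsets[i + 1]):
        rings = [
            coords[ring_offsets[r]:ring_offsets[r + 1]]
            for r in range(part_offsets[p], part_offsets[p + 1])
        ]
        parts.append(rings)
    return parts


first = get_multipolygon(0)
print(len(first), [len(ring) for ring in first[1]])
```

Each ring is a slice of the shared coordinate array, so no coordinates are copied when a geometry is looked up.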
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "If we check the \"buffers\" stored by the array, we find the same 4 arrays (the additional `None` values are validity bitmaps, which are not allocated because the array contains no nulls):" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 11, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "[None,\n", | |
| " <pyarrow.lib.Buffer at 0x7efe43f975f0>,\n", | |
| " None,\n", | |
| " <pyarrow.lib.Buffer at 0x7efe43f79b30>,\n", | |
| " None,\n", | |
| " <pyarrow.lib.Buffer at 0x7efe43f79bf0>,\n", | |
| " None,\n", | |
| " None,\n", | |
| " <pyarrow.lib.Buffer at 0x7efe43f79c30>]" | |
| ] | |
| }, | |
| "execution_count": 11, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "arr.buffers()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Representing the different geometry types" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Similarly to the MultiPolygon type, we can also represent the other basic types:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 13, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "fixed_size_list<xy: double not null>[2]\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "point_type = pa.list_(pa.field(\"xy\", pa.float64(), nullable=False), 2)\n", | |
| "print(point_type)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 16, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "list<parts: fixed_size_list<xy: double not null>[2] not null>\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "multipoint_type = pa.list_(pa.field(\"parts\", pa.list_(pa.field(\"xy\", pa.float64(), nullable=False), 2), nullable=False))\n", | |
| "print(multipoint_type)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 17, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "list<parts: list<vertices: fixed_size_list<xy: double not null>[2] not null>>\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "multiline_type = pa.list_(pa.field(\"parts\", pa.list_(pa.field(\"vertices\", pa.list_(pa.field(\"xy\", pa.float64(), nullable=False), 2), nullable=False))))\n", | |
| "print(multiline_type)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Then, a general \"geometry\" type that can store all of those mixed in a single array can be defined as a Union type of the types above:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 19, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "union[dense]<points: list<parts: fixed_size_list<xy: double not null>[2] not null>=0, lines: list<parts: list<vertices: fixed_size_list<xy: double not null>[2] not null>>=1, polygons: list<parts: list<rings: list<vertices: fixed_size_list<xy: double not null>[2] not null> not null>>=2>\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "geometry_type = pa.union(\n", | |
| " [\n", | |
| " pa.field(\"points\", multipoint_type),\n", | |
| " pa.field(\"lines\", multiline_type),\n", | |
| " pa.field(\"polygons\", multipolygon_type)\n", | |
| " ],\n", | |
| " \"dense\")\n", | |
| "print(geometry_type)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "And a \"GeometryCollection\" type could be represented as a list of such unions:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 20, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "list<item: union[dense]<points: list<parts: fixed_size_list<xy: double not null>[2] not null>=0, lines: list<parts: list<vertices: fixed_size_list<xy: double not null>[2] not null>>=1, polygons: list<parts: list<rings: list<vertices: fixed_size_list<xy: double not null>[2] not null> not null>>=2>>\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "print(pa.list_(geometry_type))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python (geo-dev)", | |
| "language": "python", | |
| "name": "geo-dev" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.8.2" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 4 | |
| } |