jorisvandenbossche · May 5, 2020 13:47
diff --git a/geo-arrow-spec-illustration.ipynb b/geo-arrow-spec-illustration.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Representation of geometries in Apache Arrow memory layout"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Illustration for the discussions at https://github.com/geopandas/geo-arrow-spec/issues/4/ and https://github.com/geopandas/geo-arrow-spec/issues/3/ about alternative ways (other than WKB) to store geometries in Arrow / Parquet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyarrow as pa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example array of geometries\n",
    "\n",
    "I am taking a few example geometries from the Wiki page (https://en.wikipedia.org/wiki/GeoJSON):\n",
    "\n",
    "![image](https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/SFA_MultiPolygon.svg/51px-SFA_MultiPolygon.svg.png)\n",
    "\n",
    "```python\n",
    "{\n",
    "    \"type\": \"MultiPolygon\", \n",
    "    \"coordinates\": [\n",
    "        [\n",
    "            [[40, 40], [20, 45], [45, 30], [40, 40]]\n",
    "        ], \n",
    "        [\n",
    "            [[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]], \n",
    "            [[30, 20], [20, 15], [20, 25], [30, 20]]\n",
    "        ]\n",
    "    ]\n",
    "}\n",
    "```\n",
    "\n",
    "![image](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/SFA_Polygon.svg/51px-SFA_Polygon.svg.png)\n",
    "\n",
    "```python\n",
    "{\n",
    "    \"type\": \"Polygon\", \n",
    "    \"coordinates\": [\n",
    "        [[35, 10], [45, 45], [15, 40], [10, 20], [35, 10]], \n",
    "        [[20, 30], [35, 35], [30, 20], [20, 30]]\n",
    "    ]\n",
    "}\n",
    "```\n",
    "\n",
    "![image](https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/SFA_MultiPolygon.svg/51px-SFA_MultiPolygon.svg.png)\n",
    "\n",
    "```python\n",
    "{\n",
    "    \"type\": \"MultiPolygon\", \n",
    "    \"coordinates\": [\n",
    "        [\n",
    "            [[30, 20], [45, 40], [10, 40], [30, 20]]\n",
    "        ], \n",
    "        [\n",
    "            [[15, 5], [40, 10], [10, 20], [5, 10], [15, 5]]\n",
    "        ]\n",
    "    ]\n",
    "}\n",
    "```\n",
    "\n",
    "Those three geometries (2 MultiPolygons, and one Polygon) will be stored in an array, as if you have a GeoDataFrame with 3 rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import geopandas\n",
    "\n",
    "gdf = geopandas.GeoDataFrame.from_features({'type': 'FeatureCollection',\n",
    " 'features': [{'type': 'Feature',\n",
    "   'geometry': {'type': 'MultiPolygon',\n",
    "    'coordinates': [[[[40, 40], [20, 45], [45, 30], [40, 40]]],\n",
    "     [[[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]],\n",
    "      [[30, 20], [20, 15], [20, 25], [30, 20]]]]},\n",
    "   'properties': {'attribute': 1}},\n",
    "  {'type': 'Feature',\n",
    "   'geometry': {'type': 'Polygon',\n",
    "    'coordinates': [[[35, 10], [45, 45], [15, 40], [10, 20], [35, 10]],\n",
    "     [[20, 30], [35, 35], [30, 20], [20, 30]]]},\n",
    "   'properties': {'attribute': 2}},\n",
    "  {'type': 'Feature',\n",
    "   'geometry': {'type': 'MultiPolygon',\n",
    "    'coordinates': [[[[30, 20], [45, 40], [10, 40], [30, 20]]],\n",
    "     [[[15, 5], [40, 10], [10, 20], [5, 10], [15, 5]]]]},\n",
    "   'properties': {'attribute': 3}}]})[['attribute', 'geometry']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>attribute</th>\n",
       "      <th>geometry</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>MULTIPOLYGON (((40.00000 40.00000, 20.00000 45...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>POLYGON ((35.00000 10.00000, 45.00000 45.00000...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>MULTIPOLYGON (((30.00000 20.00000, 45.00000 40...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   attribute                                           geometry\n",
       "0          1  MULTIPOLYGON (((40.00000 40.00000, 20.00000 45...\n",
       "1          2  POLYGON ((35.00000 10.00000, 45.00000 45.00000...\n",
       "2          3  MULTIPOLYGON (((30.00000 20.00000, 45.00000 40..."
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we will look into how this \"geometry\" column with 3 geometries can be stored in an Arrow array."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Constructing Arrow arrays with geometries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to construct 2 different Arrow arrays (a `ListArray` and a `StructArray`), as 2 possible representations of the multi-polygons in Arrow columnar format.\n",
    "\n",
    "Note: for the example here, we construct them from simple python lists and dicts. This is of course not how you would construct them in a performance-sensitive application (`pyarrow` offers other methods to construct those arrays more efficiently)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we create the `ListArray` ([variable sized list layout](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout)) with multiple levels of nesting:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "multi_polygon_1 = [\n",
    "    [\n",
    "        [[40., 40], [20, 45], [45, 30], [40, 40]]\n",
    "    ], \n",
    "    [\n",
    "        [[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]], \n",
    "        [[30, 20], [20, 15], [20, 25], [30, 20]]\n",
    "    ]\n",
    "]\n",
    "\n",
    "# using an additional level of nesting to turn the Polygon into a MultiPolygon with one part\n",
    "multi_polygon_2 = [\n",
    "    [\n",
    "        [[30, 10], [40, 40], [20, 40], [10, 20], [30, 10]]\n",
    "    ]\n",
    "]\n",
    "\n",
    "multi_polygon_3 = [\n",
    "    [\n",
    "        [[30, 20], [45, 40], [10, 40], [30, 20]]\n",
    "    ], \n",
    "    [\n",
    "        [[15, 5], [40, 10], [10, 20], [5, 10], [15, 5]]\n",
    "    ]\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "array_lists = pa.array([multi_polygon_1, multi_polygon_2, multi_polygon_3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The type of the array (as inferred by pyarrow from the data in this case) is a nested list type (\"a list of list of list of doubles\", and thus very close to how the data are represented in GeoJSON):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ListType(list<item: list<item: list<item: list<item: double>>>>)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "array_lists.type"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Second, we create the `StructArray` ([struct layout](https://arrow.apache.org/docs/format/Columnar.html#struct-layout)):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "multi_polygon_1_as_struct = {\n",
    "  'type': 'Polygon',\n",
    "  'positions': [40., 40, 20, 45, 45, 30, 40, 40, 20, 35, 10, 30, 10, 10, 30, 5, 45, 20, 20, 35, 30, 20, 20, 15, 20, 25, 30, 20],\n",
    "  'size': 2,\n",
    "  'polygonIndices': [0, 4, 14],\n",
    "  'ringIndices': [0, 4, 10, 14]\n",
    "}\n",
    "\n",
    "multi_polygon_2_as_struct = {\n",
    "  'type': 'Polygon',\n",
    "  'positions': [30., 10, 40, 40, 20, 40, 10, 20, 30, 10],\n",
    "  'size': 2,\n",
    "  'polygonIndices': [0, 5],\n",
    "  'ringIndices': [0, 5]\n",
    "}\n",
    "\n",
    "multi_polygon_3_as_struct = {\n",
    "  'type': 'Polygon',\n",
    "  'positions': [30., 20, 45, 40, 10, 40, 30, 20, 15,  5, 40, 10, 10, 20,  5, 10, 15, 5],\n",
    "  'size': 2,\n",
    "  'polygonIndices': [0, 4, 9],\n",
    "  'ringIndices': [0, 4, 9]\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "array_structs = pa.array([multi_polygon_1_as_struct, multi_polygon_2_as_struct, multi_polygon_3_as_struct])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "StructType(struct<polygonIndices: list<item: int64>, positions: list<item: double>, ringIndices: list<item: int64>, size: int64, type: string>)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "array_structs.type"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How are the Arrow arrays stored in memory?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deconstructing the nested ListArray\n",
    "\n",
    "The Arrow `ListArray` is under the hood stored as multiple arrays of offsets and one flat array of all coordinates. Those different arrays can be accessed with zero-copy in Python using the pyarrow API:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "L_offsets1 = np.asarray(array_lists.offsets)\n",
    "L_offsets2 = np.asarray(array_lists.values.offsets)\n",
    "L_offsets3 = np.asarray(array_lists.values.values.offsets)\n",
    "L_coordinates = np.asarray(array_lists.values.values.values.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The flat array of all coordinates of the 3 geometries combined:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([40., 40., 20., 45., 45., 30., 40., 40., 20., 35., 10., 30., 10.,\n",
       "       10., 30.,  5., 45., 20., 20., 35., 30., 20., 20., 15., 20., 25.,\n",
       "       30., 20., 30., 10., 40., 40., 20., 40., 10., 20., 30., 10., 30.,\n",
       "       20., 45., 40., 10., 40., 30., 20., 15.,  5., 40., 10., 10., 20.,\n",
       "        5., 10., 15.,  5.])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "L_coordinates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And a set of \"offsets\" of \"indices\" that allow to interpret the flat array of coordinates as a set of MultiPolygons:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GeometryIndices (index into PolygonIndices for start of new MultiPolygon):  [0 2 3 5]\n",
      "PolygonIndices (index into RingIndices for start of new polygons):  [0 1 3 4 5 6]\n",
      "RingIndices (index into coordinates for start of rings):  [ 0  8 20 28 38 46 56]\n"
     ]
    }
   ],
   "source": [
    "print(\"GeometryIndices (index into PolygonIndices for start of new MultiPolygon): \", L_offsets1)\n",
    "print(\"PolygonIndices (index into RingIndices for start of new polygons): \", L_offsets2)\n",
    "print(\"RingIndices (index into coordinates for start of rings): \", L_offsets3 * 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deconstructing the StructArray\n",
    "\n",
    "The Arrow `StructArray` is under the hood stored as separate arrays for each key of the struct (where one array holds all the values of the different structs for a single key):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "S_coordinates = np.asarray(array_structs.field(\"positions\").values)\n",
    "S_offsets1 = np.asarray(array_structs.field(\"positions\").offsets)\n",
    "S_offsets2 = np.asarray(array_structs.field(\"polygonIndices\").values)\n",
    "S_offsets3 = np.asarray(array_structs.field(\"ringIndices\").values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, the flat array of all coordinates of the 3 geometries combined:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([40., 40., 20., 45., 45., 30., 40., 40., 20., 35., 10., 30., 10.,\n",
       "       10., 30.,  5., 45., 20., 20., 35., 30., 20., 20., 15., 20., 25.,\n",
       "       30., 20., 30., 10., 40., 40., 20., 40., 10., 20., 30., 10., 30.,\n",
       "       20., 45., 40., 10., 40., 30., 20., 15.,  5., 40., 10., 10., 20.,\n",
       "        5., 10., 15.,  5.])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "S_coordinates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And again a set of indices, but with a different meaning:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GeometryIndices (index into coordinates for start of new MultiPolygon):  [ 0 28 38 56]\n",
      "PolygonIndices (index into coordinates *subset* for start of new polygons):  [ 0  8 28  0 10  0  8 18]\n",
      "RingIndices (index into coordinates *subset* for start of rings):  [ 0  8 20 28  0 10  0  8 18]\n"
     ]
    }
   ],
   "source": [
    "print(\"GeometryIndices (index into coordinates for start of new MultiPolygon): \", S_offsets1)\n",
    "print(\"PolygonIndices (index into coordinates *subset* for start of new polygons): \", S_offsets2 * 2)\n",
    "print(\"RingIndices (index into coordinates *subset* for start of rings): \", S_offsets3 * 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Comparison"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So some notes about those two representations:\n",
    "\n",
    "* In both cases, we have the same flat array of coordinates available (this is the bulk of the actual data)\n",
    "* The exact indices are different. Some differences:\n",
    "  * For the StructArray, all the PolygonIndies and RingIndices are *relative* indices, that are only valid after taking the subset of the coordinates array for the specific geometry\n",
    "  * For the ListArray, the indices are indexing into one of the other arrays of indices, except for the RingIndices, which are pointing into the array of coordinates\n",
    "\n",
    "It probably depends on the specicific application which of the two representation of indices is most handy to work with. \n",
    "\n",
    "But, there is also no fundamental differences, as it is *relatively* easy to get the one representation from the other or the other way around."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, getting the index where each new MultiPolygon starts in the coordinates array in both cases:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0, 14, 19, 28], dtype=int32)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "L_offsets3[L_offsets2][L_offsets1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0, 14, 19, 28], dtype=int32)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "S_offsets1 // 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or for the RingIndices, with some gymnastics, one can convert the \"relative\" indices from the StructArray into absolute indices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  8, 20, 28, 38, 46, 56], dtype=int32)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "L_offsets3 * 2  # directly pointing into the coordinates array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  8, 20, 28,  0, 10,  0,  8, 18])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "S_offsets3 * 2  # those are relative"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 4, 6, 9], dtype=int32)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# the offsets of the RingIndices (where each new geometry (MultiPolygon) starts)\n",
    "S_offsets3_offsets = np.asarray(array_structs.field(\"ringIndices\").offsets)\n",
    "S_offsets3_offsets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0, 28, 38, 56], dtype=int32)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# the start index into the coordinates array per geometry (what we need to add to the ring indices)\n",
    "S_offsets1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  0,  0,  0, 28, 28, 38, 38, 38], dtype=int32)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we need to repeat those start indices above for the number of rings\n",
    "ring_indices_starts = np.repeat(S_offsets1[:-1], np.diff(S_offsets3_offsets))\n",
    "ring_indices_starts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  8, 20, 28, 28, 38, 38, 46, 56])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# and now we can add those\n",
    "(S_offsets3 * 2) + ring_indices_starts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  8, 20, 28, 38, 46, 56], dtype=int32)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# for comparison, the ones from the ListArray\n",
    "L_offsets3 * 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The only remaining difference is that the end / start index is repeated where a new MultiPolygon starts (this could also be avoid by not string the last index (superfluous) index in the \"ringIndices\" field in the structs)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Memory usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compare the memory usage of both representations (although for such a toy example, this might not be that informative of the difference in \"real\" situations):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "632"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "array_lists.nbytes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "693"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "array_structs.nbytes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The StructArray is a bit bigger, but we are not making a fully fair comparison. for both ListArray as StructArray we can optimize this a bit further:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the inner lists (x, y) pairs can be left out, *if* we store metadata about the number of dimensions somewhere\n",
    "multi_polygon_1 = [\n",
    "    [\n",
    "        [40, 40, 20, 45, 45, 30, 40, 40]\n",
    "    ], \n",
    "    [\n",
    "        [20, 35, 10, 30, 10, 10, 30, 5, 45, 20, 20, 35], \n",
    "        [30, 20, 20, 15, 20, 25, 30, 20]\n",
    "    ]\n",
    "]\n",
    "\n",
    "multi_polygon_2 = [\n",
    "    [\n",
    "        [30, 10, 40, 40, 20, 40, 10, 20, 30, 10]\n",
    "    ]\n",
    "]\n",
    "\n",
    "multi_polygon_3 = [\n",
    "    [\n",
    "        [30, 20, 45, 40, 10, 40, 30, 20]\n",
    "    ], \n",
    "    [\n",
    "        [15, 5, 40, 10, 10, 20, 5, 10, 15, 5]\n",
    "    ]\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "array_lists = pa.array([multi_polygon_1, multi_polygon_2, multi_polygon_3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "516"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "array_lists.nbytes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "# for a fairer comparison, we remove the duplicated \"type\" and \"size\" fields (assumeing this could also be stored in metadata), and use int32 for the indices:\n",
    "multi_polygon_1_as_struct = {\n",
    "  'positions': [40, 40, 20, 45, 45, 30, 40, 40, 20, 35, 10, 30, 10, 10, 30, 5, 45, 20, 20, 35, 30, 20, 20, 15, 20, 25, 30, 20],\n",
    "  'polygonIndices': np.array([0, 4, 14], dtype=\"int32\"),\n",
    "  'ringIndices': np.array([0, 4, 10, 14], dtype=\"int32\"),\n",
    "}\n",
    "\n",
    "multi_polygon_2_as_struct = {\n",
    "  'positions': [30, 10, 40, 40, 20, 40, 10, 20, 30, 10],\n",
    "  'polygonIndices': np.array([0, 5], dtype=\"int32\"),\n",
    "  'ringIndices': np.array([0, 5], dtype=\"int32\")\n",
    "}\n",
    "\n",
    "multi_polygon_3_as_struct = {\n",
    "  'positions': [30, 20, 45, 40, 10, 40, 30, 20, 15,  5, 40, 10, 10, 20,  5, 10, 15, 5],\n",
    "  'polygonIndices': np.array([0, 4, 9], dtype=\"int32\"),\n",
    "  'ringIndices': np.array([0, 4, 9], dtype=\"int32\")\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "array_structs = pa.array([multi_polygon_1_as_struct, multi_polygon_2_as_struct, multi_polygon_3_as_struct])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "564"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "array_structs.nbytes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (arrow-dev)",
   "language": "python",
   "name": "arrow-dev"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
 }
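
To make the ListArray layout from the notebook above more concrete, here is a minimal sketch that rebuilds the nested, GeoJSON-like coordinates from the flat coordinates array and the three offset arrays. It uses plain Python lists instead of pyarrow or NumPy so that it runs without dependencies. The values are copied from the notebook outputs; the helper name `decode_multipolygons` and its `ndim` parameter are assumptions made for this illustration and are not part of pyarrow or of the spec under discussion.

```python
# Flat (x, y) coordinates of the three geometries, as printed in the notebook.
coordinates = [
    40, 40, 20, 45, 45, 30, 40, 40,                  # geometry 0, polygon 0, ring 0
    20, 35, 10, 30, 10, 10, 30, 5, 45, 20, 20, 35,   # geometry 0, polygon 1, ring 0
    30, 20, 20, 15, 20, 25, 30, 20,                  # geometry 0, polygon 1, ring 1 (hole)
    30, 10, 40, 40, 20, 40, 10, 20, 30, 10,          # geometry 1, polygon 0, ring 0
    30, 20, 45, 40, 10, 40, 30, 20,                  # geometry 2, polygon 0, ring 0
    15, 5, 40, 10, 10, 20, 5, 10, 15, 5,             # geometry 2, polygon 1, ring 0
]
geometry_offsets = [0, 2, 3, 5]             # index into polygon_offsets
polygon_offsets = [0, 1, 3, 4, 5, 6]        # index into ring_offsets
ring_offsets = [0, 8, 20, 28, 38, 46, 56]   # index into coordinates (point offsets * 2)


def decode_multipolygons(coords, geom_off, poly_off, ring_off, ndim=2):
    """Rebuild nested coordinates from flat arrays (hypothetical helper)."""
    geometries = []
    for g in range(len(geom_off) - 1):
        polygons = []
        # Neighbouring geometry offsets delimit the polygons of geometry g.
        for p in range(geom_off[g], geom_off[g + 1]):
            rings = []
            # Neighbouring polygon offsets delimit the rings of polygon p.
            for r in range(poly_off[p], poly_off[p + 1]):
                flat = coords[ring_off[r]:ring_off[r + 1]]
                # Regroup the flat values into points of `ndim` values each.
                rings.append([flat[i:i + ndim] for i in range(0, len(flat), ndim)])
            polygons.append(rings)
        geometries.append(polygons)
    return geometries


multipolygons = decode_multipolygons(
    coordinates, geometry_offsets, polygon_offsets, ring_offsets
)
print(multipolygons[1])
```

Every nesting level only needs a pair of neighbouring offsets to find its children, so a single geometry can be decoded without touching the coordinates of the others.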
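
The conversion of the StructArray's *relative* ring indices into absolute ones, done step by step in the notebook with `np.repeat` and `np.diff`, can also be sketched as one small function that drops the repeated start / end index along the way. This is only an illustration in plain Python: the input values are copied from the notebook, and the function name `to_absolute_ring_offsets` and the `ndim` parameter are hypothetical.

```python
# Where each geometry starts in the flat coordinates array
# (the offsets of the "positions" field in the notebook).
positions_offsets = [0, 28, 38, 56]

# "ringIndices" per geometry: counted in points, relative to that geometry.
relative_ring_indices = [[0, 4, 10, 14], [0, 5], [0, 4, 9]]


def to_absolute_ring_offsets(positions_offsets, relative_ring_indices, ndim=2):
    """Merge per-geometry relative ring indices into one absolute offsets list."""
    absolute = [positions_offsets[0]]
    for start, ring_indices in zip(positions_offsets, relative_ring_indices):
        # Skip the leading 0 of each geometry: in absolute terms it is the
        # same position as the end of the previous geometry.
        for index in ring_indices[1:]:
            # Relative point index -> absolute index into the coordinates.
            absolute.append(start + index * ndim)
    return absolute


absolute_ring_offsets = to_absolute_ring_offsets(
    positions_offsets, relative_ring_indices
)
print(absolute_ring_offsets)
```

For the three example geometries, the result is the same sequence as the ListArray's `L_offsets3 * 2`, which shows that the two representations carry the same information.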