@jpivarski
Last active June 21, 2022 18:32
Awkward Arrays in spatialpandas

I managed to iterate over Awkward Arrays and rasterize the NYC buildings as polygons. The spatialpandas code is pretty well integrated with the ragged data structures you've built; there's a lot of code that twiddles offset arrays. I couldn't use the build_polygon function directly, but ported over enough of it into my own Numba-compiled function to reproduce the output.
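
To give a flavor of the iteration, here is a minimal, hypothetical sketch of the pattern (not the ported rasterizer itself; count_edges is a made-up name, and it assumes Awkward's Numba extension is available):

import numba as nb

@nb.njit
def count_edges(geometry):
    # nested for loops replace spatialpandas's offset twiddling:
    # Awkward's Numba extension makes each level directly iterable
    total = 0
    for multipolygon in geometry:
        for polygon in multipolygon:
            for ring in polygon:          # a ring is a flat list of interleaved lon, lat
                total += len(ring) // 2   # two floats per vertex
    return total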

These are Matplotlib's imshow displays of images made by iterating over Awkward Arrays; the axes are flipped from the usual longitude, latitude orientation because I'm just dumping the array as an image, but I verified on one complex building that I am exactly reproducing spatialpandas's output (including the short-circuit code paths, in which a polygon is smaller than a pixel). The first is low-resolution and the second is high-resolution, corresponding to the minimum and maximum numbers of pixels in the performance studies later in this email.
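
(For orientation, the displays were of the following form; agg stands in for whatever 2-d array the rasterizer filled, a dummy here:)

import numpy as np
import matplotlib.pyplot as plt

agg = np.zeros((300, 400), dtype=np.int32)   # stand-in for the rasterized counts
plt.imshow(agg)   # imshow draws axis 0 vertically, which flips longitude, latitude
plt.show()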


Now for performance: nothing is lost by switching to Awkward Array. One caveat, which I only realized afterward: the Awkward test measures less work than the spatialpandas test (detailed below), but that doesn't change the conclusion.

What was measured for spatialpandas: https://github.com/holoviz/datashader/blob/ba23c79c7559cd17c788875165a08cb9b338cd15/datashader/glyphs/polygon.py#L235-L266

  1. do a first pass to determine the sizes of the temporary arrays; O(n) for n polygons, touching only the offsets, not the longitude, latitude data
  2. allocate the temporary arrays, once
  3. iterate over the polygons, rasterizing them; O(n) for n polygons (I'm not sure of the time complexity in the number of pixels)

What was measured for awkward: https://github.com/holoviz/datashader/blob/ba23c79c7559cd17c788875165a08cb9b338cd15/datashader/glyphs/polygon.py#L255-L266

  1. iterate over the polygons, rasterizing them; O(n) for n polygons (I'm not sure of the time complexity in the number of pixels)

So the ~1.1 seconds of start-up time in spatialpandas's scaling with the number of pixels (the y-intercept, below) may be due to the extra steps that spatialpandas is doing, but even so, the following non-Numba Awkward Array code does that work in 0.052 seconds:

import awkward as ak   # assuming the transitional release in which ak._v2 is available
import numpy as np

max_edges = ak._v2.max(ak._v2.num(geometry, axis=3) // 2)   # determine max_edges in a one-liner

xs = np.full((max_edges, 2), np.nan, dtype=np.float32)
ys = np.full((max_edges, 2), np.nan, dtype=np.float32)
yincreasing = np.zeros(max_edges, dtype=np.int8)
eligible = np.ones(max_edges, dtype=np.int8)

so even including the max_edges determination and allocation in the Awkward Array test wouldn't add much to the orange line. (It's also possible that doing the max_edges determination in Numba would be faster than the one-liner because there would be fewer passes over the data, but maybe not, because ak.max and ak.num can take advantage of vectorization; see the sketch below.)
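
(For concreteness, a single-pass Numba version might look like the following sketch; I didn't measure this, and max_edges_in_numba is a made-up name following the same iteration pattern as before:)

import numba as nb

@nb.njit
def max_edges_in_numba(geometry):
    # one pass over the data, with no intermediate arrays from ak.num and ak.max
    best = 0
    for multipolygon in geometry:
        for polygon in multipolygon:
            for ring in polygon:
                n = len(ring) // 2   # interleaved lon, lat: two floats per edge
                if n > best:
                    best = n
    return best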

In all tests, the time spent JIT-compiling functions was excluded. It would add a second or two, but only once per session.

Scaling with the number of polygons is also important; for this I used the maximum number of pixels (just as the scaling-with-pixels plot used the maximum number of polygons).

Then there's also the "read and prepare data" time. The following are with all of the Parquet files completely in OS cache (verified with vmtouch).

For spatialpandas, it's 460 ms:

import spatialpandas as spd
import spatialpandas.io   # loads the io submodule so that spd.io is accessible
import holoviews as hv

ddf = spd.io.read_parquet("nyc_buildings.parq")
polys = hv.Polygons(ddf)

For Awkward Array, it's 506 ms:

array = ak._v2.from_parquet("nyc_buildings.parq")
geometry = ak._v2.fill_none(array.geometry, 999, axis=-1)
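
(A sketch of how these wall-clock numbers can be reproduced, using only the standard library for timing:)

import time

t0 = time.perf_counter()
array = ak._v2.from_parquet("nyc_buildings.parq")
geometry = ak._v2.fill_none(array.geometry, 999, axis=-1)
print(f"read and prepare: {time.perf_counter() - t0:.3f} s")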

Part of this time is spent concatenating bitmasks because the geometry data are option-type at every level (the type is "option[var * option[var * option[var * option[float64]]]]"). In this dataset, I think we don't even want option-type data: none of the longitude, latitude points are actually missing, and neither are any of the lists (or lists of lists, or lists of lists of lists). ak.from_parquet ought to have an ignore_optiontype argument that can be applied to sets of columns, because we do want the option-type data for the "building type" strings; a lot of those are masked out to indicate "unknown".

The ak.fill_none call above removes the option-type from the numerical data, since option-type would change the implementation of draw_polygons (we'd have to add a code path to handle the "longitude, latitude is missing" case). In fact, leaving the option-types on the lists slows down Awkward iteration in Numba because those lists are potentially missing: the runtime code has to check each one to decide whether to raise an exception when iterating over None. In other studies, I found that time to be insignificant, so I didn't bother to take it out of the performance tests in the previous section.
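
(A toy illustration of what that fill_none call removes and what it leaves behind; the array here is made up, and I'm abbreviating ak._v2 as ak:)

import awkward as ak

toy = ak.Array([[[1.0, None], [2.0, 3.0]], [None, [4.0]]])
print(toy.type)     # roughly 2 * var * option[var * ?float64]: options at two levels
filled = ak.fill_none(toy, 999, axis=-1)
print(filled.type)  # the floats are no longer optional, but the lists still are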

To see what time would be gained by simply not including these unnecessary option-types, I temporarily hacked the proposed argument in and measured the following line at 376 ms:

array = ak._v2.from_parquet("nyc_buildings.parq", ignore_optiontype=["geometry"])

(Technical note: "concatenating bitmasks" is expensive because the lengths of the data they represent are not guaranteed to be multiples of 8. Instead of leaving them as bitmasks, they're expanded into a more flexible data structure, but all of that work is completely unnecessary for this problem.)
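
(A toy NumPy illustration: two bitmasks covering 5 and 6 values can't be concatenated bytewise because neither length is a multiple of 8, so they're expanded to one byte per value first:)

import numpy as np

# Arrow-style bitmasks are packed least-significant-bit first
a = np.unpackbits(np.array([0b00010110], dtype=np.uint8), bitorder="little")[:5]
b = np.unpackbits(np.array([0b00101101], dtype=np.uint8), bitorder="little")[:6]
mask = np.concatenate([a, b])   # eleven one-byte flags: the "more flexible data structure"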


Getting back to the big picture, Awkward Arrays can definitely act as a base layer for a future spatialpandas. Some steps simplify (and might be faster) outside of Numba-compiled functions, such as getting the maximum number of edges in order to allocate arrays, and many of the complex Numba-compiled functions can be reused once their "offsets and values" logic is removed and replaced with easier-to-read iterators. The code would be more maintainable, but getting it to that point would be a considerable amount of work.

In performance, nothing is lost, though some additional functionality may be needed in Awkward Array to keep it that way, such as skipping unwanted masks at read time. Small images and small polygon sets might get a little more streamlined (it's too early to say, since Awkward Array's test was a reimplemented sandbox, not the full framework), but in the limits that are dominated by the algorithms, there's not much difference, as expected.

Having gone through this, it looks to me like refactoring spatialpandas on Awkward Array would be a lot of work: something on the scale of a reimplementation. The "offsets and values" logic wasn't abstracted out; it's everywhere.

As for wrapping Awkward Arrays as Pandas ExtensionArrays, we'll be doing it, motivated by this project, though we'll probably need it for other projects as well. It would be a separate package, so the distinction between an ak.Array and an Awkward Series would be user-visible. I've talked with Martin about that, since it has implications for Dask integration: Awkward, Awkward-Pandas, Awkward-Dask, and Awkward-Pandas-Dask (in dask.dataframe) are all relevant combinations.

Let me know what you hear from your colleagues and how probable you think this refactoring would be. We could also meet at some point so that I can answer questions about the proof of concept and the performance measurements above. As before, you can freely distribute this email.
