@sharkinsspatial
Last active December 3, 2024 21:16

Holy grail

Before diving too deeply into the various friction points of working with archives of earth observation data in xarray, let's look at a more optimal case from the earth systems world. In the notebook here we demonstrate how, by using zarr's consolidated metadata option to access dimension and chunk reference information, a massive dataset's dimensions and variables can be loaded extremely quickly. With this consolidated metadata available to reference chunks on disk, we can leverage xarray's dask integration and use normal xarray operations to lazily load chunks in parallel, performing our calculations with dask's blocked algorithm implementations. Gravy.
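
Concretely, a minimal sketch of that workflow looks something like this (the store URL and variable name are placeholders, and reading from s3:// additionally assumes s3fs/fsspec is installed):

```python
import xarray as xr

# Placeholder URL -- any zarr store written with consolidated metadata works.
store = "s3://hypothetical-bucket/big-earth-systems-dataset.zarr"

# A single small read fetches the consolidated metadata, so this returns almost
# immediately even for a multi-terabyte dataset; no chunk data is touched yet.
ds = xr.open_zarr(store, consolidated=True)

# Ordinary xarray operations build a lazy dask graph over the on-disk chunks
# ("temperature" is a hypothetical variable name)...
monthly = ds["temperature"].groupby("time.month").mean()

# ...and only compute() actually reads chunks, in parallel, using dask's
# blocked algorithm implementations.
result = monthly.compute()
```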

Challenges

But the earth observation story is more complicated... Not everything lives in standardized file containers, and more importantly our grid coordinate systems are "all over the map" :] Here are some of the current challenges.

  1. Consolidated metadata. Arrays of STAC items using the projection extension actually provide a rough facsimile of zarr's consolidated metadata for referencing bytes in cloud storage. When referencing COGs this is an even closer approximation, since we can use windowed reads to access smaller chunks even within large file containers. Consistent STAC metadata is key for providing lazy, chunked access to large EO data archives (see the first sketch after this list).

  2. CRS Representation. xarray's current data model provides no explicit way to store CRS information on dimension coordinates for use with an index. One of xarray's most valuable features is alignment on index coordinates, but its utility is limited across disparate CRSs because data must first be pre-processed into a common coordinate system. Several projects have bolted CRS handling on, but each has limitations: rioxarray relies on rasterio and adds convenience methods that require geospatial knowledge (see the second sketch after this list), while geoxarray paused development due to technical difficulties. There are some exciting new developments on this front; more on that later.

  3. GDAL. A nightmare mishmash of EO data formats has proliferated over the years. Luckily we have GDAL to provide a performant, consistent access model for them. But despite its robustness and flexibility, GDAL/Rasterio have two issues that make them less than ideal for large parallelized read scenarios with Dask. The first is the thread-safety restriction that GDAL Dataset handles cannot be shared between threads. This is especially relevant with dask, which in many cases needs to opaquely support underlying ThreadPoolExecutor models and a variety of multi-processing models (which may themselves use threads internally), leading to a whole host of interesting, hard-to-reproduce failure scenarios. I'll let @vincentsarago weigh in with more info here. Gabe Joseph from Coiled has been providing some lower-level thread management in the excellent stackstac, but there may be other options to consider, since managing multiple dataset handles carries overhead. Probably more important, and closer to @vincentsarago's heart, is the lack of true asyncio mechanisms in GDAL/Rasterio. This is a bit of a moot point without a non-GDAL I/O library, given the blocking issues at the GDAL level; see here for more in-depth discussion.
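
To make the first challenge concrete, here is a rough sketch of the STAC-items-as-consolidated-metadata pattern using pystac-client and stackstac; the API endpoint, collection, and search parameters are examples only:

```python
import pystac_client
import stackstac

# Example STAC API endpoint and search -- substitute your own catalog.
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")
search = catalog.search(
    collections=["sentinel-s2-l2a-cogs"],
    bbox=[-105.1, 39.9, -104.9, 40.1],
    datetime="2021-05",
)

# Each item's projection-extension fields play the role of zarr's consolidated
# metadata: stackstac assembles a lazy (time, band, y, x) DataArray whose dask
# chunks map onto windowed COG reads. No pixel data is fetched here.
da = stackstac.stack(search.get_all_items())
```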

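And for the second challenge, a sketch of what today's bolt-on CRS handling looks like with rioxarray; the file path is a placeholder, and note that none of the CRS information is visible to xarray's own index and alignment machinery:

```python
import rioxarray

# Placeholder path -- any GeoTIFF/COG readable by rasterio works.
da = rioxarray.open_rasterio("scene.tif")

# The CRS lives on the .rio accessor rather than on an xarray index, so core
# xarray operations (alignment, sel) know nothing about it.
print(da.rio.crs)

# Reprojection into a common grid is an explicit, user-driven pre-processing
# step instead of something index alignment handles for you.
da_utm = da.rio.reproject("EPSG:32613")
```
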
Where Do We Go From Here

So this is a long list of problems without much talk about solutions. Where do we want to go? I initially wrote some of these notes a few weeks ago but got scooped by some smarter folks in this discussion. My vision is that a scientist can:

  1. Use a list of STAC items implementing the projection extension, together with an xarray convenience method, to create an xarray dataset with concatenated spatial dimensions.
  2. Seamlessly use a pure asyncio I/O library like aiocogeo to support dynamic reads in parallel dask workers.
  3. Use automated, CRS-aware coordinate index alignment via native xarray resampling that users do not have to be explicitly aware of.
  4. Use CRS-aware indexes for selection and with core xarray operations (a purely hypothetical sketch of what this could feel like follows this list).
  5. Potentially have the ability to use a classic, non-asyncio driver such as rasterio to save analysis results in EO formats, as rioxarray does today, but with blocked parallel algorithms constructed in a dask graph.
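
None of the API below exists today; it is a purely hypothetical sketch of how points 1-4 might feel to a scientist, with every function name invented for illustration:

```python
# Purely hypothetical -- every call below is invented to illustrate the vision;
# `utm_items` and `wgs84_items` stand in for lists of STAC items.
import xarray as xr

# 1. A convenience method turns STAC items (with the projection extension)
#    into datasets with concatenated spatial dimensions and CRS-aware indexes.
ds_a = xr.open_stac(utm_items)    # hypothetical convenience method
ds_b = xr.open_stac(wgs84_items)

# 4. Selection works through the CRS-aware index.
subset = ds_a.sel(x=slice(400_000, 410_000), y=slice(4_430_000, 4_420_000))

# 3. Arithmetic between datasets in different CRSs aligns and resamples
#    transparently -- no manual reprojection step for the scientist.
combined = ds_a + ds_b
```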

I think we are at a very exciting spot right now, and there are new developments that can facilitate this. I would like to see us focus our efforts and thinking on the slightly longer game of helping these developments mature, so that in the end the community is using a consistent mental model and set of tools. I also feel that, given the nature of the issues, it might be better to focus future work on smaller, modular tooling rather than on more monolithic solutions like rioxarray. What is happening, and what needs to happen?

  1. Most vital is the index refactoring project currently underway in xarray, which will make CRS-aware indexes possible. Read this and this for further context. I don't have the skills to support the xarray efforts directly, but I do think we can assist @djhoese as he works to implement CRS-aware indexes and grid resampling. I would love for us to coordinate more with him on this.

  2. Work to battle-harden aiocogeo and make it usable across the variety of authentication and storage schemes that GDAL supports. In the near term I think all of this work should focus on building a pure, non-Rasterio I/O implementation specifically for COGs (see the sketch after this list), while we fall back to Rasterio/GDAL I/O for legacy formats and continue to push the community to COGify the world.

  3. Investigate helping Microsoft coordinate funding for this work. We at DevSeed are probably not the best ones to actually implement this code, but I think this approach is key to making the Planetary Computer vision of seamless, hyper-large-scale analysis of EO data a frictionless reality for scientists.

  4. Focus the community. There is a lot of uncertainty around the path forward in this space at the moment. I would love to see someone smarter than me take this distillation and turn it into something more cohesive that could be used to guide development planning and work going forward.
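
To give a flavor of the aiocogeo point above, here is a rough sketch based on aiocogeo's documented COGReader interface (the URL is a placeholder; double-check method names against the library's README):

```python
import asyncio

from aiocogeo import COGReader


async def main():
    # Placeholder URL -- any COG reachable over HTTP works.
    async with COGReader("https://example.com/scene.tif") as cog:
        # Only the header/IFD bytes are fetched via range requests to get
        # here: no GDAL dataset handle, so no thread-safety restrictions.
        print(cog.profile)

        # Internal tiles can then be read concurrently on the event loop,
        # which is exactly what parallel dask workers want.
        tile = await cog.get_tile(0, 0, 0)
        print(tile.shape)


asyncio.run(main())
```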

@djhoese commented Jun 2, 2021

Just in case this group isn't aware, I started a PR for rasterio to read open file objects (like those from fsspec) into GDAL directly: rasterio/rasterio#2141. Obviously aiocogeo sounds way more performant, but this might be a step in the middle of that "COGify the world" process.
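
For reference, a sketch of the workflow that PR enables, assuming a rasterio build that includes it (the path is a placeholder, and reading s3:// via fsspec assumes s3fs is installed):

```python
import fsspec
import rasterio
from rasterio.windows import Window

# Placeholder path -- any remote raster fsspec can reach.
with fsspec.open("s3://hypothetical-bucket/scene.tif", "rb") as f:
    # With the linked PR, rasterio hands this Python file object to GDAL
    # directly instead of requiring a local path or a fully buffered copy.
    with rasterio.open(f) as src:
        print(src.profile)
        block = src.read(1, window=Window(0, 0, 512, 512))
```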

@sharkinsspatial (Author)

Including a reference to the xarray flexible index work for tracking purposes: https://github.com/pydata/xarray/blob/main/design_notes/flexible_indexes_notes.md
