{
"cells": [
{
"cell_type": "markdown",
"id": "fe17af6b-6678-4ee6-866c-fd1c53787c15",
"metadata": {},
"source": [
"# Access Patterns to Remote Data with *fsspec*\n",
"\n",
"Accessing remote data with xarray usually means working with cloud-optimized formats like Zarr or COGs, the [CMIP6 tutorial](remote-data.ipynb) shows this pattern in detail. These formats were designed to be efficiently accessed over the internet, however in many cases we might need to access data that is not be available in such formats. \n",
"\n",
"This notebook will explore how we can levarage xarray's backends to access remote files. For this we will make use of [`fsspec`](https://github.com/fsspec/filesystem_spec), a powerful Python library that abstracts the internal implementation of remote storage systems into a uniform API that can be used by many file-format specific libraries.\n",
"\n",
"Before starting with remote data, it may be helpful to understand how xarray handles local files and how xarray backends work. Let's consider a scenario where we have a local NetCDF4 file containing gridded data. NetCDF is a common file format used in scientific research for storing array-like data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c527e9e-cf5f-46e8-bfb4-72301ee51037",
"metadata": {},
"outputs": [],
"source": [
"import xarray as xr\n",
"\n",
"ds = xr.open_dataset(\"../data/sst.mnmean.nc\")\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "5a44116d-49b4-4e87-b505-7dca4914ab66",
"metadata": {},
"source": [
"### xarray backends under the hood\n",
"\n",
"* What happened when we ran `xr.open_dataset(\"path-to-file\")`? \n",
"\n",
"As we know xarray is a very flexible and modular library. When we open a file, we are asking xarray to use one of its format specific engines to get the actual array data from the file into memory. File formats come in different flavors, from general purpose HDF5 to the very domain-specific ones like GRIB2. When we call `open_dataset()` the first thing xarray does is trying to guess which of the preinstalled backends can handle this file, in this case we passed a string with a valid local path.\n",
"\n",
"We'll use a helper function to print a simplified call stack and see what's going on under the hood. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "01271135-40fe-49b6-8746-e5061130ccbf",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"from IPython.display import Code\n",
"\n",
"\n",
"tracing_output = []\n",
"_match_pattern = \"xarray\"\n",
"\n",
"def trace_calls(frame, event, arg):\n",
" if event == 'call':\n",
" code = frame.f_code\n",
" func_name = code.co_name\n",
" func_file = code.co_filename.split(\"/site-packages/\")[-1]\n",
" func_line = code.co_firstlineno\n",
" if not func_name.startswith(\"_\") and _match_pattern in func_file:\n",
" tracing_output.append(f\"def {func_name}() at {func_file}:{func_line}\")\n",
" return trace_calls\n",
"\n",
"# we enable tracing and call open_dataset()\n",
"sys.settrace(trace_calls)\n",
"ds = xr.open_dataset(\"../data/sst.mnmean.nc\")\n",
"sys.settrace(None)\n",
"\n",
"# Print the trace with some syntax highlighting \n",
"Code(\" \\n\".join(tracing_output[0:10]), language='python') "
]
},
{
"cell_type": "markdown",
"id": "b2d90714-544e-4653-a391-204f94e767b5",
"metadata": {},
"source": [
"#### **What are we seeing?** \n",
"\n",
"* xarray uses `guess_engine()` to identify which backend can open the file.\n",
"* `guess_engine()` will looop through the preinstalled backends and will run `guess_can_open()`.\n",
"* if an engine can handle the file type it will verify that we are working with a local file.\n",
"* Once that we know which backend we'll use we invoke that backend implementation of `open_dataset()`.\n",
"\n",
"Let's tell xarray which backend we need for our local file. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7376766-ea7e-4546-a082-5da4fc733ea8",
"metadata": {},
"outputs": [],
"source": [
"tracing_output = []\n",
"\n",
"sys.settrace(trace_calls)\n",
"ds = xr.open_dataset(\"../data/sst.mnmean.nc\", engine=\"h5netcdf\")\n",
"sys.settrace(None)\n",
"\n",
"# Print the top 10 calls to public methods\n",
"Code(\" \\n\".join(tracing_output[0:10]), language='python') "
]
},
{
"cell_type": "markdown",
"id": "9104e9b6-75cc-4e07-9424-e6c1bcb5b203",
"metadata": {},
"source": [
"> It is important to note that there are overlaps between the pre-installed backends in xarray. Many of these backends support the same formats (e.g., NetCDF-4), and xarray uses them in a specific order unless a particular backend is specified. For example, when we request the h5netcdf engine, xarray will not attempt to guess the backend. However, it will still check if the URI is remote, which will involve some calls to a context manager. By examining the call stack, we can observe the use of a file handler and a cache, which are crucial for efficiently accessing remote files."
]
},
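{
"cell_type": "markdown",
"id": "1a2b3c4d-aaaa-4bbb-8ccc-0d1e2f3a4b5c",
"metadata": {},
"source": [
"We can also ask xarray directly which engine it would guess for a given path. The sketch below uses `guess_engine()` from `xarray.backends.plugins`; note that this is an internal helper, so its location and behavior may change between xarray versions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b3c4d5e-bbbb-4ccc-8ddd-1e2f3a4b5c6d",
"metadata": {},
"outputs": [],
"source": [
"# Internal xarray API: subject to change between versions\n",
"from xarray.backends.plugins import guess_engine\n",
"\n",
"# For a local NetCDF-4 file, the first matching backend wins (typically netcdf4)\n",
"guess_engine(\"../data/sst.mnmean.nc\")"
]
},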
{
"cell_type": "markdown",
"id": "e124c821-0efc-4c82-83ab-dda94cd16754",
"metadata": {},
"source": [
"### Supported file formats by backend\n",
"\n",
"The `open_dataset()` method is our entry point to n-dimensional data with xarray, the first argument we pass indicates what we want to open and is used by xarray to get the right backend and in turn is used by the backend to open the file locally or remote. The accepted types by xarray are:\n",
"\n",
"\n",
"* **str**: \"my-file.nc\" or \"s3:://my-zarr-store/data.zarr\"\n",
"* **os.PathLike**: Posix compatible path, most of the times is a Pathlib cross-OS compatible path.\n",
"* **BufferedIOBase**: some xarray backends can read data from a buffer, this is key for remote access.\n",
"* **AbstractDataStore**: This one is the generic store and backends should subclass it, if we do we can pass a \"store\" to xarray like in the case of Opendap/Pydap\n",
"\n"
]
},
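{
"cell_type": "markdown",
"id": "3c4d5e6f-cccc-4ddd-8eee-2f3a4b5c6d7e",
"metadata": {},
"source": [
"As a quick illustration of the first three input types, here is a minimal sketch that opens the same local file as a string, as an `os.PathLike`, and as a file-like buffer. The engine is pinned to `h5netcdf` in the buffer case because, as we'll see below, not every backend accepts one."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d5e6f70-dddd-4eee-8fff-3a4b5c6d7e8f",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"# 1. str: a plain path (or a URI like \"s3://bucket/key\")\n",
"ds_from_str = xr.open_dataset(\"../data/sst.mnmean.nc\")\n",
"\n",
"# 2. os.PathLike: a pathlib.Path works the same way\n",
"ds_from_path = xr.open_dataset(Path(\"../data/sst.mnmean.nc\"))\n",
"\n",
"# 3. buffer: requires a backend that can read from file-like objects, e.g. h5netcdf\n",
"with open(\"../data/sst.mnmean.nc\", \"rb\") as f:\n",
"    ds_from_buffer = xr.open_dataset(f, engine=\"h5netcdf\")"
]
},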
{
"cell_type": "code",
"execution_count": null,
"id": "59a19602-1ad4-4d94-9448-4ddd7aad1631",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Listing which backends we have available, if we install more they should show up here.\n",
"xr.backends.list_engines()"
]
},
{
"cell_type": "markdown",
"id": "986c6a11-d4a7-4969-b2ac-e3200820fa97",
"metadata": {},
"source": [
"### Trying to access a file on cloud storage (AWS S3)\n",
"\n",
"Now let's try to open a file on a remote file system, this will fail and we'll take a look into why it failed and how we'll use fsspec to overcome this."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c15211e6-5d8c-4ac6-96ff-bdd2dadf721d",
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" ds = xr.open_dataset(\"s3://its-live-data/test-space/sample-data/sst.mnmean.nc\")\n",
"except Exception as e:\n",
" print(e)"
]
},
{
"cell_type": "markdown",
"id": "d6372a94-50e7-458d-a764-6ba5570e164e",
"metadata": {},
"source": [
"xarray iterated through the registered backends and netcdf4 returned a `\"yes, I can open that extension\"` see: [netCDF4_.py#L618 ](https://github.com/pydata/xarray/blob/6c2d8c3389afe049ccbfd1393e9a81dd5c759f78/xarray/backends/netCDF4_.py#L618). However, **the backend doesn't know how to \"talk\" to a remote store** and thus it fails to open our file.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "f16f114c-41a0-4c33-b33d-685d2382b1a3",
"metadata": {},
"source": [
"## Supported format + Read from Buffers = Remote access \n",
"\n",
"Some of xarray's backends can read and write data to memory, this coupled with fsspec's ability to abstract remote files allows us to **access remote files as if they were local**. The following table helps us to identify if a backend can be used to access remote files with fsspec.\n",
"\n",
"\n",
"| Backend | HDF/NetCDF Support | Can Read from Buffer | Handles Own I/O |\n",
"|-----------------|--------------------|----------------------|-----------------|\n",
"| netCDF4 | Yes | No | Yes |\n",
"| scipy | Limited | Yes | Yes |\n",
"| pydap | Yes | No | No |\n",
"| h5netcdf | Yes | Yes | Yes |\n",
"| zarr | No | Yes | Yes |\n",
"| cfgrib | Yes | No | Yes |\n",
"| rasterio | Partial | Yes | No |\n",
"\n",
"\n",
"\n",
"**Can Read from Buffer**: Libraries that can read from buffers do not need to open a file using the operating system machinery and they allow the use of memory to open our files in whatever way we want as long as we have a seekable buffer (random access). \n",
"\n",
"**Handles Own I/O**: Some libraries have self contained code that can handle I/O, compression, codecs and data access. Some engines task their I/O to lower level libraries. This is the case with rasterio that uses GDAL to access raster files. If a Library is in control of its own I/O operations can be easily adapted to read from buffers.\n",
"\n",
"```mermaid\n",
"graph TD\n",
" A[\"netCDF-4 (.nc, .nc4) and most HDF5 files\"] -->|netcdf4| B[\"Remote Access: No\"]\n",
" A -->|h5netcdf| C[\"Remote Access: Yes\"]\n",
" \n",
" D[\"netCDF files (.nc, .cdf, .gz)\"] -->|scipy| E[\"Remote Access: Yes\"]\n",
" \n",
" F[\"zarr files (.zarr)\"] -->|zarr| G[\"Remote Access: Yes\"]\n",
"\n",
" H[\"OpenDAP\"] -->|pydap| I[\"Remote Access: Yes\"]\n",
"```"
]
},
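{
"cell_type": "markdown",
"id": "5e6f7081-eeee-4fff-9000-4b5c6d7e8f90",
"metadata": {},
"source": [
"To see the \"Can Read from Buffer\" column in action, here is a minimal sketch that loads our local sample file into an in-memory, seekable buffer and hands it to the `h5netcdf` engine. Trying the same with `engine=\"netcdf4\"` would fail, since that backend cannot read from buffers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6f708192-ffff-4000-9111-5c6d7e8f9001",
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"\n",
"# Load the whole file into an in-memory, seekable buffer\n",
"with open(\"../data/sst.mnmean.nc\", \"rb\") as f:\n",
"    buffer = io.BytesIO(f.read())\n",
"\n",
"# h5netcdf can read NetCDF-4/HDF5 data straight from the buffer\n",
"ds = xr.open_dataset(buffer, engine=\"h5netcdf\")\n",
"ds"
]
},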
{
"cell_type": "markdown",
"id": "bc0b8b57-3a60-4d35-ba96-7daf4abd830f",
"metadata": {},
"source": [
"## Remote Access and File Caching\n",
"\n",
"When we use fsspec to abstract a remote file we are in essence translating byte requests to HTTP range requests over the internet. An HTTP request is a costly I/O operation compared to accessing a local file. Because of this, it's common that libraries that handle over the network data transfers implement a cache to avoid requesting the same data over and over. In the case of fsspec there are different ways to ask the library to handle this **caching and this is one of the most relevant performance considerations** when we work with xarray and remote data. \n",
"\n",
"fsspec default cache is called `read-ahead` and as its name suggests it will read ahead of our request a fixed amount of bytes, this is good when we are working with text or tabular data but it's really an anti pattern when we work with scientific data formats. [Benchmaks show]() that any of the caching schemas will perform better than using the default `read-ahead`. \n",
" \n",
"\n",
"### fsspec caching implementations.\n",
"\n",
"#### simple cache + `open_local()`\n",
"\n",
"The simplest way to use fsspec is to cache remote files locally. Since we are using a local storage for our cache, backends like `netcdf4` will be reading from disk avoiding the issue of not being able to read directly from buffers. This pattern can be applied to different backends that don't support buffers with the disadvantage that we'll be caching whole files and using disk space. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e028466-0562-4cb6-84bd-ef8551509653",
"metadata": {},
"outputs": [],
"source": [
"import fsspec\n",
"\n",
"\n",
"uri = \"https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc\"\n",
"# we prepend the cache type to the URI, this is called protocol chaining in fsspec-speak\n",
"file = fsspec.open_local(f\"simplecache::{uri}\", filecache={'cache_storage':'/tmp/fsspec_cache'})\n",
"\n",
"ds = xr.open_dataset(file, engine=\"netcdf4\")\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "3fde50d4-7523-4c4e-af33-d2588a2b6670",
"metadata": {},
"source": [
"#### block cache + `open()`\n",
"\n",
"If our backend support reading from a buffer we can cache only the parts of the file that we are reading, this is useful but tricky. As we mentioned before fsspec default cache will request an overhead of 5MB ahead of the byte offset we request, and if we are reading small chunks from our file it will be really slow and incur in unecessary transfers. \n",
"\n",
"Let's open the same file but using the `h5netcdf` engine and we'll use a block cache strategy that stores predefined block sizes from our remote file. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7198585-b995-47b2-9ea0-4e4ccf056df0",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"uri = \"https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc\"\n",
"\n",
"fs = fsspec.filesystem('http')\n",
"\n",
"fsspec_caching = {\n",
" \"cache_type\": \"blockcache\", # block cache stores blocks of fixed size and uses eviction using a LRU strategy. \n",
" \"block_size\": 8*1024*1024 # size in bytes per block, adjust depends on the file size but the recommended size is in the MB \n",
"}\n",
"\n",
"# Note that if we use a context, we'll close the file after the block so operations on xarray may fail if we don't load our data arrays.\n",
"with fs.open(uri, **fsspec_caching) as file:\n",
" ds = xr.open_dataset(file, engine=\"h5netcdf\")\n",
" mean = ds.sst.mean()\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "02283d10-0314-42a9-9671-2197a14c795f",
"metadata": {},
"source": [
"### Reading data from cloud storage\n",
"\n",
"So far we have only used HTTP to access a remote file, however the commercial cloud has their own implementations with specific features. fsspec allows us to talk to different cloud storage implementations hiding these details from us and the libraries we use. Now we are going to access the same file using the S3 protocol. \n",
"\n",
"> Note: S3, Azure blob, etc all have their names and prefixes but under the hood they still work with the HTTP protocol.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af6915ca-0bb0-47d8-b7b7-b3c4f0e48dd1",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"uri = \"s3://its-live-data/test-space/sample-data/sst.mnmean.nc\"\n",
"\n",
"# If we need to pass credentials to our remote storage we can do it here, in this case this is a public bucket\n",
"fs = fsspec.filesystem('s3', anon=True)\n",
"\n",
"fsspec_caching = {\n",
" \"cache_type\": \"blockcache\", # block cache stores blocks of fixed size and uses eviction using a LRU strategy. \n",
" \"block_size\": 8*1024*1024 # size in bytes per block, adjust depends on the file size but the recommended size is in the MB \n",
"}\n",
"\n",
"# we are not using a context, we can use ds until we manually close it.\n",
"ds = xr.open_dataset(fs.open(uri, **fsspec_caching), engine=\"h5netcdf\")\n",
"ds"
]
},
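{
"cell_type": "markdown",
"id": "708192a3-0a0b-4c0d-9222-6d7e8f900112",
"metadata": {},
"source": [
"Because we skipped the context manager, releasing the remote file handle is up to us. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8192a3b4-1b1c-4d1e-9333-7e8f90011223",
"metadata": {},
"outputs": [],
"source": [
"# Load any data we still need into memory, then release the file handle\n",
"sst = ds.sst.load()\n",
"ds.close()"
]
},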
{
"cell_type": "markdown",
"id": "70a744c8-f1fb-4824-a681-e9fa50108ee1",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **fsspec and remote access.**\n",
"\n",
">fsspec is a Python library that provides a unified interface to various filesystems, enabling access to local, remote, and cloud storage systems.\n",
"It supports a wide range of protocols such as http, https, s3, gcs, ftp, and many more.\n",
"One of the key features of fsspec is its ability to cache remote files locally, improving performance by reducing latency and bandwidth usage.\n",
"\n",
"2. **xarray Backends.**\n",
"\n",
">xarray backends offer flexible support for opening datasets stored in different formats and locations.\n",
"By leveraging various backends along with fsspec, we can open, read, and analyze complex datasets efficiently, without worrying about the underlying file format or storage mechanism.\n",
"\n",
"3. **Combining fsspec with xarray**\n",
"\n",
"> xarray can work with fsspec filesystems to open and cache remote files and use caching strategies to optimize its data transfer.\n",
"\n",
"\n",
"\n",
"By leveraging these tools and techniques, you can efficiently manage and process large, remote datasets in a way that optimizes performance and accessibility.\n",
"\n",
"\n",
"![xarray-access(2)](https://gist.github.com/assets/717735/e005ac17-9030-4adc-ab6a-290f82b50228)\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}