Skip to content

Instantly share code, notes, and snippets.

@mdsumner
Last active June 19, 2026 07:36
Show Gist options
  • Select an option

  • Save mdsumner/e69d1548bb08fa29f6c1192791fd7ce0 to your computer and use it in GitHub Desktop.

Select an option

Save mdsumner/e69d1548bb08fa29f6c1192791fd7ce0 to your computer and use it in GitHub Desktop.

NOAA GEFS analysis on dynamical.org in earthmover marketplace: https://app.earthmover.io/marketplace/6970566255e09e23d5bcbbc0?tab=dataset

Open an icechunk store in gdalxarray, using the icechunk GDAL driver

# docker run --rm -ti ghcr.io/hypertidy/gdal-r-python:dev 
## uv pip install gdalxarray (if docker image not updated yet)

import xarray
import os
os.environ["AWS_NO_SIGN_REQUEST"] = "YES"
os.environ["AWS_REGION"] = "us-west-2"
import xarray
ds = xarray.open_dataset("/vsis3/dynamical-noaa-gefs/noaa-gefs-analysis/v0.1.2.icechunk", engine = "gdalxarray")

<xarray.Dataset> Size: 7TB
Dimensions:                                     (latitude: 721,
                                                 longitude: 1440, time: 77323)
Coordinates:
  * latitude                                    (latitude) float64 6kB 90.0 ....
  * longitude                                   (longitude) float64 12kB -180...
  * time                                        (time) datetime64[ns] 619kB 2...
Data variables: (12/23)
    categorical_freezing_rain_surface           (time, latitude, longitude) float32 321GB ...
    categorical_ice_pellets_surface             (time, latitude, longitude) float32 321GB ...
    categorical_rain_surface                    (time, latitude, longitude) float32 321GB ...
    categorical_snow_surface                    (time, latitude, longitude) float32 321GB ...
    downward_long_wave_radiation_flux_surface   (time, latitude, longitude) float32 321GB ...
    downward_short_wave_radiation_flux_surface  (time, latitude, longitude) float32 321GB ...
    ...                                          ...
    temperature_2m                              (time, latitude, longitude) float32 321GB ...
    total_cloud_cover_atmosphere                (time, latitude, longitude) float32 321GB ...
    wind_u_100m                                 (time, latitude, longitude) float32 321GB ...
    wind_u_10m                                  (time, latitude, longitude) float32 321GB ...
    wind_v_100m                                 (time, latitude, longitude) float32 321GB ...
    wind_v_10m                                  (time, latitude, longitude) float32 321GB ...
Attributes:
    attribution:         NOAA NWS NCEP GEFS data processed by dynamical.org f...
    dataset_id:          noaa-gefs-analysis
    dataset_version:     0.1.2
    description:         Weather analysis from the Global Ensemble Forecast S...
    license:             CC-BY-4.0
    name:                NOAA GEFS analysis
    spatial_domain:      Global
    spatial_resolution:  0.25 degrees (~20km)
    time_domain:         2000-01-01 00:00:00 UTC to Present
    time_resolution:     3.0 hours
@mdsumner

mdsumner commented Jun 19, 2026

Copy link
Copy Markdown
Author

note that the above is chunked like

(time, latitude, longitude) float32 321GB dask.array<chunksize=(77327, 721, 1440), meta=np.ndarray>

but we can't set chunks = {} atm (iiuc) - workaround is to gdal mdim convert {dsn} **gefs.vrt* --config etc (that worked), and so see that extracting values in a temporal direction is fast (don't try to extract a single time slice):

ds.wind_u_10m.isel(longitude = 100, latitude = 500).values
array([      nan,  4.4375  ,  4.3125  , ..., -3.0625  , -1.515625,
       -1.25    ], shape=(77327,), dtype=float32)



ds.wind_u_10m.isel(longitude = 100, latitude = 500, time = slice(0, 50)).values
array([        nan,  4.4375    ,  4.3125    ,  4.5625    ,  4.9375    ,
        4.375     ,  3.28125   ,  3.90625   ,  3.90625   ,  4.125     ,
        3.53125   ,  2.96875   ,  3.46875   ,  3.09375   ,  0.9609375 ,
        1.09375   ,  1.546875  ,  0.53125   , -0.390625  , -0.3828125 ,
        0.24804688,  0.3359375 , -1.578125  , -0.9296875 , -0.13085938,
       -0.24609375, -1.8125    , -1.1875    , -0.6328125 ,  0.5546875 ,
       -1.046875  , -1.40625   , -1.296875  , -1.21875   , -1.921875  ,
       -1.4375    , -1.890625  , -1.96875   , -2.0625    , -1.109375  ,
        2.34375   ,  2.25      ,  2.96875   ,  4.4375    ,  6.125     ,
        6.1875    ,  4.8125    ,  5.8125    ,  4.625     ,  6.625     ],
      dtype=float32)

@mdsumner

Copy link
Copy Markdown
Author

wrapper for user-ux

"""
open_icechunk — collapse the Icechunk storage→repo→session ceremony into one call
that takes the s3:// path you already have in hand.

Verified assumptions (AWS S3, native Icechunk path, 2026-06):
  - anonymous reads work for public stores (anonymous=True)
  - region is auto-resolved via the S3 301 redirect, so region may be omitted
  - bucket + prefix are distinct fields in ic.s3_storage; an s3:// URL must be split
These hold for *real* AWS S3. For S3-compatible stores (Pawsey Acacia, R2, MinIO,
source.coop-native) you must pass endpoint_url, and region is usually a placeholder.

Design notes:
  - anonymous (identity) and read/write (session intent) are INDEPENDENT axes.
    anonymous=True does not "mean read-only"; it means "no credentials". The
    read-only-ness of public buckets comes from their IAM policy, not from this flag.
    So `writable` is a separate argument, never inferred from `anonymous`.
  - region=None lets the client auto-resolve (works on AWS). Pass region explicitly
    for custom endpoints or if a backend (e.g. GDAL /vsis3/) needs it.

Requires: icechunk, xarray, zarr.
"""

from __future__ import annotations

from typing import Any
from urllib.parse import urlparse

import icechunk as ic
import xarray as xr


def parse_s3_path(path: str) -> tuple[str, str]:
    """Split an s3://bucket/prefix... (or bare bucket/prefix) into (bucket, prefix).

    >>> parse_s3_path("s3://dynamical-noaa-gefs/noaa-gefs-analysis/v0.1.2.icechunk")
    ('dynamical-noaa-gefs', 'noaa-gefs-analysis/v0.1.2.icechunk')
    >>> parse_s3_path("dynamical-noaa-gefs/some/prefix/")
    ('dynamical-noaa-gefs', 'some/prefix')
    """
    p = path.strip()
    if p.startswith("s3://"):
        u = urlparse(p)
        bucket, prefix = u.netloc, u.path.lstrip("/")
    else:
        # bare "bucket/prefix..." form
        p = p.lstrip("/")
        bucket, _, prefix = p.partition("/")
    bucket = bucket.strip("/")
    prefix = prefix.strip("/")
    if not bucket:
        raise ValueError(f"could not parse a bucket from path: {path!r}")
    if not prefix:
        # Icechunk requires a prefix; a store at bucket root is unusual but allow caller to know
        raise ValueError(
            f"no prefix found in {path!r}; Icechunk stores live under a prefix "
            "(e.g. s3://bucket/path/to/store.icechunk)"
        )
    return bucket, prefix


def icechunk_storage(
    path: str,
    *,
    region: str | None = None,
    anonymous: bool = True,
    endpoint_url: str | None = None,
    **storage_kwargs: Any,
) -> "ic.Storage":
    """Build an icechunk Storage from an s3:// path.

    region=None      -> auto-resolved on AWS S3 (recommended for AWS).
    anonymous=True   -> no credentials (public read). Set False to use the
                        credential chain (env/profile/role). NOTE: authenticated
                        != writable; storage IAM decides what you may do.
    endpoint_url     -> set for S3-compatible stores (Pawsey Acacia, R2, MinIO).
                        For these, region is typically a placeholder or ignored.
    """
    bucket, prefix = parse_s3_path(path)
    kwargs: dict[str, Any] = dict(
        bucket=bucket,
        prefix=prefix,
        anonymous=anonymous,
    )
    if region is not None:
        kwargs["region"] = region
    if endpoint_url is not None:
        # ic.s3_storage accepts endpoint_url for compatible stores; name may vary
        # by version (endpoint_url vs endpoint) — adjust if your icechunk differs.
        kwargs["endpoint_url"] = endpoint_url
    kwargs.update(storage_kwargs)
    return ic.s3_storage(**kwargs)


def open_repo(path: str, **storage_kwargs: Any) -> "ic.Repository":
    """Open (not create) an Icechunk repository from an s3:// path."""
    return ic.Repository.open(icechunk_storage(path, **storage_kwargs))


def open_icechunk(
    path: str,
    *,
    branch: str = "main",
    tag: str | None = None,
    snapshot: str | None = None,
    writable: bool = False,
    region: str | None = None,
    anonymous: bool = True,
    endpoint_url: str | None = None,
    open_kwargs: dict[str, Any] | None = None,
    return_store: bool = False,
    **storage_kwargs: Any,
):
    """Open an Icechunk store from an s3:// path as an xarray.Dataset.

    One call replaces: ic.s3_storage(...) -> ic.Repository.open(...) ->
    repo.readonly_session(...) -> xr.open_dataset(session.store, engine="zarr", ...).

    Parameters
    ----------
    path : str
        s3://bucket/prefix... pointing at the Icechunk store root.
    branch, tag, snapshot :
        Which ref to read. Default branch="main". Pass exactly one of
        tag/snapshot to override (snapshot pins an exact commit).
    writable : bool
        Start a writable session instead of read-only. INDEPENDENT of `anonymous`:
        you can request writable while anonymous, but the commit will fail at the
        storage layer unless your identity has PutObject rights. Default False.
    region, anonymous, endpoint_url :
        See icechunk_storage(). region=None auto-resolves on AWS.
    open_kwargs : dict
        Extra kwargs passed to xr.open_dataset (merged over the defaults
        engine="zarr", consolidated=False, chunks={}).
    return_store : bool
        If True, return (dataset, session) so you can keep the session (needed
        to commit when writable=True). Otherwise return just the dataset.

    Returns
    -------
    xarray.Dataset, or (xarray.Dataset, session) if return_store=True.
    """
    if sum(x is not None for x in (tag, snapshot)) > 1:
        raise ValueError("pass at most one of tag= or snapshot=")

    repo = open_repo(
        path,
        region=region,
        anonymous=anonymous,
        endpoint_url=endpoint_url,
        **storage_kwargs,
    )

    # Resolve the ref into a session. Method spelling has varied across icechunk
    # versions; this targets the readonly_session(branch=...)/(tag=...)/(snapshot=...)
    # and writable_session(branch=...) shape. Adjust if your version differs.
    if writable:
        if tag is not None or snapshot is not None:
            raise ValueError("cannot open a writable session on a tag or snapshot; use a branch")
        session = repo.writable_session(branch)
    else:
        if snapshot is not None:
            session = repo.readonly_session(snapshot_id=snapshot)
        elif tag is not None:
            session = repo.readonly_session(tag=tag)
        else:
            session = repo.readonly_session(branch)

    defaults = dict(engine="zarr", consolidated=False, chunks={})
    if open_kwargs:
        defaults.update(open_kwargs)
    ds = xr.open_dataset(session.store, **defaults)

    if return_store:
        return ds, session
    return ds


if __name__ == "__main__":
    # smoke test against a public store (anonymous, region auto-resolved)
    ds = open_icechunk("s3://dynamical-noaa-gefs/noaa-gefs-analysis/v0.1.2.icechunk")
    print(ds)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment