
Michael Sumner mdsumner

  • Integrated Digital East Antarctica, Australian Antarctic Division
  • Hobart, Australia

Please, I think I'm doing something wrong here. With open_virtual_mfdataset(..., parallel=False) this runs in about 2 minutes.

With open_virtual_mfdataset(..., parallel=ThreadPoolExecutor) it takes the same length of time (on a system with 32 CPUs).

import time
from concurrent.futures import ThreadPoolExecutor

import xarray as xr
from obstore.store import from_url
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
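For what it's worth, one possible explanation (an assumption on my part, not a diagnosis): if the per-file reference parsing is GIL-bound pure-Python work, a ThreadPoolExecutor won't reduce wall time no matter how many CPUs are available. A minimal illustration of that effect, deliberately not using virtualizarr:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n):
    # Pure-Python arithmetic holds the GIL, so threads cannot run it in parallel.
    total = 0
    for i in range(n):
        total += i * i
    return total

n = 500_000

t0 = time.perf_counter()
serial = [cpu_bound(n) for _ in range(4)]
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as ex:
    threaded = list(ex.map(cpu_bound, [n] * 4))
t_threaded = time.perf_counter() - t0

# Same results either way; for GIL-bound work t_threaded is typically
# close to t_serial rather than 4x faster.
assert serial == threaded
```

If the bottleneck is instead network I/O, threads usually do help, which is why the lack of any speedup here is suspicious.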

I was delighted to learn that flexible coordinates in xarray are fully operational.

Following on from this (unfinished) blog post on unnecessary netcdf longlat coordinates, I now have a neat xarray/netcdf-native fix for the problem I was discussing; it is a direct match to the "parse_coordinates": False example in the xarray indexes gallery.

Only a few weeks ago this was challenging for me to present to non-R audiences, so I'm very excited that we can now fix these broken grids in a way that is more accessible to Python communities. Degenerate coordinates represent a huge entropy problem in array metadata, and (I think) we need a coordinated effort to make it easy to assign a fix (like in GDAL with vrt://{dsn}?a_gt=c,a,b,f,d,e) and have that knowledge also feed back up to the providers to stem the flow of problematic coordinates.
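To make "degenerate coordinates" concrete, here is a minimal sketch (synthetic data, all names hypothetical): a dataset carrying 2D longitude/latitude arrays that encode nothing a pair of 1D coordinates wouldn't, and the plain-xarray repair of dropping them and assigning the 1D equivalents.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with degenerate 2D longitude/latitude coordinates:
# the grid is regular, so every row of longitude (and column of latitude)
# is identical.
ny, nx = 4, 5
lon_1d = np.linspace(100.0, 104.0, nx)
lat_1d = np.linspace(-40.0, -43.0, ny)
lon2d, lat2d = np.meshgrid(lon_1d, lat_1d)
ds = xr.Dataset(
    {"sst": (("y", "x"), np.random.rand(ny, nx))},
    coords={"longitude": (("y", "x"), lon2d), "latitude": (("y", "x"), lat2d)},
)

# Confirm the 2D coords are degenerate before discarding them.
assert (ds.longitude.values == ds.longitude.values[0, :]).all()
assert (ds.latitude.values == ds.latitude.values[:, [0]]).all()

# Drop the 2D arrays and attach the equivalent 1D coordinates.
fixed = (
    ds.drop_vars(["longitude", "latitude"])
      .assign_coords(x=("x", lon2d[0, :]), y=("y", lat2d[:, 0]))
      .rename(x="longitude", y="latitude")
)
```

The real cases are messier (affine transforms rather than linspace, float fuzz in the stored arrays), but the shape of the fix is the same.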

mk_mth <- function(x) {
  ## replace daily with month
  xx <- stringr::str_replace(x, "daily", "month")
  ## replace _YEAR with mth_YEAR
  stringr::str_replace(xx, "(_[0-9]{4})", "_mth\\1")
}
import xarray
ds = xarray.open_dataset(
    "s3://aodn-cloud-optimised/satellite_chlorophylla_oci_1day_aqua.zarr",
    engine="zarr", storage_options={"anon": True}, chunks={},
)

## then we can do stuff like this, parallelized nicely with dask
# mn = ds.sel(longitude=slice(109, 164), latitude=slice(-42, -48),
#             time=slice("2002-07-01", "2003-06-30")).groupby("time.month").mean()
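That commented-out line computes a monthly climatology. The same groupby pattern on a tiny synthetic daily series (names and values hypothetical, no network or dask needed):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the chlorophyll dataset: one value per day
# over the same July-to-June year used above.
time = pd.date_range("2002-07-01", "2003-06-30", freq="D")
da = xr.DataArray(
    np.arange(len(time), dtype=float),
    coords={"time": time}, dims="time", name="chl",
)

# Group all days by calendar month and average: 365 daily values
# collapse to 12 monthly means.
mn = da.groupby("time.month").mean()
```

On the real dataset the same call reduces over latitude/longitude chunks too, which is where dask earns its keep.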

In R just do

arrow::read_parquet("https://github.com/mdsumner/dryrun/raw/refs/heads/main/data-raw/noaa_oi_025_degree_daily_sst_avhrr.parquet")$url

In Python, as I understand it, GDAL can do the same:

import os

import rasterio

# configure GDAL's /vsis3/ access for the anonymous Pawsey object store
# (set these before opening the dataset)
os.environ["AWS_S3_ENDPOINT"] = "projects.pawsey.org.au"
os.environ["AWS_NO_SIGN_REQUEST"] = "YES"
os.environ["AWS_VIRTUAL_HOSTING"] = "YES"

dsn = "/vsis3/idea-10.7289-v5sq8xb5/www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc"

Install sooty

#install.packages("remotes")
remotes::install_cran("sooty")

Read the latest ice data we have for antarctica-amsr2-asi-s3125.