Michael Sumner (mdsumner)

  • Integrated Digital East Antarctica, Australian Antarctic Division
  • Hobart, Australia

/vsizip//vsicurl/https://github.com/wmgeolab/geoBoundaries/raw/main/releaseData/CGAZ/geoBoundariesCGAZ_ADM0.zip

/vsizip//vsicurl/https://listdata.thelist.tas.gov.au/opendata/data/LIST_PARCELS_HOBART.zip/list_parcels_hobart.shp

/vsizip//vsicurl/https://github.com/AustralianAntarcticDivision/rema.proc/raw/refs/heads/master/01_rock_classification/Medium_resolution_vector_polygons_of_Antarctic_rock_outcrop_-_VERSION_7.3.zip/Medium resolution vector polygons of Antarctic rock outcrop - VERSION 7.3/add_rock_outcrop_medium_res_polygon_v7.3.gpkg
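
These are plain GDAL data source names, so any of them can be opened directly. A minimal sketch, assuming the GDAL Python bindings are installed (the choice of the first path here is arbitrary):

from osgeo import ogr

## any of the /vsizip//vsicurl/ paths above is a valid OGR data source name
dsn = "/vsizip//vsicurl/https://github.com/wmgeolab/geoBoundaries/raw/main/releaseData/CGAZ/geoBoundariesCGAZ_ADM0.zip"
ds = ogr.Open(dsn)
layer = ds.GetLayer(0)
print(layer.GetName(), layer.GetFeatureCount())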

Please, I think I'm doing something wrong here. With open_virtual_mfdataset(..., parallel=False, ...) this runs in about 2 minutes.

With open_virtual_mfdataset(..., parallel=ThreadPoolExecutor, ...) it takes the same length of time (on a system with 32 CPUs).

import xarray as xr
from obstore.store import from_url
import time
from concurrent.futures import ThreadPoolExecutor
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
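
For reference, a minimal timing sketch of the comparison above, reusing the imports just shown. The URL list is a made-up placeholder, and I'm assuming parallel accepts an executor class as in the call described:

## placeholder URLs; substitute the real NetCDF sources
urls = [f"https://example.org/oisst/oisst-avhrr-v02r01.1981{m:02d}01.nc" for m in range(9, 13)]

t0 = time.time()
vds = open_virtual_mfdataset(urls, parallel=False)
print(f"serial:   {time.time() - t0:.1f}s")

t0 = time.time()
vds = open_virtual_mfdataset(urls, parallel=ThreadPoolExecutor)
print(f"threaded: {time.time() - t0:.1f}s")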

I was delighted to learn that flexible coordinates in xarray are fully operational.

Following on from this (unfinished) blog post on unnecessary netcdf longlat coordinates, I now have a neat xarray/netcdf-native fix for the problem I was discussing; it is a direct match to the "parse_coordinates": False example in the xarray indexes gallery.

Only a few weeks ago this was challenging for me to present to non-R audiences, so I'm very excited that we can now fix these broken grids in a way that is more accessible to Python communities. Degenerate coordinates represent a huge entropy problem in array metadata, and (I think) we need a coordinated effort to make assigning a fix easy (like in GDAL with vrt://{dsn}?a_gt=c,a,b,f,d,e, sketched below) and to feed that knowledge back up to the providers to stem the flow of problematic coordinates.
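
For concreteness, a sketch of that GDAL-side fix; the file name and the six geotransform coefficients here are placeholders for a real degenerate-coordinate grid:

from osgeo import gdal

## assign a clean affine geotransform in place of degenerate lon/lat arrays
## (coefficients in GDAL order: c, a, b, f, d, e)
ds = gdal.Open("vrt://input.nc?a_gt=-180,0.25,0,90,0,-0.25")
print(ds.GetGeoTransform())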

mk_mth <- function(x) {
  ## replace "daily" with "month"
  xx <- stringr::str_replace(x, "daily", "month")
  ## insert "_mth" before the first "_YYYY" year suffix
  stringr::str_replace(xx, "(_[0-9]{4})", "_mth\\1")
}
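## e.g. with a hypothetical daily path:
## mk_mth("oci_daily/chl_2002.nc")
## [1] "oci_month/chl_mth_2002.nc"
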
import xarray

ds = xarray.open_dataset(
    "s3://aodn-cloud-optimised/satellite_chlorophylla_oci_1day_aqua.zarr",
    engine="zarr", storage_options={"anon": True}, chunks={},
)

## then we can do stuff like this, parallelized nicely with dask
# mn = ds.sel(longitude=slice(109, 164), latitude=slice(-42, -48),
#             time=slice("2002-07-01", "2003-06-30")).groupby("time.month").mean()

In R, just do:

arrow::read_parquet("https://github.com/mdsumner/dryrun/raw/refs/heads/main/data-raw/noaa_oi_025_degree_daily_sst_avhrr.parquet")$url
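
The Python equivalent of that one-liner, assuming pandas with pyarrow and fsspec available, would be something like:

import pandas as pd

urls = pd.read_parquet("https://github.com/mdsumner/dryrun/raw/refs/heads/main/data-raw/noaa_oi_025_degree_daily_sst_avhrr.parquet")["url"]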

In Python, I understand GDAL can open one of those files directly:

import os

## configure GDAL's S3 access before opening: unsigned requests against the
## Pawsey object store rather than AWS proper
os.environ["AWS_S3_ENDPOINT"] = "projects.pawsey.org.au"
os.environ["AWS_NO_SIGN_REQUEST"] = "YES"
os.environ["AWS_VIRTUAL_HOSTING"] = "YES"

import rasterio

dsn = "/vsis3/idea-10.7289-v5sq8xb5/www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc"

## a NetCDF file exposes its variables as subdatasets
with rasterio.open(dsn) as src:
    print(src.subdatasets)