Michael Sumner (mdsumner)

  • Integrated Digital East Antarctica, Australian Antarctic Division
  • Hobart, Australia

This url has a netcdf:

"https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198110/oisst-avhrr-v02r01.19811002.nc"

It has correct georeferencing (bbox 0, -90, 360, 90) but doesn't have the CRS set ("EPSG:4326" is appropriate).

It also has multiple variables, which traditionally requires the verbose "subdataset syntax" for GDAL, shown below:
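The traditional form names the file and the variable explicitly; the OISST files carry sst, anom, err, and ice variables, so for sst:

NETCDF:"/vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198110/oisst-avhrr-v02r01.19811002.nc":sst

In recent GDAL (3.7+, if I recall correctly) the vrt:// connection can select the subdataset and assign the missing CRS in one go; sd_name and a_srs here are vrt:// query options, to the best of my understanding:

vrt:///vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198110/oisst-avhrr-v02r01.19811002.nc?sd_name=sst&a_srs=EPSG:4326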

With terra:

terra::vect("/vsicurl/https://github.com/mdsumner/geoboundaries/releases/download/latest/geoBoundariesCGAZ_ADM0.parquet")
 class       : SpatVector
 geometry    : polygons
 dimensions  : 218, 3  (geometries, attributes)
 extent      : -180, 180, -90, 83.63339  (xmin, xmax, ymin, ymax)
 source      : geoBoundariesCGAZ_ADM0.parquet

/vsizip//vsicurl/https://github.com/wmgeolab/geoBoundaries/raw/main/releaseData/CGAZ/geoBoundariesCGAZ_ADM0.zip

/vsizip//vsicurl/https://listdata.thelist.tas.gov.au/opendata/data/LIST_PARCELS_HOBART.zip/list_parcels_hobart.shp

/vsizip//vsicurl/https://github.com/AustralianAntarcticDivision/rema.proc/raw/refs/heads/master/01_rock_classification/Medium_resolution_vector_polygons_of_Antarctic_rock_outcrop_-_VERSION_7.3.zip/Medium resolution vector polygons of Antarctic rock outcrop - VERSION 7.3/add_rock_outcrop_medium_res_polygon_v7.3.gpkg
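These DSNs work anywhere GDAL does; a minimal sketch in Python (assuming a GDAL build with network support):

from osgeo import ogr

## /vsizip unpacks on the fly, /vsicurl streams over HTTP
dsn = "/vsizip//vsicurl/https://github.com/wmgeolab/geoBoundaries/raw/main/releaseData/CGAZ/geoBoundariesCGAZ_ADM0.zip"
ds = ogr.Open(dsn)
print(ds.GetLayer(0).GetFeatureCount())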

Please, I think I'm doing something wrong here. With open_virtual_mfdataset( , parallel = False, ) this runs in about 2 minutes.

With open_virtual_mfdataset( , parallel = ThreadPoolExecutor, ) it takes the same length of time (on a system with 32 CPUs).

import xarray as xr
from obstore.store import from_url
import time
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
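The comparison I'm timing looks roughly like this, continuing from the imports above (a sketch only: the date range is illustrative, and any registry/parser arguments your virtualizarr version needs are omitted):

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

## one OISST file per day, following the url pattern above
days = pd.date_range("1981-10-01", "1981-12-31", freq="D")
urls = [
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/"
    f"v2.1/access/avhrr/{d:%Y%m}/oisst-avhrr-v02r01.{d:%Y%m%d}.nc"
    for d in days
]

t0 = time.time()
vds = open_virtual_mfdataset(urls, parallel=False)
print("serial:", time.time() - t0)

t0 = time.time()
vds = open_virtual_mfdataset(urls, parallel=ThreadPoolExecutor)
print("threaded:", time.time() - t0)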

I was delighted to learn that flexible coordinates in xarray are fully operational.

Following on from this (unfinished) blog post on unnecessary netcdf longlat coordinates, I now have a neat xarray/netcdf-native fix for the problem I was discussing; it is a direct match to the "parse_coordinates": False example in the xarray indexes gallery.

Only a few weeks ago this was challenging for me to present to non-R audiences, so I'm very excited that we can now fix these broken grids in a way that is more accessible to Python communities. Degenerate coordinates represent a huge entropy problem in array metadata, and (I think) we need a coordinated effort to be able to easily assign a fix (like in GDAL with vrt://{dsn}?a_gt=c,a,b,f,d,e) and have that knowledge feed back up to the providers, to stem the flow of problematic coordinates.
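As a sketch of the kind of fix I mean, here is the plain-xarray equivalent for the OISST file above: declare exact regular cell-centre coordinates from the known bbox (0, -90, 360, 90) rather than trusting the stored coordinate arrays (the local filename and cell-centre registration are assumptions here):

import numpy as np
import xarray as xr

ds = xr.open_dataset("oisst-avhrr-v02r01.19811002.nc")
nx, ny = ds.sizes["lon"], ds.sizes["lat"]

## exact regular coords, half a cell in from each edge
ds = ds.assign_coords(
    lon=("lon", np.linspace(0, 360, nx, endpoint=False) + 180.0 / nx),
    lat=("lat", np.linspace(-90, 90, ny, endpoint=False) + 90.0 / ny),
)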

mk_mth <- function(x) {
  ## replace daily with month
  xx <- stringr::str_replace(x, "daily", "month")
  ## replace _YEAR with _mth_YEAR
  stringr::str_replace(xx, "(_[0-9]{4})", "_mth\\1")
}
import xarray
ds = xarray.open_dataset("s3://aodn-cloud-optimised/satellite_chlorophylla_oci_1day_aqua.zarr", 
                         engine = "zarr", storage_options = {"anon": True}, chunks = {})

## then we can do stuff like this, parallelized nicely with dask
#mn = ds.sel(longitude = slice(109, 164), latitude = slice(-42, -48), time = slice("2002-07-01", "2003-06-30")).groupby("time.month").mean()
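To actually run that (sel/groupby only build a lazy graph), dask does the chunked reduction at compute time; note the latitude slice runs north to south, which assumes a descending latitude coordinate in this dataset:

mn = ds.sel(longitude=slice(109, 164), latitude=slice(-42, -48),
            time=slice("2002-07-01", "2003-06-30")).groupby("time.month").mean()
mn = mn.compute()  ## materialize the monthly means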

In R, just do:

arrow::read_parquet("https://github.com/mdsumner/dryrun/raw/refs/heads/main/data-raw/noaa_oi_025_degree_daily_sst_avhrr.parquet")$url

In Python, I understand GDAL can read that parquet directly too; a minimal sketch with ogr follows (assuming a GDAL build that includes the Parquet driver):

from osgeo import ogr
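
ds = ogr.Open("/vsicurl/https://github.com/mdsumner/dryrun/raw/refs/heads/main/data-raw/noaa_oi_025_degree_daily_sst_avhrr.parquet")
layer = ds.GetLayer(0)
urls = [feature.GetField("url") for feature in layer]  ## the same "url" column as the R example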