Please, I think I'm doing something wrong here. With open_virtual_mfdataset(…, parallel=False) this runs in about 2 minutes.
With open_virtual_mfdataset(…, parallel=ThreadPoolExecutor) it takes the same length of time (on a system with 32 CPUs).
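One possible explanation (a guess, not taken from virtualizarr internals): if the per-file work is CPU-bound pure Python, the GIL stops ThreadPoolExecutor from giving any speedup; threads only help when the time is spent in I/O or in C code that releases the GIL. A minimal, self-contained sketch of that effect, unrelated to virtualizarr's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_one(i):
    # stand-in for CPU-bound, pure-Python per-file parsing work
    return sum(j * j for j in range(10_000)) + i

serial = [parse_one(i) for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(parse_one, range(8)))

# identical results; wall-clock time is roughly the same because the GIL
# serialises pure-Python bytecode across threads
assert serial == threaded
```

If the parsing instead spends its time waiting on object-store reads, a thread pool should show a real speedup, so the flat timing may be worth reporting upstream.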
import xarray as xr
from obstore.store import from_url
import time
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset

I was delighted to learn that flexible coordinates in xarray are fully operational.
Following on from this (unfinished) blog post on unnecessary netcdf longlat coordinates, I now have a neat xarray/netcdf-native fix for the problem I was discussing; it is a direct match to the "parse_coordinates": False example in the xarray indexes gallery.
Only a few weeks ago this was challenging for me to present to non-R audiences, so I'm very excited that we can now fix these broken grids in a way that is more accessible to Python communities. Degenerate coordinates represent a huge entropy problem in array metadata, and (I think) we need a coordinated effort to make it easy to assign a fix
(like in GDAL with vrt://{dsn}?a_gt=c,a,b,f,d,e) and have that knowledge also feed back up to the providers to stem the flow of problematic coordinates.
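As a sketch of what "assigning a fix" could look like on the xarray side, here is a hypothetical example (the geotransform values, grid size, and variable name are all made up) that builds 1D cell-centre coordinates from the six geotransform parameters in GDAL's a_gt order, instead of carrying degenerate 2D longlat arrays:

```python
import numpy as np
import xarray as xr

# six geotransform parameters in GDAL's a_gt order: c, a, b, f, d, e
# (x origin, x step, x shear, y origin, y shear, y step) -- made-up values
c, a, b, f, d, e = 0.0, 0.25, 0.0, 90.0, 0.0, -0.25
ny, nx = 720, 1440

# 1D cell-centre coordinates derived from the transform (assumes b == d == 0,
# i.e. a north-up, non-rotated grid)
x = c + a * (np.arange(nx) + 0.5)
y = f + e * (np.arange(ny) + 0.5)

ds = xr.Dataset(
    {"sst": (("y", "x"), np.zeros((ny, nx)))},
    coords={"x": ("x", x), "y": ("y", y)},
)
```

The point is that six numbers fully describe a regular grid, so the fix is tiny compared to the 2D coordinate arrays it replaces.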
mk_mth <- function(x) {
## replace daily with month
xx <- stringr::str_replace(x, "daily", "month")
## replace _YEAR with mth_YEAR
stringr::str_replace(xx, "(_[0-9]{4})", "_mth\\1")
}

import xarray
ds = xarray.open_dataset("s3://aodn-cloud-optimised/satellite_chlorophylla_oci_1day_aqua.zarr",
engine = "zarr", storage_options = {"anon": True}, chunks = {})
## then we can do stuff like this, parallelized nicely with dask
# mn = ds.sel(longitude = slice(109, 164), latitude = slice(-42, -48), time = slice("2002-07-01", "2003-06-30")).groupby("time.month").mean()

https://sciences.social/@mortenfrisch/115228046285688665
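The ds.sel(...).groupby("time.month").mean() pipeline shown in that comment can be exercised on synthetic data to show the shape of the result. This sketch uses random values and a made-up grid (with ascending latitude, so the latitude slice is reversed relative to the original) but follows the same select → group by month → mean pattern:

```python
import numpy as np
import pandas as pd
import xarray as xr

# synthetic stand-in for the chlorophyll dataset: one year of daily values
time = pd.date_range("2002-07-01", "2003-06-30", freq="D")
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"chl": (("time", "latitude", "longitude"),
             rng.random((time.size, 4, 6)))},
    coords={
        "time": time,
        "latitude": np.linspace(-48, -42, 4),   # ascending, so slice(-48, -42)
        "longitude": np.linspace(109, 164, 6),
    },
)

mn = (ds.sel(longitude=slice(109, 164), latitude=slice(-48, -42),
             time=slice("2002-07-01", "2003-06-30"))
        .groupby("time.month").mean())
# mn gains a "month" dimension; a July-to-June year covers all 12 months
```

With the real zarr store opened with chunks={}, the same expression runs lazily and in parallel under dask.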
library(terra)
#> terra 1.8.64
xysize <- 100
nr <- 10
nc <- 20
r <- rast(ext(0, nc * xysize, 0, nr * xysize), res = xysize/2)
## xmin, xmax, ymin, ymax (the outer edges of the left, right, bottom, top cells)

In R just do
arrow::read_parquet("https://github.com/mdsumner/dryrun/raw/refs/heads/main/data-raw/noaa_oi_025_degree_daily_sst_avhrr.parquet")$url

In Python, as I understand it, GDAL works like this:
from osgeo import ogr
import os
import rasterio
dsn = "/vsis3/idea-10.7289-v5sq8xb5/www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc"
os.environ["AWS_S3_ENDPOINT"] = "projects.pawsey.org.au"
os.environ["AWS_NO_SIGN_REQUEST"] = "YES"
os.environ["AWS_VIRTUAL_HOSTING"] = "YES"

All working from here: https://virtualizarr.readthedocs.io/en/stable/usage.html (with the HTTP tab)
## on the docker image
## reticulate::use_python("/workenv/bin/python3.12")
library(bowerbird)
bl <- bb_source(

Install sooty:
#install.packages("remotes")
remotes::install_cran("sooty")

Read the latest ice data we have for antarctica-amsr2-asi-s3125.