Skip to content

Instantly share code, notes, and snippets.

View mdsumner's full-sized avatar

Michael Sumner mdsumner

  • Integrated Digital East Antarctica, Australian Antarctic Division
  • Hobart, Australia
View GitHub Profile

With R, this is just proving to myself that I can write all the local-path references.

Ran on 128 cores on HPC it did about 3200 of these files (it's still going, there are 8000+ days since 2002 July 01. Virtualizing netcdf is insanely efficient, but the details of how to auth those remote links still beyond me (which is what I'm working towards understanding). The scratch path below on my lustre storage would be replace by the base url and its child date structure:

https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1

They look like this here without the shard structure.

library(terra)
a <- vect(cbind(c(0, 1), c(2, 3, 4)), type = "l")

## imagine we have an independently specificed raster, and this line falls in it somewhere
r <- extend(rast(a, res = .05), 1)

(cell <- cells(r, a))

##so we can do 

On this site is a lot of ZARR datasets, ARCO (analysis ready cloud optimized).

https://catalog.leap.columbia.edu/

There are example codes that show how to open in xarray. We can do this is in R by loading up the GDAL api via the Python osgeo.gdal package.

The new version of reticulate makes this easy to do.

Create a 2x resolution GEBCO Zarr, I don't think this is well parallelized maybe because rioxarray doesn't distribute the reads amongst multiple open datasets?

import xarray
dsn = "/vsicurl/https://projects.pawsey.org.au/idea-gebco-tif/GEBCO_2024.tif"
ds = xarray.open_dataset(f'vrt://{dsn}?outsize=200%,200%', engine  = "rasterio", chunks = {"y":1024, "x": 1024})
ds.to_zarr("gebco2x.zarr", zarr_format = 3)
https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip
library(reticulate)
py_require("gdal")
gdal <- import("osgeo.gdal")
gdal$UseExceptions()

f <- "/vsigs/gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
Sys.setenv("GS_NO_SIGN_REQUEST" = "YES")
AWS_REGION='region-is-not-set' al repo create AustralianAntarcticDivision/myrepo
⠴ Creating repo AustralianAntarcticDivision/myrepo...
✓ Creating repo AustralianAntarcticDivision/myrepo...failed
  [31m×[0m error listing objects in object store dispatch failure
  [31m│[0m
  [31m│[0m context:
  [31m│[0m    0: icechunk::repository::create

https://rstats.me/@mdsumner/114103904777240036

I've used HDF5 here because NETCDF can't do this on windows. On linux probably best to use "NETCDF:{dsn}:time".

(r <- terra::rast("HDF5:\"/vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc\"://time"))
class       : SpatRaster 
dimensions  : 1, 1, 1  (nrow, ncol, nlyr)
resolution : 1, 1 (x, y)