@mdsumner
Last active October 10, 2025 09:32

Please, I think I'm doing something wrong here. With open_virtual_mfdataset(..., parallel=False) this runs in about 2 minutes.

With open_virtual_mfdataset(..., parallel=ThreadPoolExecutor) it takes the same length of time (on a system with 32 CPUs).

import xarray as xr
from obstore.store import from_url
import time
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry
from concurrent.futures import ThreadPoolExecutor
bucket = "s3://idea-10.7289-v5sq8xb5"
path = ["www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810902.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810903.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810904.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810905.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810906.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810907.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810908.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810909.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810910.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810911.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810912.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810913.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810914.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810915.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810916.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810917.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810918.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810919.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810920.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810921.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810922.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810923.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810924.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810925.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810926.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810927.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810928.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810929.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810930.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198110/oisst-avhrr-v02r01.19811001.nc", 
"www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198110/oisst-avhrr-v02r01.19811002.nc"]

url = [f"{bucket}/{p}" for p in path]
store = from_url(bucket, region="", endpoint = "https://projects.pawsey.org.au", skip_signature=True)
registry = ObjectStoreRegistry({bucket: store})
parser  = HDFParser()

s = time.time()

mds = open_virtual_mfdataset(
    url,
    parser=parser,
    registry=registry,
    combine="nested",
    concat_dim="time",
    parallel = ThreadPoolExecutor, ##  or False, doesn't change anything timing wise
    drop_variables = ["zlev"],
    loadable_variables = ["lon", "lat", "time"]
  )
e = time.time()
print(e - s)

I’ve tried using ThreadPoolExecutor directly in all kinds of ways, so perhaps concurrent.futures is not a valid option in VirtualiZarr, or perhaps it is a placeholder for a GIL-less future? That’s been my suspicion for a while; I just assumed I was doing it wrong. (As I said, I’ve tried picking it apart from many directions.)

As you can see lower down, I can get much faster performance from a deployment that is not subject to a single GIL (it’s still Python doing the work in each worker process, but I’m more limited in how I can collate the results).
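One way to check whether the thread pool itself is at fault is to time it on two toy workloads. The sketch below is not VirtualiZarr code: io_task, cpu_task, and timed are made-up names, and time.sleep stands in for waiting on a network read. On a standard CPython build with the GIL, threads overlap while they are blocked waiting but take turns while executing Python bytecode, so the first workload should speed up and the second typically should not.

```python
import time
from concurrent.futures import ThreadPoolExecutor

N_TASKS = 8  # number of toy "files" to process


def io_task(_):
    # Stand-in for a blocking read: sleeping releases the GIL.
    time.sleep(0.1)
    return 1


def cpu_task(_):
    # Stand-in for pure-Python parsing: holds the GIL while it runs.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


def timed(fn, workers):
    """Run fn over N_TASKS inputs, serially or in a thread pool; return (seconds, results)."""
    start = time.perf_counter()
    if workers == 1:
        results = [fn(i) for i in range(N_TASKS)]
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(fn, range(N_TASKS)))
    return time.perf_counter() - start, results


io_serial, _ = timed(io_task, 1)
io_threaded, io_results = timed(io_task, N_TASKS)
cpu_serial, _ = timed(cpu_task, 1)
cpu_threaded, _ = timed(cpu_task, N_TASKS)

print(f"I/O-bound: serial {io_serial:.2f}s, threaded {io_threaded:.2f}s")
print(f"CPU-bound: serial {cpu_serial:.2f}s, threaded {cpu_threaded:.2f}s")
```

If the per-file work in open_virtual_mfdataset behaves more like cpu_task than io_task, flat timings under ThreadPoolExecutor would be expected. That is only a hypothesis; it has not been verified here for the HDF parser.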

@mdsumner

Now the same thing, but run from R. This is an async problem: it's the same Python code, but R's mirai handles the parallelization. It's about 8x faster.

library(reticulate)

file_url <- c('file:///g/data/gb6/BRAN/BRAN2023/daily/ocean_salt_2023_06.nc', 'file:///g/data/gb6/BRAN/BRAN2023/daily/ocean_salt_2023_07.nc', 'file:///g/data/gb6/BRAN/BRAN2023/daily/ocean_salt_2023_08.nc', 'file:///g/data/gb6/BRAN/BRAN2023/daily/ocean_salt_2023_09.nc', 'file:///g/data/gb6/BRAN/BRAN2023/daily/ocean_salt_2023_10.nc', 'file:///g/data/gb6/BRAN/BRAN2023/daily/ocean_salt_2023_11.nc', 'file:///g/data/gb6/BRAN/BRAN2023/daily/ocean_salt_2023_12.nc')

library(mirai)
library(purrr)
n <- 32
daemons(n)
fun <- in_parallel(function(.file_url) {
  reticulate::use_python("/workenv/bin/python3")
  virtualizarr <- reticulate::import("virtualizarr")
  parser <- virtualizarr$parsers$HDFParser()
  
  bucket <- "s3://idea-10.7289-v5sq8xb5"
  obstore <- reticulate::import("obstore")
  store = obstore$store$from_url(bucket, region="", endpoint = "https://projects.pawsey.org.au", skip_signature=TRUE)
  
  registry <- virtualizarr$registry$ObjectStoreRegistry(setNames(list(store), bucket))
  
  loadvars = c("lon", "lat", "time")
  dropvars = c("zlev")
  
  
  ds = virtualizarr$open_virtual_dataset(
    .file_url,
    parser=parser,
    registry=registry,
    loadable_variables = loadvars, 
    drop_variables = as.list(dropvars))
  
  dill <- reticulate::import("dill")
  bytes <- dill$dumps(ds)
  as.raw(bytes)
})


path <- c("www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810902.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810903.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810904.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810905.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810906.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810907.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810908.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810909.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810910.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810911.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810912.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810913.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810914.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810915.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810916.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810917.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810918.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810919.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810920.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810921.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810922.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810923.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810924.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810925.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810926.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810927.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810928.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810929.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810930.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198110/oisst-avhrr-v02r01.19811001.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198110/oisst-avhrr-v02r01.19811002.nc")

bucket <- "s3://idea-10.7289-v5sq8xb5"
url <- sprintf("%s/%s", bucket, path)
#fun(url[1])  ## we get bytes, that we can dill$loads() to a nice xarray again

system.time({
  x <- map(url, fun)
})
# user  system elapsed
# 0.007   0.012  14.457

## then loads and concat actually takes no time
dill <- import("dill")
lds <- lapply(x, dill$loads)
xarray <- import("xarray")
xarray$concat(lds, dim = "time")


/workenv/lib/python3.12/site-packages/zarr/codecs/numcodecs/_codecs.py:139: ZarrUserWarning: Numcodecs codecs are not in the Zarr version 3 specification and may not be supported by other zarr implementations.
  super().__init__(**codec_config)
<xarray.Dataset> Size: 265MB
Dimensions:  (time: 32, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * lat      (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * time     (time) datetime64[ns] 256B 1981-09-01T12:00:00 ... 1981-10-02T12...
Dimensions without coordinates: zlev
Data variables:
    anom     (time, zlev, lat, lon) int16 66MB ManifestArray<shape=(32, 1, 72...
    err      (time, zlev, lat, lon) int16 66MB ManifestArray<shape=(32, 1, 72...
    ice      (time, zlev, lat, lon) int16 66MB ManifestArray<shape=(32, 1, 72...
    sst      (time, zlev, lat, lon) int16 66MB ManifestArray<shape=(32, 1, 72...
Attributes: (12/37)
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.19810901.nc
    naming_authority:           gov.noaa.ncei
    summary:                    NOAAs 1/4-degree Daily Optimum Interpolation ...
    cdm_data_type:              Grid
    ...                         ...
    metadata_link:              https://doi.org/10.25921/RE9P-PT57
    ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    sensor:                     Thermometer, AVHRR
    Conventions:                CF-1.6, ACDD-1.3
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...

@mdsumner

mdsumner commented Oct 8, 2025

UGH!!! For dask you have to create a Client first, and it then automagically gets used. If you don't, there's no message or anything ...

import xarray as xr
from obstore.store import from_url
import time
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry
from concurrent.futures import ThreadPoolExecutor
from distributed import Client

client = Client(n_workers=30)
client



bucket = "s3://idea-10.7289-v5sq8xb5"
path = ["www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810902.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810903.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810904.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810905.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810906.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810907.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810908.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810909.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810910.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810911.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810912.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810913.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810914.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810915.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810916.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810917.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810918.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810919.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810920.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810921.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810922.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810923.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810924.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810925.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810926.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810927.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810928.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810929.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810930.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198110/oisst-avhrr-v02r01.19811001.nc", 
        "www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198110/oisst-avhrr-v02r01.19811002.nc"]

url = [f"{bucket}/{p}" for p in path]
store = from_url(bucket, region="", endpoint = "https://projects.pawsey.org.au", skip_signature=True)
registry = ObjectStoreRegistry({bucket: store})
parser  = HDFParser()

s = time.time()

mds = open_virtual_mfdataset(
  url,
  parser=parser,
  registry=registry,
  combine="nested",
  concat_dim="time",
  parallel = "dask", ##  relies on the Client created above; without one there is no speedup and no warning
  drop_variables = ["zlev"],
  loadable_variables = ["lon", "lat", "time"]
)
e = time.time()
print(e - s)

## 18 SECONDS hurrah

ffs

@mdsumner

stuff I probably won't share on the forum

Thanks all. Primarily I was trying to figure out why I couldn’t get any parallelism out of VirtualiZarr with concurrent.futures.

I don’t know the full answer, but it seems ineffective and I’ve given up on it. On the various systems I use, it works with dask, and it works with {mirai} in exactly the use case I want it for (virtualizing NetCDF from disk or THREDDS). I don’t know how to “bypass the GIL” in this context, and I can’t find any examples that do that with concurrent.futures.

(With dask you can’t just say parallel="dask" in open_virtual_mfdataset; you are responsible for setting up the Client() beforehand, even though it has no apparent connection to enabling the jobs. This seems to be an assumed idiom for some, just FYI for anyone reading along.) With mirai, set up the in_parallel function, set daemons(ncpus), and run it with map() or any of the map family in {purrr}. I use dill.dumps/loads to serialize across the boundary, then concat.
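The same shape can be sketched in Python alone, assuming the point of mirai here is that each daemon is a separate process with its own interpreter. In this sketch each file is handled by a fresh Python process, the result comes back as bytes, and the parent deserializes and collects the results in input order. Everything in it is a stand-in: WORKER, run_worker, and virtualize_all are hypothetical names, the worker builds a small dict where the real one would call open_virtual_dataset, the URLs are made up, and the standard-library pickle is used in place of dill.

```python
import pickle
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker script: runs in its own interpreter (so its own GIL),
# builds a result for one input, and writes it to stdout as pickled bytes.
WORKER = """
import pickle, sys
url = sys.argv[1]
result = {"url": url, "n_chars": len(url)}  # placeholder for a virtual dataset
sys.stdout.buffer.write(pickle.dumps(result))
"""


def run_worker(url):
    """Launch one child interpreter for one URL and return its raw bytes."""
    done = subprocess.run(
        [sys.executable, "-c", WORKER, url],
        capture_output=True,
        check=True,
    )
    return done.stdout


def virtualize_all(urls, max_workers=4):
    """Run the workers concurrently; deserialize the results in input order."""
    # The threads only wait on child processes, which does not hold the GIL.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        blobs = list(pool.map(run_worker, urls))
    return [pickle.loads(blob) for blob in blobs]


# Made-up URLs, for illustration only.
demo_urls = [f"s3://example-bucket/file_{i}.nc" for i in range(4)]
results = virtualize_all(demo_urls)
print([r["url"] for r in results])
```

The list returned by virtualize_all plays the role of lds in the R code above, ready to be concatenated. Starting an interpreter per file has a fixed cost, so a long-lived pool of workers (as mirai's daemons provide) would likely be preferable for many small files.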

I don’t have clean examples to point to while I’m still working on the read side of the project, but I’m happy to help anyone interested. In time I’ll have explainers and examples. The virtualizing step works really well, driven by R or Python, in situ with NetCDF on disk or remotely with THREDDS NetCDF, but I can’t get reliable performance for the subsequent reads yet (here it seems that with some tools at least I need to disable parallel, so that THREDDS stays on the line - fun).

If anyone can find an example where ThreadPoolExecutor is deployed effectively for reading from files, I’d love to see it.
