Created
September 23, 2022 15:27
icechunk1
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "c4a2f7a2", | |
"metadata": {}, | |
"source": [ | |
"### Iceberg + kerchunk =\n", | |
"\n", | |
"# IceChunk\n", | |
"\n", | |
"Kerchunk is ... https://fsspec.github.io/kerchunk/\n", | |
"\n", | |
"Apache Iceberg provides versioned Parquet datasets, built from immutable files and \"manifest\" listings." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "eafbe9d9", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import fsspec\n", | |
"import xarray as xr\n", | |
"import zarr" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "6986acb7", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# s3://testfred/gridS.tar --endpoint-url https://object-store.cloud.muni.cz 20GB of netCDF4\n", | |
"s3 = {\n", | |
" \"anon\": True,\n", | |
" \"client_kwargs\": {\"endpoint_url\": \"https://object-store.cloud.muni.cz\"}\n", | |
"}" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "fcafdb6c", | |
"metadata": {}, | |
"source": [ | |
"### Innovation 1\n", | |
"Index multiple files within a single remote TAR\n", | |
"\n", | |
"- can open a file-like object on the remote store for scanning\n", | |
"- can find the offset of each member file" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "eabbc318", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"with fsspec.open(\"tar://SEDNA-DELTA_y2014m01d01.1d_gridS.nc::s3://testfred/gridS.tar\", s3=s3) as f:\n", | |
" print(f.read(4))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "98635113", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import tarfile\n", | |
"# scan the TAR headers once, recording the byte offset where each member's data starts\n", | |
"with fsspec.open(\"s3://testfred/gridS.tar\", **s3) as tf:\n", | |
"    tar = tarfile.TarFile(fileobj=tf)\n", | |
"    offsets = {ti.name: ti.offset_data for ti in tar.getmembers()}\n", | |
"offsets" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "5f3cf694", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"fs = fsspec.filesystem(\"reference\", fo=\"gridS.json\", remote_options=s3)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "866c30c4", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# one file\n", | |
"{l[0] for l in fs.references.values() if isinstance(l, list)}" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "3e0f4e1c", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"g = zarr.open_group(fs.get_mapper())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "6c978185", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"g.vosaline.nbytes / 2**30 # apparent in-memory size" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "4d3d3565", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# 30180 * 32k chunks\n", | |
"g.vosaline.chunks, g.vosaline.shape" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "7d6806f9", | |
"metadata": {}, | |
"source": [ | |
"Data loads concurrently, unlike with the TAR driver" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "92f58026", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"g.vosaline[:, 0, 3000, 3000], g.vosaline[:, 0, 3600, 3001]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "d78077ee", | |
"metadata": {}, | |
"source": [ | |
"### Innovation 2\n", | |
"\n", | |
"Let's edit it! I do **not** have write access to the remote store." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "ce325d31", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"g.vosaline[:, 0, 3000, 3000] += 1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "7e3f26e2", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# created local file and updated references\n", | |
"fs.references[\"vosaline/0.0.230.0\"]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "c489081c", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# save modified refs\n", | |
"fs.save_json(\"gridS-mod.json\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "13f0cf7a", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"fs2 = fsspec.filesystem(\n", | |
" \"reference\", fo=\"gridS-mod.json\", \n", | |
" fss={\n", | |
" \"s3\": fsspec.filesystem(\"s3\", **s3),\n", | |
" \"file\": fsspec.filesystem(\"file\")\n", | |
" }\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "7705cf27", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# now we refer to one remote and four local files\n", | |
"{l[0] for l in fs2.references.values() if isinstance(l, list)}" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "8324a088", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"g2 = zarr.open_group(fs2.get_mapper())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "b01dc927", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# access new and original data side by side\n", | |
"g2.vosaline[:, 0, 3000, 3000], g2.vosaline[:, 0, 3600, 3001]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "880b4dd2", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# fresh instance built from the original references, for comparison with the modified set\n", | |
"fs = fsspec.filesystem(\"reference\", fo=\"gridS.json\", remote_options=s3, skip_instance_cache=True)\n", | |
"g = zarr.open_group(fs.get_mapper())\n", | |
"g.vosaline[:, 0, 3000, 3000], g.vosaline[:, 0, 3600, 3001]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "2a1bd324", | |
"metadata": {}, | |
"source": [ | |
"Put it in an Intake catalog, and you have checkpoint/versioned data." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "b043d188", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.9.13" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
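The offset scan in "Innovation 1" can be tried without network access. The following is only a minimal sketch, not the notebook's actual workflow: it builds a small TAR in memory, reads each member's `offset_data` and `size` with the standard-library `tarfile` module, and stores them as `[url, offset, size]` triples in the style of kerchunk references. The helper names (`make_example_tar`, `tar_member_references`), the member names and the URL `s3://example-bucket/example.tar` are all made up for illustration.

```python
import io
import tarfile


def make_example_tar(members):
    """Build an uncompressed TAR in memory from {name: bytes} (illustrative helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def tar_member_references(tar_bytes, url):
    """Map each regular member to [url, offset, size] (hypothetical helper).

    The url is only a label here; nothing is fetched from it.
    """
    refs = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for ti in tar.getmembers():
            if ti.isfile():
                # offset_data is where the member's payload starts in the archive
                refs[ti.name] = [url, ti.offset_data, ti.size]
    return refs


members = {
    "a.nc": b"\x89HDF first file",
    "b.nc": b"\x89HDF second, longer file",
}
tar_bytes = make_example_tar(members)
refs = tar_member_references(tar_bytes, "s3://example-bucket/example.tar")

# Each member can now be read as a plain byte range of the archive,
# without walking the TAR headers again.
for name, (url, offset, size) in refs.items():
    assert tar_bytes[offset:offset + size] == members[name]
print(refs)
```

Because every member reduces to a byte range of one object, a reader can request many ranges at once instead of seeking through the archive sequentially, which is the property the notebook relies on when it notes that data loads concurrently.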
Martin, I'm glad to see you digging into this problem.
As I mentioned over email, @jhamman and I are currently carefully writing a detailed design document around this idea. We will be sharing this within a few weeks. Until that is done, we will probably refrain from diving into much coding. I appreciate your patience.
This was not a call to action, just a POC for myself to see how simple it could be. The finished UX will be far more comprehensive! I don't intend to do more on this before seeing your architecture design.
Requires https://github.com/martindurant/filesystem_spec/tree/icy
Note that the two datasets could have been put into an intake catalog for loading with xarray. Xarray does not support updating part of a dataset, however.
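The partial update that the notebook performs through zarr can be pictured as copy-on-write on the reference set. What follows is a rough sketch under stated assumptions, not the code in the `icy` branch: references are plain `[path, offset, length]` triples, a local file `base.bin` stands in for the read-only remote object, and `read_ref` and `write_ref` are hypothetical helpers.

```python
import os
import tempfile


def read_ref(refs, key):
    """Resolve one reference, [path, offset, length], to bytes (hypothetical helper)."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def write_ref(refs, key, data, directory):
    """Write new bytes to a fresh local file and return a reference set pointing at it.

    The base file is never touched, and the input reference set is not mutated.
    """
    path = os.path.join(directory, key.replace("/", "_"))
    with open(path, "wb") as f:
        f.write(data)
    new_refs = dict(refs)
    new_refs[key] = [path, 0, len(data)]
    return new_refs


workdir = tempfile.mkdtemp()
base = os.path.join(workdir, "base.bin")  # stand-in for the read-only remote object
with open(base, "wb") as f:
    f.write(b"AAAABBBBCCCC")

original = {
    "var/0": [base, 0, 4],
    "var/1": [base, 4, 4],
    "var/2": [base, 8, 4],
}
modified = write_ref(original, "var/1", b"bbbb", workdir)

print(read_ref(original, "var/1"))  # still served from the base file
print(read_ref(modified, "var/1"))  # served from the new local file
print(read_ref(modified, "var/2"))  # untouched chunk, still from the base file
```

Both reference sets stay valid side by side, so keeping the old and new sets amounts to keeping two versions of the dataset while storing only the changed chunks.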
cc @rabernat