Skip to content

Instantly share code, notes, and snippets.

@betolink
Last active August 28, 2023 20:57
Show Gist options
  • Select an option

  • Save betolink/4f3fc3c0b04286ece8434d85dab0b932 to your computer and use it in GitHub Desktop.

Select an option

Save betolink/4f3fc3c0b04286ece8434d85dab0b932 to your computer and use it in GitHub Desktop.
earthaccess-kerchunk-store
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"id": "9428b756-1f62-43ae-8ecf-34b361bfcbca",
"metadata": {},
"source": [
"## On-demand Zarr Stores for NASA datasets with `earthaccess` and `Kerchunk`\n",
"\n",
"<span><img src=\"https://user-images.githubusercontent.com/717735/263752096-d5af4c79-dd6f-47fb-9036-75202ad56650.png\" width=\"100px!\"/><img src=\"https://raw.githubusercontent.com/fsspec/kerchunk/main/kerchunk.png\" width=\"200px\"/></span>\n",
"\n",
"\n",
"\n",
"The idea behind this [PR](https://github.com/nsidc/earthaccess/pull/278) from James Borbeau on *earthaccess* is that we can combine *earthaccess*, the power of Dask and kerchunk to create consolidated refenrece files (zarr compatible) from NASA datasets. This method works best with gridded data as they can be combined by time using the same grid. \n",
"\n",
"Notes:\n",
"* Looks like the resulting consolidated store has coordinate encoding issues for some datasets, as this [study form the HDF Group](https://github.com/hyoklee/kerchunk) notes, Kerchunk is still on an early phase and doesn't support all the features of HDF5. \n",
"* Lucas Sterzinger notes that further optimizations are possible for big datasets.\n",
"* Having a distributed cluster means that we could scale trhis approach and create on-demand Zarr views of NASA data.\n",
"A more detailed description of what Kerchunk buys us can be found on this [notebook from Lucas](https://github.com/lsterzinger/2022-esip-kerchunk-tutorial/blob/main/01-Create_References.ipynb).\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3b73b640-8287-40fb-8eec-91359acf307d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%capture\n",
"!pip uninstall -y earthaccess\n",
"!pip install git+https://github.com/jrbourbeau/earthaccess.git@kerchunk"
]
},
{
"cell_type": "markdown",
"id": "3f2ba565-bc68-4d5e-8258-5b6349c02689",
"metadata": {},
"source": [
"### Example with SSTS, gridded global NetCDF \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36e434b5-bb7d-42e0-b999-9da7734ce204",
"metadata": {},
"outputs": [],
"source": [
"import earthaccess\n",
"import xarray as xr\n",
"from dask.distributed import LocalCluster\n",
"\n",
"if __name__ == \"__main__\":\n",
"\n",
" # Authenticate my machine with `earthaccess`\n",
" earthaccess.login()\n",
"\n",
" # Retrieve data files for the dataset I'm interested in\n",
" short_name = \"SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205\"\n",
" granuales = earthaccess.search_data(\n",
" short_name=short_name,\n",
" cloud_hosted=True,\n",
" temporal=(\"1990\", \"2019\"),\n",
" count=10, # For demo purposes\n",
" )\n",
"\n",
" # Create a local Dask cluster for parallel metadata consolidation\n",
" # (but works with any Dask cluster)\n",
" cluster = LocalCluster()\n",
" client = cluster.get_client()\n",
"\n",
" # Save consolidated metdata file\n",
" outfile = earthaccess.consolidate_metadata(\n",
" granuales,\n",
" outfile=f\"./{short_name}-metadata.json\", # Writing to a local file for demo purposes\n",
" # outfile=f\"s3://my-bucket/{short_name}-metadata.json\", # We could also write to a remote file\n",
" access=\"indirect\",\n",
" kerchunk_options={\"concat_dims\": \"Time\"}\n",
" )\n",
" print(f\"Consolidated metadata written to {outfile}\")\n",
"\n",
" # Load the dataset using the consolidated metadata file\n",
" fs = earthaccess.get_fsspec_https_session()\n",
" ds = xr.open_dataset(\n",
" \"reference://\",\n",
" engine=\"zarr\",\n",
" chunks={},\n",
" backend_kwargs={\n",
" \"consolidated\": False,\n",
" \"storage_options\": {\n",
" \"fo\": outfile,\n",
" \"remote_protocol\": \"https\",\n",
" \"remote_options\": fs.storage_options,\n",
" }\n",
" },\n",
" )\n",
"\n",
" result = ds.SLA.mean({\"Latitude\", \"Longitude\"}).compute()\n",
" print(f\"{result = }\")"
]
},
{
"cell_type": "markdown",
"id": "b2bc7f3c-b8d8-43b5-9954-7e1a8647b804",
"metadata": {},
"source": [
"### Using Chelle's [Tutorial for MUR SST on AWS](https://github.com/pangeo-gallery/osm2020tutorial/blob/master/AWS-notebooks/aws_mur_sst_tutorial_long.ipynb) as reference to build a Zarr store from 10 years of monthly data from MUR."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cfcc55ce-0baf-4d80-97c5-18dafb5d4207",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"EARTHDATA_USERNAME and EARTHDATA_PASSWORD are not set in the current environment, try setting them or use a different strategy (netrc, interactive)\n",
"You're now authenticated with NASA Earthdata Login\n",
"Using token with expiration date: 10/23/2023\n",
"Using .netrc file for EDL\n",
"Searching for granules on 2012\n",
"Granules found: 31\n",
"Searching for granules on 2013\n",
"Granules found: 31\n",
"Searching for granules on 2014\n",
"Granules found: 31\n",
"Searching for granules on 2015\n",
"Granules found: 31\n",
"Searching for granules on 2016\n",
"Granules found: 31\n",
"Searching for granules on 2017\n",
"Granules found: 31\n",
"Searching for granules on 2018\n",
"Granules found: 31\n",
"Searching for granules on 2019\n",
"Granules found: 31\n",
"Searching for granules on 2020\n",
"Granules found: 31\n",
"Searching for granules on 2021\n",
"Granules found: 31\n",
"Total granules to process: 310\n",
"Consolidated metadata written to ./direct-MUR-metadata.json\n"
]
}
],
"source": [
"if __name__ == \"__main__\":\n",
"\n",
" # Authenticate my machine with `earthaccess`\n",
" earthaccess.login()\n",
" \n",
" doi = \"10.5067/GHGMR-4FJ04\"\n",
" short_name = \"MUR\"\n",
" month = 7\n",
" \n",
" results = []\n",
" \n",
" for year in range(2012,2022):\n",
" \n",
" params = {\n",
" \"doi\": doi,\n",
" \"cloud_hosted\": True,\n",
" \"temporal\": (f\"{str(year)}-{str(month)}-01\", f\"{str(year)}-{str(month)}-31\"),\n",
" \"count\": 31\n",
" }\n",
"\n",
" # Retrieve data files for the dataset I'm interested in\n",
" print(f\"Searching for granules on {year}\")\n",
" granules = earthaccess.search_data(**params)\n",
" results.extend(granules)\n",
" print(f\"Total granules to process: {len(results)}\")\n",
"\n",
" # Create a local Dask cluster for parallel metadata consolidation\n",
" # (but works with any Dask cluster)\n",
" cluster = LocalCluster()\n",
" client = cluster.get_client()\n",
"\n",
" # Save consolidated metdata file\n",
" outfile = earthaccess.consolidate_metadata(\n",
" results,\n",
" outfile=f\"./direct-{short_name}-metadata.json\", # Writing to a local file for demo purposes\n",
" # outfile=f\"s3://my-bucket/{short_name}-metadata.json\", # We could also write to a remote file\n",
" access=\"direct\",\n",
" # kerchunk_options={\"coo_map\": []}\n",
" kerchunk_options={\"concat_dims\": \"time\"}\n",
" )\n",
" print(f\"Consolidated metadata written to {outfile}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9bfcfa7-58b2-4fe9-95d2-44b2b352ff94",
"metadata": {},
"outputs": [],
"source": [
"earthaccess.login()\n",
"\n",
"fs = earthaccess.get_s3fs_session(\"GES_DISC\")\n",
"\n",
"ds = xr.open_dataset(\n",
" \"reference://\",\n",
" engine=\"zarr\",\n",
" chunks={},\n",
" decode_coords=False, # tricky, the coords are there but encoded in a way xarray can't decode for some reason. Similar to https://github.com/fsspec/kerchunk/issues/177\n",
" backend_kwargs={\n",
" \"consolidated\": False,\n",
" \"storage_options\": {\n",
" \"fo\": \"direct-MUR-metadata.json\",\n",
" \"remote_protocol\": \"s3\",\n",
" \"remote_options\": fs.storage_options,\n",
" }\n",
" },\n",
")\n",
"ds"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment