@martindurant
Created July 10, 2018 21:06
sources:
  glob_source:
    description: glob of files
    driver: csv
    cache:
      - argkey: urlpath
        regex: 's3://bucket'
        sub: {{ CACHE_DIR }}
    args:
      urlpath: 's3://bucket/example*.csv'
  single_file:
    description: a file that can't be read directly from remote
    driver: netcdf
    cache:
      - argkey: urlpath
        regex: 's3://bucket'
        sub: {{ CACHE_DIR }}
        required: true
    args:
      urlpath: 's3://bucket/data.nc'
      chunks: {x: 50}
  nested:
    description: known data tree
    driver: parquet
    cache:
      - argkey: urlpath
        regex: 's3://bucket/data.parquet'
        sub: {{ CACHE_DIR }}
        files:
          - '_metadata'
          - '*/cat*/part.*.parquet'
    args:
      urlpath: 's3://bucket/data.parquet'
  complex:
    description: any number of levels
    driver: zarr
    cache:
      - argkey: urlpath
        regex: 'gcs://bucket/mydata.zarr'
        depth: 3  # levels of globbing to try
        sub: {{ CACHE_DIR }}
    args:
      urlpath: 's3://bucket/data.zarr'
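
As a rough illustration of how the regex and sub keys in the spec above might be applied to the argument named by argkey, here is a minimal Python sketch. The helper name cached_path and the CACHE_DIR value are hypothetical and chosen only for this example; this is not Intake's actual implementation or API.

```python
import re

# Assumed cache location, for illustration only; the spec leaves
# {{ CACHE_DIR }} to be filled in by the templating step.
CACHE_DIR = "/tmp/intake_cache"


def cached_path(urlpath, regex, sub):
    """Hypothetical helper: rewrite a remote urlpath to its cache location.

    The first match of `regex` in `urlpath` is replaced by `sub`. A callable
    replacement is used so that backslashes in `sub` (for example in Windows
    paths) are inserted literally instead of being read as escapes.
    """
    return re.sub(regex, lambda match: sub, urlpath, count=1)


# glob_source: the bucket prefix is swapped for the cache directory and the
# glob pattern is carried over unchanged.
local_glob = cached_path("s3://bucket/example*.csv", "s3://bucket", CACHE_DIR)

# If the regex does not match the urlpath, re.sub returns the input as-is,
# so no cache path would be produced.
unmatched = cached_path("s3://bucket/data.zarr", "gcs://bucket/mydata.zarr", CACHE_DIR)
```

Under these assumptions, local_glob points into the cache directory, while unmatched is left as the original remote URL. That second case is what the complex entry would give as written, since its regex names gcs://bucket/mydata.zarr but its urlpath is s3://bucket/data.zarr.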
@martindurant

sub: {{ CACHE_DIR }} could just be the default, so it would not need to be repeated in every cache entry.
