Although the `DriverManager` code initially seemed to address the need for extensibility in Data Cube, when we tried to use it in the NCI/GA deployment we discovered a few show-stopper issues, including:
- Passing the `DriverManager` around everywhere broke distributed bulk processing.
- Database connections were being created unnecessarily in workers that simply needed to load pre-specified data.
- While working with files stored on S3, we ran into conflicts with the S3+Block driver, which was registered to handle the `s3://` protocol.
As a result of these issues, and plenty of diversions onto other things, we've had a 6+ month fork between `datacube-core/develop` and `datacube-core/release-1.5`, which GA has been maintaining in support of their production environment.
In early December, @omad, @petewa, @rtaib and @Kirill888 discussed some potential solutions. These discussions are documented in GitHub discussions and in notes from a videoconference meeting (listed at the end of this description). Since then, @omad and @Kirill888 have been working on implementing the proposed changes.
This Pull Request is our proposed implementation.
A lightweight driver loading system has been implemented in `datacube.drivers.driver_cache`, which uses the setuptools dynamic service and plugin discovery mechanism to name and define the available drivers. This code caches the drivers available in the current environment and allows them to be loaded on demand, while handling any failures due to missing dependencies or other environment issues.
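For illustration, driver discovery via setuptools entry points typically looks something like the following minimal sketch (this is not the actual `driver_cache` code; the group name shown is one of the groups described below, and the error handling is simplified):

```python
import pkg_resources


def discover_drivers(group='datacube.plugins.io.read'):
    """Load every driver registered under the given entry point group."""
    drivers = {}
    for entry_point in pkg_resources.iter_entry_points(group=group):
        try:
            init_driver = entry_point.load()           # e.g. init_reader_driver
            drivers[entry_point.name] = init_driver()  # instantiate the driver
        except Exception as error:                     # e.g. missing optional deps
            print('Skipping driver %s: %s' % (entry_point.name, error))
    return drivers
```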
Entry point group: `datacube.plugins.index`
A connection to an `Index` is required to find data in the Data Cube. The concept of *environments*, a named set of configuration parameters used to connect to an `Index`, was already implemented in the `develop` branch. This PR extends it with an `index_driver` parameter, which specifies the name of the Index Driver to use. If this parameter is missing, it falls back to using the default PostgreSQL Index.
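As an illustration, a hypothetical environment in the Data Cube configuration file could select a non-default index driver like this (the environment name, connection settings and driver name below are made up; `index_driver` is the new parameter):

```ini
[s3-environment]
db_hostname: db.example.com
db_database: datacube
index_driver: s3block_index
```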
The default `Index` uses a PostgreSQL database for all storage and retrieval.
Implementation: `datacube/drivers/s3block_index/index.py`

The S3+Block driver subclasses the default PostgreSQL Index with support for saving additional data about the size and shape of chunks stored in S3 objects. As such, it implements an identical interface, while overriding the `dataset.add()` method to save the additional data.
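A rough sketch of that subclassing approach (this is not the actual implementation; the class and helper names below are invented for illustration) might look like:

```python
class PgDatasetResource:
    """Stand-in for the default PostgreSQL dataset resource."""

    def add(self, dataset, **kwargs):
        ...  # normal PostgreSQL indexing


class S3BlockDatasetResource(PgDatasetResource):
    """Same interface as the default resource, plus S3 chunk bookkeeping."""

    def add(self, dataset, **kwargs):
        saved = super().add(dataset, **kwargs)  # index the dataset as usual
        self._save_chunk_metadata(dataset)      # then record the extra S3 data
        return saved

    def _save_chunk_metadata(self, dataset):
        ...  # hypothetical helper: store chunk size and shape for the dataset
```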
Entry point group: `datacube.plugins.io.read`
Read plug-ins are specified as supporting particular URI protocols and formats, both of which are fields available on existing `Dataset`s. A Read Driver returns a `DataSource` implementation, which is chosen based on:
- Dataset URI (protocol part, eg: `s3+block://`)
- Dataset format, as stored in the Data Cube `Dataset`
- Current system settings
- Available IO plugins
If no specific `DataSource` can be found, a default `RasterDatasetDataSource` is returned, which uses `rasterio` to read from the local file system or a network resource.
The `DataSource` maintains the same interface as before, which works at the individual dataset + time + band level for loading data. This is something to be addressed in the future.
```python
def init_reader_driver():
    # Entry point target: returns an instance of the driver
    return AbstractReaderDriver()


class AbstractReaderDriver(object):
    def supports(self, protocol: str, fmt: str) -> bool:
        # Can this driver read data stored with this protocol and format?
        pass

    def new_datasource(self, dataset, band_name) -> DataSource:
        # Return a DataSource for reading one band of the given dataset
        pass


class AbstractDataSource(object):  # Same interface as before
    ...
```
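To show how these pieces could fit together, here is a hedged sketch of driver selection (not actual datacube code; the attribute access on the dataset and the `fallback` factory are simplifications):

```python
def choose_datasource(dataset, band_name, read_drivers, fallback):
    """Pick a DataSource for one band of a dataset.

    `read_drivers` is a dict of loaded read drivers (e.g. from entry point
    discovery); `fallback` is the default rasterio-based source factory.
    """
    protocol = dataset.uris[0].split(':', 1)[0]  # e.g. 's3+block'
    fmt = dataset.format                         # e.g. 's3block'
    for driver in read_drivers.values():
        if driver.supports(protocol, fmt):
            return driver.new_datasource(dataset, band_name)
    return fallback(dataset, band_name)          # e.g. RasterDatasetDataSource
```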
URI protocol: `s3+block://`
Dataset format: `s3block`
Implementation location: `datacube/drivers/s3/driver.py`
An example IO plug-in is available in `/examples/io_plugin`. It includes an example `setup.py` as well as example Read and Write Drivers.
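A cut-down `setup.py` in the spirit of that example, showing how a third-party package could register its drivers (the package, module and driver names here are made up), might look like:

```python
from setuptools import setup

setup(
    name='my_io_plugin',
    py_modules=['my_io_plugin'],
    entry_points={
        'datacube.plugins.io.read': [
            'mydriver = my_io_plugin:init_reader_driver',
        ],
        'datacube.plugins.io.write': [
            'mydriver = my_io_plugin:init_writer_driver',
        ],
    },
)
```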
Entry point group: `datacube.plugins.io.write`
Write plug-ins are selected based on their name. The `storage.driver` field, which specifies the name of the write driver to use, has been added to the ingestion configuration file. Drivers can specify a list of names that they can be known by, as well as publicly defining their output format; however, this information isn't used by the ingester to decide which driver to use. Not specifying a driver counts as an error; there is no default.
At this stage there is no decision on what sort of public API to expose, but the `write_dataset_to_storage()` method implemented in each driver is the closest we've got. The ingester uses it to write data.
```python
def init_writer_driver():
    return AbstractWriterDriver()


class AbstractWriterDriver(object):
    @property
    def aliases(self):
        return []  # List of names this writer answers to

    @property
    def format(self):
        return ''  # Format that this writer supports

    def write_dataset_to_storage(self, dataset, filename,
                                 global_attributes=None,
                                 variable_params=None,
                                 storage_config=None,
                                 **kwargs):
        ...
        return {}  # Can return extra metadata to be saved in the index with the dataset
```
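For context, the ingester side of this contract can be pictured roughly as follows (a hedged sketch, not actual ingester code; the driver lookup and names are assumptions):

```python
def write_with_driver(writer_drivers, driver_name, dataset, filename, storage_config):
    """Write one dataset with the named write driver."""
    driver = writer_drivers[driver_name]  # no default: a missing driver is an error
    extra_metadata = driver.write_dataset_to_storage(
        dataset, filename, storage_config=storage_config)
    return extra_metadata                 # saved in the index along with the dataset
```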
Name: `s3block`
Protocol: `s3+block`
Format: `s3block`
Implementation: `datacube/drivers/s3/driver.py`
Names: `netcdf`, `NetCDF CF`
Format: `NetCDF`
Implementation: `datacube/drivers/netcdf/driver.py`
We've decided to revert the changes to `datacube ingest` which were added to support ingesting to a 3D chunk in S3. We know this is an essential feature for the S3 Block storage system, but we would prefer it to be implemented as a separate command. Our issue is that it doesn't support incremental updates when datasets have been added or changed.
Being able to incrementally add or change datasets and then ingest them is vital for the NCI/GA implementation of Data Cube. This is the reason we have separate tools for `ingest` (which deals with a single dataset at a single time, and so works fine with incremental updates) and `stack` (which is responsible for taking a period of time and re-storing it in deep-time storage units).
Being able to update storage blocks involves all sorts of thorny issues, and the simple implementation didn't address any of them, which could lead to confusion.
We have renamed the protocol used for the S3 driver to `s3+block://`. We're starting to use files stored in S3, which is supported by many tools out of the box using the standard `s3://` protocol name.
Added the `index_driver` parameter.
The Write Driver to use must now be specified. For S3 ingestion there was a top-level `container` setting, which has been renamed and moved under `storage`.
The entire `storage` section is passed through to the Write Driver, so drivers requiring other configuration options can include them there, eg:
```yaml
...
storage:
  ...
  driver: s3block
  bucket: my_s3_bucket
  ...
```
- Pluggable Back Ends Discussion [7 December 2017]
- Teleconference with @omad @petewa @rtaib @Kirill888 on 12 December 2017.
- Notes from ODC Storage and Index Driver Meeting