Reverting DriverManager and implementing alternatives

Reason for Pull Request

Although the DriverManager code initially seemed to address the need for extensibility in Data Cube, upon trying to use it in the NCI/GA deployment we discovered a few show-stopper issues, including:

  • Passing the DriverManager around everywhere broke the distributed bulk processing.
  • Database connections were being created unnecessarily in workers which simply needed to load pre-specified data.
  • While working with files stored on S3, we ran into conflicts with the S3+Block driver which was registered to handle the s3:// protocol.

As a result of these issues, and plenty of diversions onto other things, we've had a 6+ month fork between datacube-core/develop and datacube-core/release-1.5, which GA has been maintaining in support of their production environment.

In early December @omad, @petewa, @rtaib and @Kirill888 discussed some potential solutions. These discussions are documented in GitHub discussions and in notes from a videoconference meeting. Since then @omad and @Kirill888 have been working on implementing the proposed changes.

This Pull Request is our proposed implementation.

Support for Plug-in drivers

A lightweight driver loading system has been implemented in datacube.drivers.driver_cache, which uses the setuptools dynamic service and plugin discovery mechanism to name and define the available drivers. This code caches the available drivers in the current environment and allows them to be loaded on demand, as well as handling any failures due to missing dependencies or other environment issues.
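As a rough sketch, a third-party package would register its drivers with setuptools entry points in its setup.py, using the entry point groups described in the sections below. The package, module and driver names here are made up for illustration:

# setup.py of a hypothetical third-party driver package (all names are illustrative only)
from setuptools import setup, find_packages

setup(
    name='datacube-example-driver',
    packages=find_packages(),
    entry_points={
        'datacube.plugins.index': [
            # hypothetical module path and init function for an index driver
            'example = example_driver.index:init_index_driver',
        ],
        'datacube.plugins.io.read': [
            'example = example_driver.reader:init_reader_driver',
        ],
        'datacube.plugins.io.write': [
            'example = example_driver.writer:init_writer_driver',
        ],
    },
)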

Index Plug-ins

Entry point group: datacube.plugins.index

A connection to an Index is required to find data in the Data Cube. The develop branch already had the concept of environments, which are named sets of configuration parameters used to connect to an Index. This PR extends them with an index_driver parameter, which specifies the name of the Index Driver to use. If this parameter is missing, it falls back to the default PostgreSQL Index.
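For example, an environment in the datacube configuration file might look roughly like this (a sketch only, assuming the usual INI-style datacube.conf; the connection settings are placeholders):

[default]
db_hostname: localhost
db_database: datacube
# Optional: names the Index Driver to use; when omitted, the default PostgreSQL index is used
index_driver: default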

Default Implementation

The default Index uses a PostgreSQL database for all storage and retrieval.

S3 Extensions

Implementation: datacube/drivers/s3block_index/index.py

The S3+Block driver subclasses the default PostgreSQL Index, adding support for saving extra data about the size and shape of the chunks stored in S3 objects. It implements an identical interface, overriding only the dataset.add() method to record that extra information.
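The shape of that override looks roughly like the sketch below. The class and helper names are placeholders rather than the real ones in datacube/drivers/s3block_index/index.py; the point is that everything is inherited from the default index except the dataset-adding step:

class DefaultPostgresIndex(object):
    """Stand-in for the default PostgreSQL-backed index (placeholder, not the real class)."""
    def add_dataset(self, dataset):
        # Pretend to write the dataset record to PostgreSQL
        return dataset

class S3BlockIndexSketch(DefaultPostgresIndex):
    """Same interface as the default index, extended with S3 chunk metadata."""
    def add_dataset(self, dataset):
        # Store the dataset exactly as the default index would ...
        stored = super(S3BlockIndexSketch, self).add_dataset(dataset)
        # ... then record the extra information about the size and shape of the
        # chunks held in S3 objects
        self._save_s3_chunk_metadata(dataset)
        return stored

    def _save_s3_chunk_metadata(self, dataset):
        # Placeholder for the extra bookkeeping described above
        pass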

Data Read Plug-ins

Entry point group: datacube.plugins.io.read

Read plug-ins are specified as supporting particular URI protocols and formats, both of which are fields available on existing Datasets.

A ReadDriver returns a DataSource implementation, which is chosen based on:

  • Dataset URI (protocol part, eg: s3+block://)
  • Dataset format, as stored in the Data Cube Dataset.
  • Current system settings
  • Available IO plugins

If no specific DataSource can be found, a default RasterDatasetDataSource is returned, which uses rasterio to read from the local file system or a network resource.

The DataSource maintains the same interface as before, which works at the individual dataset+time+band level for loading data. This is something to be addressed in the future.
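Put together, the selection amounts to something like the sketch below. It is illustrative only: the attribute and helper names are made up, and the real lookup is handled by the driver cache rather than a loose function like this.

def choose_datasource(dataset, band_name, read_drivers):
    # `read_drivers` stands in for whatever the driver cache makes available
    protocol = dataset.uri_scheme      # protocol part of the dataset URI, e.g. 's3+block' (schematic attribute)
    fmt = dataset.format               # format as stored in the Data Cube Dataset
    for driver in read_drivers:
        if driver.supports(protocol, fmt):
            return driver.new_datasource(dataset, band_name)
    # No specific driver found: fall back to the existing rasterio-based default
    # (RasterDatasetDataSource; call shown schematically, import omitted)
    return RasterDatasetDataSource(dataset, band_name)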

Example code to implement a reader driver

def init_reader_driver():
    # Entry point factory: returns the driver instance registered with the plugin system
    return AbstractReaderDriver()

class AbstractReaderDriver(object):
    def supports(self, protocol: str, fmt: str) -> bool:
        # Report whether this driver can read the given URI protocol and dataset format
        raise NotImplementedError

    def new_datasource(self, dataset, band_name) -> 'DataSource':
        # Return a DataSource (the existing datacube interface) for one band of the dataset
        raise NotImplementedError

class AbstractDataSource(object):  # Same interface as before
    ...

S3 Driver

URI Protocol: s3+block://
Dataset Format: s3block
Implementation location: datacube/drivers/s3/driver.py

Example Pickle Based Driver

Available in /examples/io_plugin. Includes an example setup.py as well as example Read and Write Drivers.

Data Write Plug-ins

Entry point group: datacube.plugins.io.write

Write plug-ins are selected based on their name. The storage.driver field has been added to the ingestion configuration file to specify the name of the write driver to use. Drivers can specify a list of names that they can be known by, as well as publicly defining their output format; however, this information isn't used by the ingester to decide which driver to use. Not specifying a driver is an error: there is no default.

At this stage there is no decision on what sort of public API to expose, but the write_dataset_to_storage() method implemented in each driver is the closest we have. The ingester uses it to write data.

Example code to implement a writer driver

def init_writer_driver():
    return AbstractWriterDriver()

class AbstractWriterDriver(object):
    @property
    def aliases(self):
        return []  # List of names this writer answers to

    @property
    def format(self):
        return ''  # Format that this writer supports

    def write_dataset_to_storage(self, dataset, filename,
                                 global_attributes=None,
                                 variable_params=None,
                                 storage_config=None,
                                 **kwargs):
        ...
        return {}  # Can return extra metadata to be saved in the index with the dataset
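For context, the ingester's use of a write driver amounts to something like the following sketch. It assumes the data is handed over as an xarray Dataset; the driver instance, file name and storage settings below are placeholders, and the real ingester obtains the driver by name through the plugin system rather than calling the init function directly.

import xarray as xr

writer = init_writer_driver()                 # placeholder: the real driver is looked up by name
data = xr.Dataset()                           # placeholder for the prepared storage-unit data
storage_section = {'driver': 'netcdf'}        # the `storage` block of the ingest config (contents made up)

extra_metadata = writer.write_dataset_to_storage(
    data, 'example_tile.nc',
    global_attributes={'title': 'example'},
    variable_params={},
    storage_config=storage_section)
# Anything returned here is saved in the index alongside the dataset record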

S3 Writer Driver

Name: s3block
Protocol: s3+block
Format: s3block
Implementation: datacube/drivers/s3/driver.py

NetCDF Writer Driver

Name: netcdf, NetCDF CF
Format: NetCDF
Implementation: datacube/drivers/netcdf/driver.py

Other Changes

Removed 3D Ingestion feature

We've decided to revert the changes to datacube ingest which were added to support ingesting into 3D chunks in S3. We know this is an essential feature for the S3 Block storage system, but we would prefer it to be implemented as a separate command. Our issue is that it doesn't support incremental updates when datasets have been added or changed.

Being able to incrementally add or change datasets and then ingest them is vital for the NCI/GA implementation of Data Cube. This is the reason we have separate tools for ingest (which deals with a single dataset at a single point in time, and so works fine with incremental updates) and stack, which is responsible for taking a period of time and re-storing it in deep-time storage units.

Being able to update storage blocks involves all sorts of thorny issues, and the simple implementation didn't address any of them, which could lead to confusion.

Protocol name change from s3://

We have renamed the protocol used by the S3 Block driver to s3+block://.

We're starting to use files stored directly in S3, which many tools support out of the box using the standard s3:// protocol name, so that name needs to stay available for ordinary S3 access rather than being claimed by the block driver.

Changes when specifying the environment

Added an index_driver parameter, as described under Index Plug-ins above.

Change to Ingestion Configuration

The ingestion configuration must now specify the Write Driver to use. For S3 ingestion there was a top-level container specified, which has been renamed and moved under storage. The entire storage section is passed through to the Write Driver, so drivers requiring other configuration can include it there, e.g.:

...
storage:
  ...
  driver: s3block
  bucket: my_s3_bucket
...

References and History
