Tested with Kedro 0.17.1 and Hydra 1.0.6.
Author: Martin Sotir
Kedro is a lightweight library and project template designed for fast collaborative prototyping of data-driven pipelines. Kedro formalizes project configuration (parameters, credentials, data catalog) and pipeline definition and execution (from Python functions), and provides an extensible command-line entry point (a Python makefile of sorts).
Kedro can be used from a notebook, using a `KedroContext` object to access the project configuration and data catalog, or from the command line, to run registered pipelines (each defined as a list of functions with specified inputs, outputs, and parameters).
Kedro encodes data science and data pipeline good practices as code in the most minimal way: each feature is kept as simple and bare-bones as possible. Because of this "skeletal" shape, Kedro will not fit every research and production need, but it is very easy to appropriate and extend to your needs. For example, the Kedro pipeline feature is far less featured than most production-ready solutions (Prefect, Dagster, Airflow, etc.), but as long as the core principles are respected (the separation of concerns between data loading, processing, and pipeline execution, the data engineering conventions, etc.), adding features or transitioning to another technology will not be a headache.
For these reasons, I think Kedro is a good first step before building a custom in-house ML system or switching to a more production-ready solution with all the bells and whistles (e.g., MLRun, Metaflow, Kubeflow, or any combination of the booming MLOps tool ecosystem). I feel that Kedro is particularly suited for R&D teams working on many industrial ML pipeline prototypes (e.g., in R&D service firms), aiming to reuse code, disseminate data science best practices, and facilitate collaboration with product development teams, while staying as agnostic as possible about the technology used in the end.
A typical Kedro project configuration looks like this:
```
── conf
   ├── base
   │   ├── catalog.yml
   │   ├── logging.yml
   │   ├── parameters.yml
   │   └── experiment1
   │       └── parameters.yml
   └── local
       ├── catalog.yml
       ├── credentials.yml
       └── parameters.yml
```
- Any number of environments can be created (though only two environments are loaded at the same time: `base` + the active environment).
- Configuration files are loaded by a `ConfigLoader` class with a simple API: the configuration loader is first provided with a list of root configuration paths, usually the `base` and the current active environment (`local` by default). Then the `ConfigLoader` has a single `get(*patterns: List[str])` method, where `patterns` are used to match configuration files within the root directories (see the sketch after this list).
- The minimal configuration structure is enforced by the `KedroContext` class. This class loads the `catalog`, `credentials`, `logging` and `parameters` configuration files using semi-flexible file patterns. For instance, when loading parameters, the `KedroContext` class uses `conf_loader.get("parameters*", "parameters*/**", "**/parameters*")`: this matches `parameters.yml` but also `parameters.json`, `experiment1/parameters.yml`, `parameters/model1.toml`, etc. Files matched in a `get` query are merged: duplicated keys in the same environment will raise an error, while keys in the active environment take precedence over keys in the `base` environment. (Note: directory paths are not taken into account when loading the configuration; every configuration file is loaded as root-level config.)
- `catalog`, `credentials`, `logging` and `parameters` are not merged together: they are loaded at different times and often interpreted separately. The `logging` configuration is loaded when a `KedroSession` starts, the `catalog` and `credentials` configurations are loaded when the dataset catalog is required (to run a pipeline), and the `parameters` configuration is provided to pipeline nodes on demand (each node specifies which part of the parameter configuration it needs).
- Usually, pipeline nodes do not have direct access to information in the `catalog` and `credentials` configuration: this enforces a "separation of concerns" between nodes (concern: how to transform inputs into outputs) and catalog "dataset" entries (concern: how and where to load/write data).
- The configuration can be extended: Kedro applications and plugins have access to the `ConfigLoader` to load additional files. For instance, the kedro-mlflow plugin loads the MLflow configuration with `conf_loader.get("mlflow*")`.
- The Kedro config loader is based on anyconfig and can read .yaml, .json, .ini, .toml, .xml (and more) configuration files. Kedro also provides a `TemplatedConfigLoader`, extending the `ConfigLoader` with Jinja templating syntax (string interpolation, loops, etc.).
- Parameters can be overridden from the command line when running a pipeline, with the `kedro run --params key=value` syntax, or by passing a `dict` to the `extra_params` parameter of `session.create(...)` (e.g., from a notebook).
Note that Kedro project settings are not defined in the main `conf` directory.
In Kedro, there is no dedicated feature to easily switch between configuration options for "sub-modules" of the configuration (e.g., to switch between several predefined model or optimizer parameter sets). There are, however, several non-straightforward ways to achieve such a mechanism.
First, we can leverage Kedro's configuration environment feature. However, environments affect the whole configuration, and there is no support for nested/hierarchical environments in Kedro.
The more generic approach is usually to define mutually exclusive parameter sets in a dictionary or a list, in the same YAML file or spread across several YAML files (taking care not to create duplicated keys). An option is then selected dynamically at runtime, depending on another parameter from the configuration or on command-line arguments (handled by the `kedro run` command).
Example:
```yaml
# File: parameters.yml
model: xgboost  # default, can be overridden by the --params argument of `kedro run`

# File: parameters/models/xgboost.yml
xgboost:
  n_estimators: 10000
  learning_rate: 0.01
  max_depth: 6

# File: parameters/models/randomforest.yml
randomforest:
  n_estimators: 200
  max_depth: 8
  max_features: "sqrt"
```
When instantiating a `Pipeline` (or within a notebook):
```python
from kedro.framework.session import get_current_session
from kedro.pipeline import Pipeline, node

def make_pipeline():
    ctx = get_current_session().load_context()
    model = ctx.params['model']  # 'xgboost' or 'randomforest'
    return Pipeline([
        node(train,
             inputs={'dataset': 'train_set', 'model_config': f"params:{model}"},
             outputs=['report'])
    ])

def train(dataset, model_config):
    ...
```
(the syntax `params:<param_key>` is used to pass parameters to Kedro pipeline nodes).
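With this setup, the model parameter set can be switched without editing any YAML file, e.g. with `kedro run --params model=randomforest`, or by passing `extra_params={'model': 'randomforest'}` to `session.create(...)` from a notebook.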
Another option would be to take advantage of the Jinja templating capabilities of the `TemplatedConfigLoader`.
With a powerful configuration loader and auto-generated program entry point, Hydra focuses on increasing researcher productivity while encouraging experiment reproducibility.
Hydra seems targeted at researchers, especially ML researchers who may run thousands of trials for the same task, adjusting a set of hyperparameters each time. Hydra provides a powerful hierarchical configuration loader and automatically generates application entry points that allow scientists to override and discover hyperparameters.
Like Kedro, Hydra is extensible and offers functionality beyond configuration and entry points: remote runners, logging utilities, bridges to hyperparameter search tools, etc.
The scope of Hydra is narrower and quite different from Kedro's: Hydra has no notion of a data catalog, no pipelines, no credentials management, no imposed folder structure. This makes Hydra more flexible and less cumbersome than Kedro for projects where those features are not needed. Hydra seems an excellent tool for experimentation and for developing ML models in 'isolation', e.g., for specialized benchmarks with stable and well-defined input/output data.
Hydra configurations are hierarchical, with a single root configuration file for each application/entry point (`config.yaml` in the example below):
```
── conf
   ├── config.yaml
   ├── db
   │   ├── mysql.yaml
   │   └── postgresql.yaml
   ├── schema
   │   ├── school.yaml
   │   ├── support.yaml
   │   └── warehouse.yaml
   └── ui
       ├── full.yaml
       └── view.yaml
```
The Hydra configuration loader is built on OmegaConf (which shares its author with Hydra). OmegaConf extends the YAML format with conventions and features designed to simplify the definition of complex software configurations:
- By default, OmegaConf loads YAML configurations into Python built-in types (dicts and lists), but it also provides experimental support for "structured" configs, where parts of the configuration are instantiated as user-defined Python classes.
- OmegaConf provides variable references and value interpolation features (having a parameter value, or part of a parameter value, depend on another parameter value). These are very useful to avoid repeating values while keeping the configuration well structured (for instance, in deep learning, parameters like the batch size, number of epochs, and learning rate are often used by multiple entities: the training loop, optimizer, LR scheduler, etc.). String interpolation is also useful to build experiment and run names from configuration parameters without any extra logic in the code. OmegaConf string interpolation can also retrieve system environment variables (see the sketch after this list).
- Mandatory values and read-only configurations.
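To make the interpolation features concrete, here is a small OmegaConf sketch with hypothetical values (OmegaConf 2.0 syntax, the version used by Hydra 1.0):

```python
from omegaconf import OmegaConf

conf = OmegaConf.create("""
train:
  batch_size: 32
  epochs: 10
optimizer:
  batch_size: ${train.batch_size}  # value interpolation: reuse the training batch size
run_name: "bs${train.batch_size}_ep${train.epochs}"  # string interpolation
data_dir: ${env:DATA_DIR}  # resolved from a system environment variable
""")

assert conf.optimizer.batch_size == 32
assert conf.run_name == "bs32_ep10"
```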
On top of OmegaConf, Hydra adds the ability to compose configurations from multiple sources (leveraging the merge utilities in OmegaConf); a minimal entry-point sketch follows this list:

- Directories as mutually exclusive parameter sets (config groups)
- Defaults lists
- Scopes (packages)
- Configuration loading in entry points, with experimental support to load configuration programmatically (e.g., from a notebook)
- A configurable syntax to override parameters from the command line (+ autocompletion)
- Hydra settings and job configuration
- The ability to generate parameter lists and search spaces
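A minimal Hydra 1.0 entry point, assuming the `conf/` tree shown earlier, looks like this (the composed configuration is printed for illustration):

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config")
def my_app(cfg: DictConfig) -> None:
    # cfg is the full composed configuration (defaults + command-line overrides)
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    my_app()
```

Group choices can then be overridden from the command line, e.g. `python my_app.py db=postgresql ui=view`, with autocompletion suggesting the available options.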
Beyond its capabilities, Hydra is interesting because of its attractiveness to researchers, whereas Kedro can be very frustrating because of its imposed (but well-meaning) structure and limitations.
When experiments grow, when they get closer to industrial applications, and when models become pipelines, it becomes useful to organise the configuration in a way that keeps complexity under control.
In theory, bringing Hydra configuration to Kedro could:

- Increase Kedro's usability for ML researchers.
- Facilitate the transition from Hydra projects to Kedro.
- Bring additional features to Kedro projects: for example, an extended syntax for parameter overrides and parameter auto-completion.
The first approach is to override the `ConfigLoader._load_config_file` method so that each configuration file is loaded as a Hydra config:
```python
from pathlib import Path
from typing import Any, Dict

from kedro.config import ConfigLoader

class HydraConfigLoaderMinimal(ConfigLoader):

    def _load_config_file(self, config_file: Path) -> Dict[str, Any]:
        from hydra.experimental import initialize_config_dir, compose
        from omegaconf import OmegaConf
        # Parse the file with Hydra, using its parent directory as config root:
        with initialize_config_dir(config_dir=str(config_file.parent), job_name="app"):
            conf = compose(config_name=config_file.name, overrides=[])
            resolved_conf = OmegaConf.to_container(conf, resolve=self.resolve_interpolation)
            # Keys prefixed with "_" are hidden from the final configuration:
            return {k: v for k, v in resolved_conf.items() if not k.startswith("_")}
```
This code works but has been simplified; check the recommended implementation here: hydra_config_loader_minimal.py.
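For reference, registering this loader in the project's `hooks.py` could look like the following sketch (mirroring the `register_config_loader` hook signature used in the full example later in this post):

```python
from typing import Any, Dict, Iterable

from kedro.config import ConfigLoader
from kedro.framework.hooks import hook_impl

class ProjectHooks:
    @hook_impl
    def register_config_loader(
        self, conf_paths: Iterable[str], env: str, extra_params: Dict[str, Any]
    ) -> ConfigLoader:
        # Use the Hydra-backed loader instead of the default ConfigLoader:
        return HydraConfigLoaderMinimal(conf_paths)
```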
//TODO
If your Kedro configuration uses the string interpolation features of the `TemplatedConfigLoader`, a few edits are needed. First, import the globals variables file (`globals.yml`) inside each configuration file using the Hydra defaults list directive:
```yaml
defaults:
  - globals.yml  # <-- import globals.yml variables

example_iris_data:
  type: pandas.CSVDataSet
  filepath: "${_directories.raw}/iris.csv"  # String interpolation from imported variables
```
Note: if you are using the `.yaml` file extension instead of `.yml`, you need to drop the file extension from the `globals.yml` entry: just set `- globals` (Hydra prefers the `.yaml` file extension).
After this change, the Hydra configuration loader will raise a warning directing us to set the `globals.yml` package scope explicitly. To suppress this warning, we just need to add the `# @package _global_` directive in `globals.yml`:
```yaml
# @package _global_
_directories:
  raw: "./data/01_raw"
  interim: "/tmp/02_interim"
  processed: "/data/03_processed"
```
Note that we prefix global keys with an underscore to hide them from the final configuration structure (globals variables are only used for string interpolation).
✅ Can be used as a drop-in replacement for the Kedro `ConfigLoader`: most existing configurations should load unchanged.
✅ Mutually exclusive configuration groups: https://hydra.cc/docs/terminology#config-group
✅ Adds support for OmegaConf interpolation patterns (including access to environment variables and the Hydra configuration).
❌ No easy way to specify Hydra configuration overrides from the `kedro run` command-line tool. Parameters provided with the `--params` option will override the final configuration parameters but won't impact Hydra group choices (nor interpolation). This means that there is no way to change Hydra group choices apart from editing the `defaults` lists in the YAML files.
Last-resort hack: a workaround, if you really want to apply override commands to *one* particular file in your configuration (use at your own risk). Add a `hydra_overrides` argument in `cli.py`:
```diff
 @click.option(
     "--params", type=str, default="", help=PARAMS_ARG_HELP, callback=_split_params
 )
+@click.argument('hydra_overrides', nargs=-1)
 def run(
     tag,
     env,
     parallel,
     runner,
     is_async,
     node_names,
     to_nodes,
     from_nodes,
     from_inputs,
     to_outputs,
     load_version,
     pipeline,
     config,
     params,
+    hydra_overrides,
 ):
```
All CLI arguments are registered in the Kedro session object and can be retrieved within the `HydraConfigLoaderMinimal`:
```diff
 def _load_config_file(self, config_file: Path, overrides: List[str] = []) -> Dict[str, Any]:
     from hydra.experimental import compose, initialize_config_dir
     from omegaconf import OmegaConf
     overrides = overrides + self.global_overrides
+    override_path_pattern = Path(self.conf_paths[0]) / 'parameters.yml'
+    session = get_current_session(silent=True)
+    if session and (config_file.resolve() == override_path_pattern.resolve()):
+        overrides.extend(session.store['cli']['params']['hydra_overrides'])
     with initialize_config_dir(config_dir=str(config_file.parent), job_name=self.job_name):
         ...
```
Here the Hydra overrides will be applied only to the `base/parameters.yml` file.
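With this change, Hydra-style overrides can be appended directly to the command, e.g. `kedro run model=randomforest` (reusing the `model` key from the parameter files shown earlier); again, they will only affect `base/parameters.yml`.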
❌ Does not work well with Kedro config environments, as each file is parsed by Hydra independently.
❌ Powerful but complex: with several environments, each of which can have several root configuration files, themselves parsed as Hydra configurations that can have nested parameter group files, we may have created a monster!
This time we load the Kedro configuration from a single Hydra root configuration file.
The trick to make the Hydra configuration work with Kedro glob path patterns (the `conf_loader.get("parameters/**")` syntax) is to convert keys in the configuration into paths. For instance, the configuration entry `conf['catalog']['iris']['filepath']` is associated with the path `catalog/iris/filepath` and will be returned by the `.get("catalog/**")` call. This approach should remain compatible with most Kedro features and plugins.
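To make the key-to-path idea concrete, here is a hypothetical helper (not the actual loader implementation) that flattens nested keys into slash-separated paths and matches them with glob-style patterns:

```python
from fnmatch import fnmatch
from typing import Any, Dict

def flatten(conf: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Map nested configuration keys to slash-separated paths."""
    flat: Dict[str, Any] = {}
    for key, value in conf.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}/"))
        else:
            flat[path] = value
    return flat

conf = {"catalog": {"iris": {"filepath": "data/iris.csv"}}}
flat = flatten(conf)
# {'catalog/iris/filepath': 'data/iris.csv'} matches the "catalog/**" pattern:
assert all(fnmatch(path, "catalog/**") for path in flat)
```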
We must first convert our Kedro configuration into a Hydra-compatible one:

- The catalog entries must be set inside a 'catalog' root dictionary in the Hydra configuration (the format remains the same as for Kedro: https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html). To keep a separate `catalog.yaml` file, we can leverage the Hydra defaults list feature (see the example below).
- The same applies to 'parameters', 'logging', 'credentials', and any other Kedro root configuration files.
- The Kedro configuration environment system is completely replaced by a Hydra parameter group 'env'. A configuration file must be defined for each environment (except 'base') in the 'env' directory. These files can override any setting from other configuration files.
For instance:

- `conf/config.yaml` (single root configuration file):

```yaml
defaults:  # Ordered list of parameter groups (order is important for overrides).
  - catalog
  - parameters
  - _self_  # This file's configuration
  - env: local  # Last, loads environment overrides (local by default)

directories:
  raw: "./data/01_raw"
  interim: "/tmp/02_interim"
  processed: "/data/03_processed"
```

- `conf/catalog.yaml` (note how we set the package scope with the `# @package <scope>` directive):

```yaml
# @package catalog
example_iris_data:
  type: pandas.CSVDataSet
  filepath: "${directories.raw}/iris.csv"
```

- `conf/env/local.yaml` (this time using a global scope):

```yaml
# @package _global_

# Override parameters:
parameters:
  example_test_data_ratio: 0.5

# Override catalog entry configuration:
catalog:
  example_iris_data2:
    filepath: "../data/iris.csv"
```
Notes:
- Hydra works better with the `.yaml` file extension (rather than `.yml`). When using an extension other than `.yaml`, the file extension must be set explicitly in defaults lists (otherwise Hydra throws a "Could not load ..." error).
- Patterns given to the `get` method can also match nested keys in the Hydra configuration. The depth of the lookup can be controlled with the `lookup_depth` parameter (1 by default).
- Root keys are not included in the returned configuration dict (the 'parameters' and 'catalog' keys will not appear in the configuration; only sub-dictionary keys and values are returned). This means that any parameter defined in the root configuration file cannot be accessed directly using the `.get(*patterns)` method; however, these values can still be used for string interpolation within other configuration files. The root configuration file can then replace the `globals.yml` file from the Kedro `TemplatedConfigLoader` class.
Next, we just need to register the Hydra config loader in `hooks.py`, adding the env variable to the Hydra overrides list:
```python
@hook_impl
def register_config_loader(
    self, conf_paths: Iterable[str], env: str, extra_params: Dict[str, Any]
) -> ConfigLoader:
    conf_root = Path(list(conf_paths)[0]).parent
    return FullHydraConfigLoader(conf_root=conf_root, overrides=[f'env={env}'])
```
First, make sure that autocompletion is enabled in your shell. For bash, we need to add the line `eval "$(_KEDRO_COMPLETE=source kedro)"` to `.bashrc` (refer to the Kedro documentation for other shells).
Next <>
✅ Simpler configuration structure, without environment directories
✅ One unique Hydra configuration
✅ Minimal edits to the configuration; separation of concerns preserved
❌ Configuration not backward compatible with the Kedro `ConfigLoader`
❌ Error-prone translation of path patterns