Hydra is a handy and powerful tool that can dramatically reduce boilerplate code and dynamically compose various configurations. I started with the idea that Hydra could be used in an ML pipeline as well, and this Python app is a template I quickly implemented around that idea. Feedback is always welcome.
Our ML pipeline consists of the following three steps; I think this is the minimum for an ML pipeline, and you can add other steps as you need. A minimal Python sketch of the steps follows the list.
preprocessing: prepare data
modeling: train and validate the model
deployment: deploy the model to a serving cluster
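To make the steps concrete, here is a minimal sketch of the three steps as plain Python functions sharing one composed config. The function bodies are placeholders; only the config keys (dataset, p_param, model, m_param, cluster) are taken from the example runs shown later.

# A minimal sketch of the pipeline steps (illustrative, not the repo's code).
def preprocessing(cfg):
    # Prepare the data described by cfg.dataset, using cfg.p_param.
    print("Do something here!")


def modeling(cfg):
    # Train and validate the model described by cfg.model, using cfg.m_param.
    print("Do something here!")


def deployment(cfg):
    # Deploy the trained model to the serving cluster in cfg.cluster.
    print("Do something here!")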
This app has a two-level command architecture driven by a config group named c. The command-line arguments for executing each command are as follows (a dispatch sketch follows the tree):
├── preprocessing
│ ├── foo -> c=preprocessing c/preprocessing_sub=foo
│ └── bar -> c=preprocessing c/preprocessing_sub=bar
├── modeling
│ ├── foo -> c=modeling c/modeling_sub=foo
│ └── bar -> c=modeling c/modeling_sub=bar
├── deployment
│ ├── foo -> c=deployment c/deployment_sub=foo
│ └── bar -> c=deployment c/deployment_sub=bar
└── help -> c=help
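Under the hood, the entry point presumably composes the config and branches on the selected command. Below is a minimal dispatch sketch, assuming each command config under c sets a name key and each c/<step>_sub config sets a sub key; those two key names are my assumption, not taken from the repo.

import hydra


def run_step(name, sub, cfg):
    # Each (command, subcommand) pair dumps the composed config and runs a
    # placeholder, as in the example runs below.
    print(f"========== Run {name}'s '{sub}' subcommand ==========")
    print(cfg.pretty())
    print("Do something here!")


@hydra.main(config_path="conf/config.yaml")  # Hydra 0.11-style entry point
def main(cfg):
    if cfg.c.name == "help":  # assumed key; c=help lands here
        print("Usage: when_ml_pipeline_meets_hydra c=<command> ...")
        return
    run_step(cfg.c.name, cfg.c.sub, cfg)


if __name__ == "__main__":
    main()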
Here are the configurations prepared for this app, grouped by step (preprocessing, modeling, deployment). The command-line arguments for selecting each configuration are as follows (an example config file follows the tree):
├── preprocessing
│ ├── dataset
│ │ ├── dataset_1.yaml -> preprocessing/dataset=dataset_1
│ │ └── dataset_2.yaml -> preprocessing/dataset=dataset_2
│ └── param
│ ├── param_1.yaml -> preprocessing/param=param_1
│ └── param_2.yaml -> preprocessing/param=param_2
├── modeling
│ ├── model
│ │ ├── model_1.yaml -> modeling/model=model_1
│ │ └── model_2.yaml -> modeling/model=model_2
│ └── param
│ ├── param_1.yaml -> modeling/param=param_1
│ └── param_2.yaml -> modeling/param=param_2
└── deployment
└── cluster
├── cluster_1.yaml -> deployment/cluster=cluster_1
└── cluster_2.yaml -> deployment/cluster=cluster_2
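As a concrete example, judging from the composed configs printed in the runs below, preprocessing/dataset/dataset_1.yaml presumably contains something like this (a reconstruction, not a copy of the repo's file):

# preprocessing/dataset/dataset_1.yaml (reconstructed; the real file may differ)
dataset:
  name: dataset_1
  path: /path/of/dataset/1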
# Create a new Anaconda environment if needed.
$ conda create --name when_ml_pipeline_meets_hydra python=3.6 -y
$ conda activate when_ml_pipeline_meets_hydra
# Clone this repo.
$ git clone https://github.com/withsmilo/When-ML-pipeline-meets-Hydra.git
$ cd When-ML-pipeline-meets-Hydra
# Install this app.
$ python setup.py develop
$ when_ml_pipeline_meets_hydra --help
Let's construct a new ML pipeline dynamically, using all the *_1.yaml configurations and executing the same foo subcommand at each step. The command you need is simple and structured.
$ when_ml_pipeline_meets_hydra \
preprocessing/dataset=dataset_1 \
preprocessing/param=param_1 \
modeling/model=model_1 \
modeling/param=param_1 \
deployment/cluster=cluster_1 \
c/preprocessing_sub=foo \
c/modeling_sub=foo \
c/deployment_sub=foo \
c=preprocessing,modeling,deployment \
--multirun
[2019-10-13 22:12:22,032] - Launching 3 jobs locally
[2019-10-13 22:12:22,032] - Sweep output dir : .multirun/2019-10-13
[2019-10-13 22:12:22,032] - #0 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_1 deployment/cluster=cluster_1 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=preprocessing
========== Run preprocessing's 'foo' subcommand ==========
dataset:
  name: dataset_1
  path: /path/of/dataset/1
p_param:
  key_1_1: value_1_1
  key_1_2: value_1_2
  name: param_1
  output_path: /path/of/output/path/1
Do something here!
[2019-10-13 22:12:22,175] - #1 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_1 deployment/cluster=cluster_1 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=modeling
========== Run modeling's 'foo' subcommand ==========
model:
  input_path: /path/of/input/path/1
  name: model_1
  output_path: /path/of/output/path/1
m_param:
  hyperparam_key_1_1: hyperparam_value_1_1
  hyperparam_key_1_2: hyperparam_value_1_2
  name: param_1
Do something here!
[2019-10-13 22:12:22,314] - #2 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_1 deployment/cluster=cluster_1 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=deployment
========== Run deployment's 'foo' subcommand ==========
cluster:
  id: user_1
  name: cluster_1
  pw: pw_1
  url: https://cluster/1/url
Do something here!
After that, if you'd like to deploy a model that changes only the hyperparameter settings to another serving cluster, simply switch modeling/param to param_2 and deployment/cluster to cluster_2 before executing your command. That's it!
$ when_ml_pipeline_meets_hydra \
preprocessing/dataset=dataset_1 \
preprocessing/param=param_1 \
modeling/model=model_1 \
modeling/param=param_2 \
deployment/cluster=cluster_2 \
c/preprocessing_sub=foo \
c/modeling_sub=foo \
c/deployment_sub=foo \
c=preprocessing,modeling,deployment \
--multirun
[2019-10-13 22:13:13,898] - Launching 3 jobs locally
[2019-10-13 22:13:13,898] - Sweep output dir : .multirun/2019-10-13
[2019-10-13 22:13:13,898] - #0 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=preprocessing
========== Run preprocessing's 'foo' subcommand ==========
dataset:
  name: dataset_1
  path: /path/of/dataset/1
p_param:
  key_1_1: value_1_1
  key_1_2: value_1_2
  name: param_1
  output_path: /path/of/output/path/1
Do something here!
[2019-10-13 22:13:14,040] - #1 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=modeling
========== Run modeling's 'foo' subcommand ==========
model:
  input_path: /path/of/input/path/1
  name: model_1
  output_path: /path/of/output/path/1
m_param:
  hyperparam_key_2_1: hyperparam_value_2_1
  hyperparam_key_2_2: hyperparam_value_2_2
  name: param_2
Do something here!
[2019-10-13 22:13:14,179] - #2 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=deployment
========== Run deployment's 'foo' subcommand ==========
cluster:
  id: user_2
  name: cluster_2
  pw: pw_3 # For testing purposes, assume that this value is wrong
  url: https://cluster/2/url
Do something here!
Oops! You found a wrong configuration value (pw: pw_3) and want to fix it quickly. To do this, you only need to add cluster.pw=pw_2 to your command line.
$ when_ml_pipeline_meets_hydra \
preprocessing/dataset=dataset_1 \
preprocessing/param=param_1 \
modeling/model=model_1 \
modeling/param=param_2 \
deployment/cluster=cluster_2 \
cluster.pw=pw_2 \
c/preprocessing_sub=foo \
c/modeling_sub=foo \
c/deployment_sub=foo \
c=preprocessing,modeling,deployment \
--multirun
[2019-10-13 22:13:43,246] - Launching 3 jobs locally
[2019-10-13 22:13:43,246] - Sweep output dir : .multirun/2019-10-13
[2019-10-13 22:13:43,246] - #0 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=preprocessing cluster.pw=pw_2
========== Run preprocessing's 'foo' subcommand ==========
dataset:
  name: dataset_1
  path: /path/of/dataset/1
p_param:
  key_1_1: value_1_1
  key_1_2: value_1_2
  name: param_1
  output_path: /path/of/output/path/1
Do something here!
[2019-10-13 22:13:43,391] - #1 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=modeling cluster.pw=pw_2
========== Run modeling's 'foo' subcommand ==========
model:
  input_path: /path/of/input/path/1
  name: model_1
  output_path: /path/of/output/path/1
m_param:
  hyperparam_key_2_1: hyperparam_value_2_1
  hyperparam_key_2_2: hyperparam_value_2_2
  name: param_2
Do something here!
[2019-10-13 22:13:43,531] - #2 : preprocessing/dataset=dataset_1 preprocessing/param=param_1 modeling/model=model_1 modeling/param=param_2 deployment/cluster=cluster_2 c/preprocessing_sub=foo c/modeling_sub=foo c/deployment_sub=foo c=deployment cluster.pw=pw_2
========== Run deployment's 'foo' subcommand ==========
cluster:
  id: user_2
  name: cluster_2
  pw: pw_2
  url: https://cluster/2/url
Do something here!
Beyond this scenario, you can think of various other use cases.
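For instance, Hydra's multirun accepts a comma-separated list of options for any config group, so a single sweep could (hypothetically) run the modeling step with both parameter sets:

$ when_ml_pipeline_meets_hydra \
modeling/model=model_1 \
modeling/param=param_1,param_2 \
c/modeling_sub=foo \
c=modeling \
--multirun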
This project has been set up using PyScaffold 3.2.2. For details and usage information on PyScaffold see https://pyscaffold.org/.
This app is licensed under the MIT License.