Description
Hi, I have a setup where I use a single pipeline (with several stages) to train multiple models that are almost identical but use different training data and parameters.
I currently keep a copy of the dvc.yaml pipeline in a folder together with the respective params.yaml file used for each model. It looks more or less like this:
```yaml
stages:
  train_test_split:
    wdir: ../../../..
    cmd: >-
      python modules/regression/train_test_split.py
      --params=${paths.params_file}
    deps: ...
    outs: ...
    params:
      - ${paths.params_file}: # needs to be set due to a different working directory
          - paths.data_all
          - train_test_split
  assemble_model: ...
  optimize_hyperparams: ...
  fit_model: ...
  evaluate: ...
```
This works (I then always run `dvc repro -P`), but I have to copy the pipeline file, which makes versioning difficult. The only part that cannot be templated (AFAIK) is the path to the default params file.
I would love to have a single dvc.yaml file in the root folder of my project that can be run with several different params.yaml files from several locations. Kind of like `foreach ... do`, but at the level of the entire pipeline.
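To make the idea concrete, here is a sketch of what that could look like. This is purely hypothetical syntax that DVC does not support today; the params file paths are made up for illustration:

```yaml
# HYPOTHETICAL syntax, not valid dvc.yaml today:
# instantiate the entire pipeline once per params file,
# each instance getting its own lock file.
foreach:
  - models/model_a/params.yaml
  - models/model_b/params.yaml
do:
  stages:
    train_test_split: ...
    assemble_model: ...
    optimize_hyperparams: ...
    fit_model: ...
    evaluate: ...
```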
Also, I believe I have to explicitly add the path to the params file under the `params` keyword when running the stage from a different working directory... Not sure if that is a bug or a feature :-)
Thanks a lot!
P.S.: I tried a similar setup by templating all the stages, but there are limitations in how templating and foreach work right now, and I also feel the approach above would be more elegant. The pipelines and the overall architecture are the same; what differs are the training data and (some) parameters. So having an option like "for each params file in a list, reproduce a separate instance of the pipeline" would make a lot of sense to me (it would then make sense to have separate lock files as well).
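For reference, this is roughly the stage-level workaround I mean: DVC's existing `foreach`/`do` with `${item}` interpolation applied to each stage individually. The model names and paths here are just placeholders for my layout:

```yaml
# Stage-level foreach as DVC supports it today: one stage entry is
# expanded into one stage per list item, via ${item}.
stages:
  train_test_split:
    foreach:
      - model_a
      - model_b
    do:
      cmd: >-
        python modules/regression/train_test_split.py
        --params=models/${item}/params.yaml
      deps: ...
      outs: ...
```

This works per stage, but it has to be repeated for every stage in the pipeline, and everything still shares a single dvc.lock, which is what makes a pipeline-level variant attractive.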