Hydra is a Python tool to manage complex configurations in your data science projects.
- Clone this repository:
git clone https://github.com/khuyentran1401/hydra_demo.git
- Install Poetry
- Set up the environment:
make activate
make setup
Folders shown in the video:
Imagine your YAML configuration file looks like this:
process:
keep_columns:
- Income
- Recency
- NumWebVisitsMonth
- Complain
- age
- total_purchases
- enrollment_years
- family_size
remove_outliers_threshold:
age: 90
Income: 600000
To access the list under process.keep_columns
in the configuration file, simple add the @hydra.main
decorator to the function that uses the configuration:
import hydra
from omegaconf import DictConfig, OmegaConf
@hydra.main(config_path="../config", config_name="main")
def process_data(config: DictConfig):
print(config.process.keep_columns)
process_data()
Output:
['Income', 'Recency', 'NumWebVisitsMonth', 'Complain', 'age', 'total_purchases', 'enrollment_years', 'family_size']
Folders shown in the video:
Imagine the structure of your config
directory looks like this:
config
├── main.yaml
└── process
├── process_1.yaml
├── process_2.yaml
├── process_3.yaml
└── process_4.yaml
Each file has different values for the same parameters. You can set the parameters in the file process_2.yaml
as default by adding the following to main.yaml
:
defaults:
- process: process_2
- _self_
Now the parameters in main.yaml
are merged with the parameters in process_2.yaml
.
Running the file print_config.py
:
python print_config.py
should print:
# From process_2.yaml
process:
keep_columns:
- Income
- Recency
- NumWebVisitsMonth
- Complain
- age
- total_purchases
- enrollment_years
- family_size
remove_outliers_threshold:
age: 90
Income: 600000
family_size:
Married: 2
Together: 2
Absurd: 1
Widow: 1
YOLO: 1
Divorced: 1
Single: 1
Alone: 1
# From main.yaml
raw_data:
path: data/raw/marketing_campaign.csv
intermediate:
dir: data/intermediate
name: scale_features.csv
path: ${intermediate.dir}/${intermediate.name}
flow: all
image:
kmeans: image/elbow.png
clusters: image/cluster.png
You can also override the default parameters on the command line. For example, to replace process_2
with process_1
, run the following:
python print_config.py process=process_1
The output should be the combination of all parameters in main.yaml
and in process_1.yaml
:
# From process_1.yaml
process:
keep_columns:
- Income
- Recency
- NumWebVisitsMonth
- AcceptedCmp3
- AcceptedCmp4
- AcceptedCmp5
- AcceptedCmp1
- AcceptedCmp2
- Complain
- Response
- age
- total_purchases
- enrollment_years
- family_size
remove_outliers_threshold:
age: 90
Income: 600000
family_size:
Married: 2
Together: 2
Absurd: 1
Widow: 1
YOLO: 1
Divorced: 1
Single: 1
Alone: 1
# From main.yaml
raw_data:
path: data/raw/marketing_campaign.csv
intermediate:
dir: data/intermediate
name: scale_features.csv
path: ${intermediate.dir}/${intermediate.name}
flow: all
image:
kmeans: image/elbow.png
clusters: image/cluster.png