Demo for Hydra

About Hydra

Hydra is a Python tool to manage complex configurations in your data science projects.

How to Run the Project

  1. Clone this repository:
git clone https://github.com/khuyentran1401/hydra_demo.git
  2. Install Poetry
  3. Set up the environment:
make activate
make setup

Introduction to Hydra

Folders

Folders shown in the video:

Short Summary

Imagine your YAML configuration file looks like this:

process:
  keep_columns:
      - Income
      - Recency
      - NumWebVisitsMonth
      - Complain
      - age
      - total_purchases
      - enrollment_years
      - family_size

  remove_outliers_threshold:
    age: 90
    Income: 600000

To access the list under process.keep_columns in the configuration file, simply add the @hydra.main decorator to the function that uses the configuration:

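import hydra
from omegaconf import DictConfig


# config_path is resolved relative to this file; config_name omits the .yaml extension.
@hydra.main(config_path="../config", config_name="main")
def process_data(config: DictConfig):
    # Nested keys are available through attribute access.
    print(config.process.keep_columns)


if __name__ == "__main__":
    process_data()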

Output:

['Income', 'Recency', 'NumWebVisitsMonth', 'Complain', 'age', 'total_purchases', 'enrollment_years', 'family_size']

Group Configuration Files and Override the Parameters on the Command Line

Folders

Folders shown in the video:

Short Summary

Imagine the structure of your config directory looks like this:

config
├── main.yaml
└── process
    ├── process_1.yaml
    ├── process_2.yaml
    ├── process_3.yaml
    └── process_4.yaml

Each file has different values for the same parameters. To use the parameters in process_2.yaml by default, add the following to main.yaml:

defaults:
  - process: process_2
  - _self_

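Now the parameters in main.yaml are merged with the parameters in process_2.yaml. Because _self_ comes last in the defaults list, values in main.yaml take precedence if the same key appears in both.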

Running the file print_config.py:

python print_config.py

should print:

# From process_2.yaml
process:
  keep_columns:
  - Income
  - Recency
  - NumWebVisitsMonth
  - Complain
  - age
  - total_purchases
  - enrollment_years
  - family_size
  remove_outliers_threshold:
    age: 90
    Income: 600000
  family_size:
    Married: 2
    Together: 2
    Absurd: 1
    Widow: 1
    YOLO: 1
    Divorced: 1
    Single: 1
    Alone: 1

# From main.yaml
raw_data:
  path: data/raw/marketing_campaign.csv
intermediate:
  dir: data/intermediate
  name: scale_features.csv
  path: ${intermediate.dir}/${intermediate.name}
flow: all
image:
  kmeans: image/elbow.png
  clusters: image/cluster.png

You can also override the default parameters on the command line. For example, to replace process_2 with process_1, run the following:

python print_config.py process=process_1

The output should be the combination of all parameters in main.yaml and in process_1.yaml:

# From process_1.yaml
process:
  keep_columns:
  - Income
  - Recency
  - NumWebVisitsMonth
  - AcceptedCmp3
  - AcceptedCmp4
  - AcceptedCmp5
  - AcceptedCmp1
  - AcceptedCmp2
  - Complain
  - Response
  - age
  - total_purchases
  - enrollment_years
  - family_size
  remove_outliers_threshold:
    age: 90
    Income: 600000
  family_size:
    Married: 2
    Together: 2
    Absurd: 1
    Widow: 1
    YOLO: 1
    Divorced: 1
    Single: 1
    Alone: 1
    
# From main.yaml
raw_data:
  path: data/raw/marketing_campaign.csv
intermediate:
  dir: data/intermediate
  name: scale_features.csv
  path: ${intermediate.dir}/${intermediate.name}
flow: all
image:
  kmeans: image/elbow.png
  clusters: image/cluster.png
