"This DALL-E generated image, within Japan, Kedro orchestrates the rhythm of renewable insights amidst the choreography of data and predictions."
In this project, I challenged myself to transform notebook-based model training code into a Kedro pipeline. The goal is to create modular, easy-to-train pipelines that follow MLOps best practices and simplify the deployment of ML models. With Kedro, you can execute just one command to train your models and obtain your pickle files, performance figures, and more (yes, just ONE command!). Parameters can be easily adjusted in a YAML file, allowing different steps to be added and various models to be tested with ease. Additionally, Kedro provides visualization and logging features to keep you informed about everything. You can create all types of pipelines, not only for machine learning but for any data-driven workflow.
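To make this concrete, here is a minimal sketch of how a node picks its settings up from a parameters YAML file. The function, dataset, and parameter names below are illustrative, not the project's actual code:

```python
# Illustrative only -- not the project's actual code.
# A conf/base/parameters_*.yml file might contain, for example:
#   data_processing:
#     target_column: "consumption_mw"
#     columns_to_drop: ["notes", "station_id"]
import pandas as pd


def clean_energy_data(raw: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Kedro injects the 'params:data_processing' entry from the YAML file as a dict."""
    cleaned = raw.drop(columns=params["columns_to_drop"], errors="ignore")
    return cleaned.dropna(subset=[params["target_column"]])


# Changing a parameter means editing the YAML file and nothing else;
# kedro run remains the single command that executes the entire project.
```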
For an in-depth understanding of Kedro, consider exploring the official documentation at Kedro's Documentation.
Additionally, I integrated a CI pipeline on GitHub Actions for code quality checks and code functionality assurance, enhancing reliability and maintainability.
The objectives were:
- Transition to Production: Convert code from Jupyter Notebooks to a production-ready and easily deployable format.
- Model Integration: Facilitate the straightforward addition of models, along with their performance metrics, into the pipeline.
- Workflow Optimization: Utilize the Kedro framework to establish reproducible, modular, and scalable data workflows.
- CI/CD Automation: Implement an automated CI/CD pipeline using GitHub Actions to ensure continuous testing and code quality management.
- Dockerization: Develop a Dockerized pipeline for ease of use, incorporating Docker volumes for persistent data management.
Before I started making the Kedro pipelines, I tried out my ideas in Jupyter notebooks. Check the notebooks folder to see how I did it:
- EDA & Data Preparation - Energy_Forecasting.ipynb: Offers insights into how I analyzed the data and prepared it for modeling, including data preparation and cleaning processes.
- Machine Learning - Energy_Forecasting.ipynb: Documents how I evaluated and trained various Machine Learning models, testing the Random Forest, XGBoost, and LightGBM models.
Within the src directory lies the essence of the project, with each component neatly arranged as a Kedro pipeline:
- Data Processing: Standardizes and cleans data in ZIP and CSV formats, preparing it for analysis.
- Feature Engineering: Creates new features.
- Train-Test Split Pipeline: A dedicated pipeline to split the data into training and test sets.
- Model Training + Model Evaluation: Separate, modular, and independent pipelines for XGBoost, LightGBM, and Random Forest, each capable of training in async mode (a minimal sketch follows this list).
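As a flavour of how such a model pipeline stays modular, here is a hedged sketch of a training pipeline wired with Kedro's node and pipeline helpers. The dataset, parameter, and function names are illustrative and do not mirror the repository's exact code:

```python
# Sketch of an XGBoost training pipeline kept fully self-contained.
# The LightGBM and Random Forest pipelines can be wired the same way against the
# same train/test datasets, so they run independently of one another
# (or side by side, e.g. with kedro run --runner=ParallelRunner).
import pandas as pd
from kedro.pipeline import Pipeline, node, pipeline
from xgboost import XGBRegressor


def train_xgboost(X_train: pd.DataFrame, y_train: pd.Series, params: dict) -> XGBRegressor:
    """Fit an XGBoost regressor with hyperparameters taken from the pipeline's YAML file."""
    model = XGBRegressor(**params["hyperparameters"])
    model.fit(X_train, y_train)
    return model  # persisted as a pickle by the Data Catalog (data/05_model_output)


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=train_xgboost,
                inputs=["X_train", "y_train", "params:xgboost_training"],
                outputs="xgboost_model",
                name="train_xgboost_node",
            ),
        ]
    )
```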
The Kedro Viz tool provides an interactive canvas to visualize and understand the pipeline structure. It illustrates data flow, dependencies, and the orchestration of nodes and pipelines. Here is the visualization of this project:
With this tool, the understanding of data progression, outputs, and interactivity is greatly simplified. Kedro Viz allows users to inspect samples of data, view parameters, analyze figures, and much more, enriching the user experience with enhanced transparency and interactivity.
Logging is integral to understanding and troubleshooting pipelines. This project leverages Kedro's logging capabilities to provide real-time insights into pipeline execution, highlighting progress, warnings, and errors. This GIF demonstrates the use of the kedro run or make run command, showcasing the logging output in action:
Notice how the nodes are executed sequentially, and observe the RMSE outputs during validation for the XGBoost model. Logging in Kedro is highly customizable, allowing for tailored monitoring that meets the user's specific needs.
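Because Kedro routes the standard Python logging module through its own configuration, nodes can surface metrics like this RMSE with an ordinary logger call. A minimal sketch, with illustrative names rather than the project's actual evaluation node:

```python
import logging

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

logger = logging.getLogger(__name__)  # picked up by Kedro's logging configuration


def evaluate_model(model, X_test: pd.DataFrame, y_test: pd.Series) -> dict:
    """Compute the validation RMSE and surface it in the pipeline logs."""
    predictions = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
    logger.info("Validation RMSE: %.4f", rmse)
    return {"rmse": rmse}  # can also be written to data/04_reporting via the catalog
```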
A simplified overview of the Kedro project's structure:
Kedro-Energy-Forecasting/
│
├── conf/                                # Configuration files for Kedro project
│   ├── base/
│   │   ├── catalog.yml                                  # Data catalog with dataset definitions
│   │   ├── parameters_data_processing_pipeline.yml      # Parameters for data processing
│   │   ├── parameters_feature_engineering_pipeline.yml  # Parameters for feature engineering
│   │   ├── parameters_random_forest_pipeline.yml        # Parameters for Random Forest pipeline
│   │   ├── parameters_lightgbm_training_pipeline.yml    # Parameters for LightGBM pipeline
│   │   ├── parameters_train_test_split_pipeline.yml     # Parameters for train-test split
│   │   └── parameters_xgboost_training_pipeline.yml     # Parameters for XGBoost training
│   └── local/
│
├── data/
│   ├── 01_raw/              # Raw, unprocessed datasets
│   ├── 02_processed/        # Cleaned and processed data ready for analysis
│   ├── 03_training_data/    # Train/Test Datasets used for model training
│   ├── 04_reporting/        # Figures and Results after running the pipelines
│   └── 05_model_output/     # Trained pickle models
│
├── src/
│   ├── pipelines/
│   │   ├── data_processing_pipeline/      # Data processing pipeline
│   │   ├── feature_engineering_pipeline/  # Feature engineering pipeline
│   │   ├── random_forest_pipeline/        # Random Forest pipeline
│   │   ├── lightgbm_training_pipeline/    # LightGBM pipeline
│   │   ├── train_test_split_pipeline/     # Train-test split pipeline
│   │   └── xgboost_training_pipeline/     # XGBoost training pipeline
│   └── energy_forecasting_model/          # Main module for the forecasting model
│
├── .gitignore               # Untracked files to ignore
├── Makefile                 # Set of tasks to be executed
├── Dockerfile               # Instructions for building a Docker image
├── .dockerignore            # Files and directories to ignore in Docker builds
├── README.md                # Project documentation and setup guide
└── requirements.txt         # Project dependencies
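For context on how these pipeline packages come together, the project package typically exposes them through a pipeline registry. The snippet below is a minimal sketch assuming the standard Kedro project template, not necessarily the exact file in this repository:

```python
# src/energy_forecasting_model/pipeline_registry.py (sketch)
from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline


def register_pipelines() -> dict[str, Pipeline]:
    """Discover each pipeline package's create_pipeline() and expose it by name."""
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())  # what a plain kedro run executes
    return pipelines
```

With the registry in place, a single pipeline can also be run on its own, for example kedro run --pipeline xgboost_training_pipeline.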
First, clone the repository to download a copy of the code onto your local machine. Before diving into transforming raw data into a trained, pickled Machine Learning model, please note:
Before you begin, please follow these preliminary steps to ensure a smooth setup:
- Clear Existing Data Directories: If you're planning to run the pipeline, I recommend removing these directories if they exist: data/02_processed, data/03_training_data, data/04_reporting, and data/05_model_output (leave only data/01_raw in the data folder). They will be recreated or updated once the pipeline runs. These directories are tracked in version control to give you a glimpse of the expected outputs.
- Makefile Usage: To use the Makefile for running commands, you must have make installed on your system. Follow the instructions in the installation guide to set it up.
Here is an example of the available targets (you type make in the command line):
- Running the Kedro Pipeline:
  - For production environments, initialize your setup by executing make prep-doc or using pip install -r docker-requirements.txt to install the production dependencies.
  - For a development environment, where you may want to use Kedro Viz, work with Jupyter notebooks, or test everything thoroughly, run make prep-dev or pip install -r dev-requirements.txt to install all the development dependencies.
Adopt this method if you prefer a traditional Python development environment setup using Conda or venv.
- Set Up the Environment: Initialize a virtual environment with Conda or venv to isolate and manage your project's dependencies.
- Install Dependencies: Inside your virtual environment, execute pip install -r dev-requirements.txt to install the necessary Python libraries.
- Run the Kedro Pipeline: Trigger the pipeline processing by running make run or directly with kedro run. This step orchestrates your data transformation and modeling.
- Review the Results: Inspect the 04_reporting and 05_model_output directories to assess the performance and outcomes of your models.
- (Optional) Explore with Kedro Viz: To visually explore your pipeline's structure and data flows, initiate Kedro Viz using make viz or kedro viz run.
Prefer this method for a containerized approach, ensuring a consistent development environment across different machines. Ensure Docker is operational on your system before you begin.
- Build the Docker Image: Construct your Docker image with make build or kedro docker build. This command leverages dev-requirements.txt for environment setup. For advanced configurations, see the Kedro Docker Plugin Documentation.
- Run the Pipeline Inside a Container: Execute the pipeline within Docker using make dockerun or kedro docker run. Kedro-Docker handles volume mappings to ensure seamless data integration between your local setup and the Docker environment.
- Access the Results: Upon completion, the 04_reporting and 05_model_output directories will contain your model's reports and trained files, ready for review.
For additional assistance or to explore more command options, refer to the Makefile or consult kedro --help.
With our Kedro Pipeline now capable of efficiently transforming raw data into trained models, and the introduction of a Dockerized environment for our code, the next phase involves advancing beyond the current repository scope to orchestrate data updates automatically using tools like Databricks, Airflow, or Azure Data Factory. This progression allows for the seamless integration of fresh data into our models.
Moreover, implementing experiment tracking and versioning with MLflow or leveraging Kedro Viz's versioning capabilities will significantly enhance our project's management and reproducibility. These steps are pivotal for maintaining a clean machine learning workflow that not only achieves our goal of simplifying model training processes but also ensures our system remains dynamic and scalable with minimal effort.
You can connect with me on LinkedIn or check out my GitHub repositories: