End-to-End Machine Learning Pipeline with DVC and MLflow

This project demonstrates how to build a robust, end-to-end machine learning pipeline using DVC (Data Version Control) for data and model versioning, and MLflow for experiment tracking. The pipeline trains a Random Forest Classifier on the Pima Indians Diabetes Dataset, with stages for data preprocessing, model training, and evaluation.

Figure: Data flow of the end-to-end machine learning pipeline. The diagram illustrates the key stages (preprocessing, training, and evaluation) and how data moves through the pipeline.


Key Features

1. Data Version Control (DVC)

  • Tracks and versions datasets, models, and pipeline stages to ensure reproducibility.
  • Automatically re-executes pipeline stages when dependencies change (e.g., data, scripts, parameters); see the dvc.yaml sketch after this list.
  • Supports remote data storage (e.g., DagsHub, S3) for managing large datasets and models.
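
With DVC, a pipeline like this is defined in a `dvc.yaml` file. A minimal sketch of what one for these three stages might look like; the stage names, `src/` script paths, and parameter keys are illustrative assumptions, not copied from the repository:

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - data/raw/data.csv
      - src/preprocess.py
    outs:
      - data/processed/data.csv
  train:
    cmd: python src/train.py
    deps:
      - data/processed/data.csv
      - src/train.py
    params:                 # hypothetical keys read from params.yaml
      - model.n_estimators
      - model.max_depth
    outs:
      - models/random_forest.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - data/processed/data.csv
      - models/random_forest.pkl
      - src/evaluate.py
```

With this in place, `dvc repro` re-runs only the stages whose dependencies have changed.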

2. Experiment Tracking with MLflow

  • Logs model hyperparameters (e.g., n_estimators, max_depth) and performance metrics (e.g., accuracy).
  • Tracks experiment metrics, parameters, and artifacts so different runs can be compared and tuned; a logging sketch follows this list.
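
A minimal sketch of the MLflow fluent tracking calls involved; the parameter and metric values shown are placeholders, not results from this project:

```python
import mlflow

# Placeholder hyperparameters; the real values come from the training script.
params = {"n_estimators": 100, "max_depth": 5}

with mlflow.start_run():
    mlflow.log_params(params)                         # hyperparameters for this run
    mlflow.log_metric("accuracy", 0.78)               # placeholder evaluation metric
    mlflow.log_artifact("models/random_forest.pkl")   # trained model file as an artifact
```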

Pipeline Stages

1. Preprocessing

  • The preprocess.py script reads the raw dataset (data/raw/data.csv), renames columns, and writes the processed data to data/processed/data.csv (sketched after this list).
  • Ensures consistent data preparation across pipeline executions.
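
A minimal sketch of what preprocess.py might look like, assuming the standard Pima Indians column names; the repository's actual renaming scheme may differ:

```python
import os

import pandas as pd

# Standard Pima Indians Diabetes column names; an assumption, since the
# repository's renaming scheme isn't shown here.
COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def preprocess(in_path="data/raw/data.csv", out_path="data/processed/data.csv"):
    df = pd.read_csv(in_path)
    df.columns = COLUMNS  # rename columns consistently across runs
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    df.to_csv(out_path, index=False)


if __name__ == "__main__":
    preprocess()
```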

2. Training

  • The train.py script trains a Random Forest Classifier on the preprocessed data (see the sketch after this list).
  • Saves the trained model as models/random_forest.pkl.
  • Logs hyperparameters and model artifacts in MLflow for tracking and comparison.
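
A minimal sketch of the training step, assuming "Outcome" is the label column and using illustrative hyperparameter values; the MLflow logging calls shown earlier would sit alongside this:

```python
import os
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def train(data_path="data/processed/data.csv", model_path="models/random_forest.pkl"):
    df = pd.read_csv(data_path)
    X = df.drop(columns=["Outcome"])  # "Outcome" as the label column is an assumption
    y = df["Outcome"]

    # Illustrative hyperparameters; in the real pipeline these are logged to MLflow.
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
    model.fit(X, y)

    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)


if __name__ == "__main__":
    train()
```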

3. Evaluation

  • The evaluate.py script evaluates the trained model's performance on the dataset (sketched after this list).
  • Logs evaluation metrics (e.g., accuracy) in MLflow for performance tracking.
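
A minimal sketch of the evaluation step under the same assumptions (label column "Outcome", accuracy as the metric):

```python
import pickle

import mlflow
import pandas as pd
from sklearn.metrics import accuracy_score


def evaluate(data_path="data/processed/data.csv", model_path="models/random_forest.pkl"):
    df = pd.read_csv(data_path)
    X, y = df.drop(columns=["Outcome"]), df["Outcome"]

    with open(model_path, "rb") as f:
        model = pickle.load(f)

    acc = accuracy_score(y, model.predict(X))
    with mlflow.start_run():
        mlflow.log_metric("accuracy", acc)  # tracked in MLflow for comparison across runs
    return acc


if __name__ == "__main__":
    print(f"accuracy: {evaluate():.4f}")
```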

Technology Stack

  • Python: Core language for data processing, model training, and evaluation.
  • DVC: Versions data, models, and pipeline stages.
  • MLflow: Logs and tracks experiments, metrics, and artifacts.
  • Scikit-learn: Used to build and train the Random Forest Classifier.

This project demonstrates how to manage the entire lifecycle of a machine learning project by ensuring data, code, models, and experiments are all tracked, versioned, and reproducible.
