End-to-End Machine Learning Pipeline with DVC and MLflow

This project demonstrates how to build a robust, end-to-end machine learning pipeline using DVC (Data Version Control) for data and model versioning and MLflow for experiment tracking. The pipeline trains a Random Forest Classifier on the Pima Indians Diabetes Dataset and is split into three stages: data preprocessing, model training, and evaluation.

Figure: Data flow of the end-to-end machine learning pipeline. The diagram illustrates the key stages, including preprocessing, training, and evaluation, and how data moves through the pipeline.

Key Features

1. Data Version Control (DVC)

Tracks and versions datasets, models, and pipeline stages to ensure reproducibility.
Automatically re-executes pipeline stages when dependencies change (e.g., data, scripts, parameters).
Supports remote data storage (e.g., DagsHub, S3) for managing large datasets and models.
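
The re-execution behaviour described above rests on comparing content hashes of a stage's dependencies against the hashes recorded on the previous run. The sketch below illustrates that idea only; it is not DVC's implementation, and the function names and the JSON lock-file format are assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(deps: list[Path]) -> dict[str, str]:
    """Map each dependency path to the digest of its current contents."""
    return {str(path): file_digest(path) for path in deps}


def stage_is_stale(deps: list[Path], lock_file: Path) -> bool:
    """Report whether any dependency changed since the last recorded run."""
    if not lock_file.exists():
        return True  # the stage has never been run
    recorded = json.loads(lock_file.read_text())
    return snapshot(deps) != recorded


def record_run(deps: list[Path], lock_file: Path) -> None:
    """Store the current dependency digests after a successful run."""
    lock_file.write_text(json.dumps(snapshot(deps)))
```

A stage would be re-run only when stage_is_stale returns True, and record_run would be called once the stage finishes.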

2. Experiment Tracking with MLflow

Logs model hyperparameters (e.g., n_estimators, max_depth) and performance metrics (e.g., accuracy).
Records the parameters, metrics, and artifacts of every run so that runs can be compared and the best-performing configuration identified.
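
As a rough illustration of the logging described above, the helper below wraps the three MLflow calls involved: start_run, log_params, and log_metric. It is a minimal sketch, not the project's code. The function name log_run is hypothetical, and the mlflow module is passed in as an argument so the helper can be imported and exercised without MLflow installed.

```python
from typing import Any, Mapping


def log_run(tracker: Any, params: Mapping[str, Any], metrics: Mapping[str, float]) -> None:
    """Log one run's hyperparameters and metrics.

    `tracker` is expected to be the `mlflow` module, or any object that
    exposes start_run, log_params, and log_metric in the same shape.
    """
    # start_run() is used as a context manager so the run is closed
    # even if logging raises an exception.
    with tracker.start_run():
        tracker.log_params(dict(params))
        for name, value in metrics.items():
            tracker.log_metric(name, float(value))
```

With MLflow installed, a call might look like log_run(mlflow, {"n_estimators": 100, "max_depth": 5}, {"accuracy": acc}), where the hyperparameter values are placeholders rather than the project's settings.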

Pipeline Stages

1. Preprocessing

The preprocess.py script reads the raw dataset (data/raw/data.csv), renames columns, and outputs processed data to data/processed/data.csv.
Ensures consistent data preparation across pipeline executions.
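
The column-renaming step can be sketched as follows. This is an assumption-laden illustration rather than the contents of preprocess.py: it uses only the standard library (the actual script may well use pandas), and the function name rename_columns and its mapping argument are hypothetical.

```python
import csv
from pathlib import Path


def rename_columns(src: Path, dst: Path, mapping: dict[str, str]) -> None:
    """Copy a CSV from src to dst, renaming header fields found in mapping."""
    # Create the output directory (e.g. data/processed) if it is missing.
    dst.parent.mkdir(parents=True, exist_ok=True)
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader)  # first row holds the column names
        # Columns not listed in the mapping keep their original name.
        writer.writerow([mapping.get(name, name) for name in header])
        writer.writerows(reader)  # data rows pass through unchanged
```

Under these assumptions the stage would call rename_columns(Path("data/raw/data.csv"), Path("data/processed/data.csv"), mapping), with mapping holding whatever old-to-new column names the project uses.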

2. Training

The train.py script trains a Random Forest Classifier on the preprocessed data.
Saves the trained model as models/random_forest.pkl.
Logs hyperparameters and model artifacts in MLflow for tracking and comparison.

3. Evaluation

The evaluate.py script evaluates the trained model's performance on the dataset.
Logs evaluation metrics (e.g., accuracy) in MLflow for performance tracking.
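
For reference, accuracy is the fraction of predictions that match the true labels. The helper below is a plain-Python sketch of that formula; its name is chosen for this example, and scikit-learn's accuracy_score computes the same quantity.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    # Count positions where the predicted label equals the true label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

For example, accuracy([1, 0, 1, 1], [1, 0, 0, 1]) returns 0.75, since three of the four predictions are correct.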

Technology Stack

Python: Core language for data processing, model training, and evaluation.
DVC: Versions data, models, and pipeline stages.
MLflow: Logs and tracks experiments, metrics, and artifacts.
Scikit-learn: Used to build and train the Random Forest Classifier.

This project demonstrates how to manage the entire lifecycle of a machine learning project by ensuring data, code, models, and experiments are all tracked, versioned, and reproducible.

Repository Contents

.dvc
data
model
src
.dvcignore
.gitignore
README.md
dvc.yaml
params.yaml
requirement.txt
stage_file.md

DevManpreet5/end_end_ML_pipeline_dagshub
