This project demonstrates how to build a robust, end-to-end machine learning pipeline using DVC (Data Version Control) for data and model versioning and MLflow for experiment tracking. The pipeline is designed to train a Random Forest Classifier on the Pima Indians Diabetes Dataset, with stages for data preprocessing, model training, and evaluation.
Figure: Data flow of the end-to-end machine learning pipeline. The diagram illustrates the key stages, including preprocessing, training, and evaluation, and how data moves through the pipeline.
- Tracks and versions datasets, models, and pipeline stages to ensure reproducibility.
- Automatically re-executes pipeline stages when dependencies change (e.g., data, scripts, parameters).
- Supports remote data storage (e.g., DagsHub, S3) for managing large datasets and models.
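To illustrate the idea behind re-executing a stage only when its dependencies change, here is a minimal sketch that hashes each dependency file and compares the result with the hashes recorded after the last run. DVC records content hashes of stage dependencies in `dvc.lock`; this is not DVC's implementation. The function names and the JSON lock-file format are hypothetical, chosen only for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large data files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_is_stale(deps: list[Path], lock_file: Path) -> bool:
    """Return True if any dependency changed since hashes were last recorded."""
    if not lock_file.exists():
        return True  # The stage has never run.
    recorded = json.loads(lock_file.read_text())
    current = {str(p): file_hash(p) for p in deps}
    return current != recorded


def record_hashes(deps: list[Path], lock_file: Path) -> None:
    """Store the current dependency hashes after a successful stage run."""
    hashes = {str(p): file_hash(p) for p in deps}
    lock_file.write_text(json.dumps(hashes, indent=2))
```

A stage runner would call `stage_is_stale` before executing a script and `record_hashes` afterwards, so an unchanged dataset or script is skipped on the next run.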
- Logs model hyperparameters (e.g., `n_estimators`, `max_depth`) and performance metrics (e.g., accuracy).
- Tracks experiment metrics, parameters, and artifacts to compare different runs and optimize performance.
- The `preprocess.py` script reads the raw dataset (`data/raw/data.csv`), renames columns, and outputs processed data to `data/processed/data.csv`.
- Ensures consistent data preparation across pipeline executions.
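A preprocessing step of this shape might look like the sketch below. It is not the project's `preprocess.py`: the function signature is hypothetical, and the rename mapping is passed in by the caller because the actual column names depend on the raw file.

```python
from pathlib import Path

import pandas as pd


def preprocess(input_path: str, output_path: str, rename_map: dict[str, str]) -> pd.DataFrame:
    """Read the raw CSV, rename columns, and write the processed CSV."""
    df = pd.read_csv(input_path)
    # Columns missing from rename_map keep their original names.
    df = df.rename(columns=rename_map)

    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)  # e.g. create data/processed/
    df.to_csv(out, index=False)
    return df
```

For example, `preprocess("data/raw/data.csv", "data/processed/data.csv", {"Glucose": "glucose"})` would rename a `Glucose` column, assuming the raw file has one.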
- The `train.py` script trains a Random Forest Classifier on the preprocessed data.
- Saves the trained model as `models/random_forest.pkl`.
- Logs hyperparameters and model artifacts in MLflow for tracking and comparison.
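The training and saving steps can be sketched as follows. This covers only fitting and pickling the model, not the MLflow logging. The target column name `Outcome`, the default hyperparameter values, and the function signature are assumptions for illustration, not taken from the project's `train.py`.

```python
import pickle
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def train(
    data_path: str,
    model_path: str,
    target: str = "Outcome",  # assumed name of the label column
    n_estimators: int = 100,  # illustrative default
    max_depth: int = 5,       # illustrative default
    random_state: int = 42,
) -> RandomForestClassifier:
    """Fit a Random Forest on the processed CSV and pickle it to model_path."""
    df = pd.read_csv(data_path)
    X = df.drop(columns=[target])
    y = df[target]

    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=random_state,  # fixed seed for reproducible fits
    )
    model.fit(X, y)

    out = Path(model_path)
    out.parent.mkdir(parents=True, exist_ok=True)  # e.g. create models/
    with out.open("wb") as fh:
        pickle.dump(model, fh)
    return model
```

Inside an active MLflow run, the hyperparameters could then be recorded with `mlflow.log_params` and the pickle file attached with `mlflow.log_artifact`.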
- The `evaluate.py` script evaluates the trained model's performance on the dataset.
- Logs evaluation metrics (e.g., accuracy) in MLflow for performance tracking.
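As a rough sketch of the evaluation step, the function below loads a pickled model and computes accuracy on a CSV. The target column name `Outcome` and the function signature are assumptions; the project's `evaluate.py` may differ.

```python
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score


def evaluate(data_path: str, model_path: str, target: str = "Outcome") -> float:
    """Load a pickled model and return its accuracy on the given CSV."""
    df = pd.read_csv(data_path)
    X = df.drop(columns=[target])
    y = df[target]

    # Only unpickle model files you created yourself; pickle can run arbitrary code.
    with open(model_path, "rb") as fh:
        model = pickle.load(fh)

    return float(accuracy_score(y, model.predict(X)))
```

The returned value can be passed to `mlflow.log_metric("accuracy", value)` inside an active run. Accuracy measured on the same rows used for training tends to overstate real performance, so a held-out split gives a more honest figure.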
- Python: Core language for data processing, model training, and evaluation.
- DVC: Versions data, models, and pipeline stages.
- MLflow: Logs and tracks experiments, metrics, and artifacts.
- Scikit-learn: Used to build and train the Random Forest Classifier.
This project demonstrates how to manage the entire lifecycle of a machine learning project by ensuring data, code, models, and experiments are all tracked, versioned, and reproducible.