This repository demonstrates a comprehensive, production-grade machine learning workflow using DVC (Data Version Control) and MLflow. The pipeline is built around the Pima Indians Diabetes Dataset and a Random Forest Classifier, but is easily extensible to other datasets and models. The project emphasizes reproducibility, experiment tracking, and collaborative development—key principles of modern MLOps.
- Reproducibility: Ensure that every result can be traced and reproduced, regardless of environment or team member.
- Experimentation: Enable rapid, organized experimentation with clear tracking of parameters, metrics, and artifacts.
- Collaboration: Facilitate teamwork by versioning data, code, and models, and by supporting remote storage.
- Scalability: Provide a pipeline structure that can be extended to more complex workflows and larger datasets.
The pipeline is modular and consists of three main stages, orchestrated by DVC:
- Script: `src/preprocess.py`
- Input: `data/raw/data.csv`
- Output: `data/processed/data.csv`
- Description:
  - Loads the raw dataset.
  - Cleans and preprocesses the data (e.g., renaming columns, handling missing values, feature engineering).
  - Saves the processed dataset for downstream tasks.
- DVC stage example:

  ```bash
  dvc stage add -n preprocess \
      -d src/preprocess.py -d data/raw/data.csv \
      -o data/processed/data.csv \
      python src/preprocess.py
  ```
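A minimal sketch of what `src/preprocess.py` might contain. The column names follow the Kaggle Pima Indians Diabetes Dataset, where a literal 0 in several measurement columns encodes a missing value; adjust both the column list and the imputation strategy to your actual CSV.

```python
import pandas as pd

# Columns where a literal 0 encodes a missing measurement in the Pima data.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Strip header whitespace, treat impossible zeros as missing, impute medians."""
    df = df.rename(columns=str.strip)        # drop stray whitespace in headers
    for col in ZERO_AS_MISSING:
        if col in df.columns:
            s = df[col].mask(df[col] == 0)   # zeros become NaN (upcast to float)
            df[col] = s.fillna(s.median())   # median imputation
    return df

# Tiny inline example; the real script would read data/raw/data.csv
# and write data/processed/data.csv instead.
sample = pd.DataFrame({"Glucose": [148, 0, 183],
                       "BMI": [33.6, 26.6, 0.0],
                       "Outcome": [1, 0, 1]})
cleaned = clean(sample)
print(cleaned["Glucose"].tolist())  # the 0 is replaced by the median, 165.5
```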
- Script: `src/train.py`
- Input: `data/processed/data.csv`
- Output: `models/model.pkl`
- Description:
  - Loads the processed data.
  - Trains a Random Forest Classifier (hyperparameters configurable via `params.yaml`).
  - Performs hyperparameter tuning using GridSearchCV.
  - Saves the trained model.
  - Logs parameters, metrics, and model artifacts to MLflow (hosted on DagsHub).
  - Logs the confusion matrix and classification report as artifacts.
  - Uses DagsHub as the remote MLflow tracking server for centralized experiment management.
- DVC stage example:

  ```bash
  dvc stage add -n train \
      -d src/train.py -d data/processed/data.csv \
      -o models/model.pkl \
      python src/train.py
  ```
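The tuning step inside `src/train.py` could look like the following condensed sketch. The parameter grid here is an assumption (in the real script it would come from `params.yaml`), synthetic data stands in for the processed CSV, and the MLflow logging calls are indicated as comments:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in for data/processed/data.csv; the real script loads the CSV
# and reads the grid from params.yaml.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}  # assumed grid
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X_tr, y_tr)

best = search.best_estimator_
accuracy = best.score(X_te, y_te)
# With MLflow configured, logging would happen here, e.g.:
#   mlflow.log_params(search.best_params_)
#   mlflow.log_metric("accuracy", accuracy)
#   mlflow.sklearn.log_model(best, "model")
print(search.best_params_, round(accuracy, 3))
```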
- Script: `src/evaluate.py`
- Inputs: `models/model.pkl`, `data/processed/data.csv`
- Description:
  - Loads the trained model and processed data.
  - Evaluates model performance (e.g., accuracy, confusion matrix).
  - Logs evaluation metrics to MLflow.
- DVC stage example:

  ```bash
  dvc stage add -n evaluate \
      -d src/evaluate.py -d models/model.pkl -d data/processed/data.csv \
      python src/evaluate.py
  ```
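A sketch of the metrics computed in `src/evaluate.py`, using toy labels in place of real model predictions (the actual script would first load `models/model.pkl` and predict on the processed data):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy labels standing in for y_test and model.predict(X_test).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)            # 6 of 8 correct -> 0.75
cm = confusion_matrix(y_true, y_pred)           # rows: true class, cols: predicted
report = classification_report(y_true, y_pred)  # per-class precision/recall/F1
# With MLflow configured: mlflow.log_metric("accuracy", acc), etc.
print(acc)
print(cm)
```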
```
machine_learning_pipeline/
│
├── data/
│   ├── raw/            # Original datasets
│   └── processed/      # Cleaned and feature-engineered datasets
├── models/             # Trained model artifacts
├── src/                # Source code for each pipeline stage
│   ├── preprocess.py
│   ├── train.py
│   └── evaluate.py
├── dvc.yaml            # DVC pipeline definition
├── dvc.lock            # DVC pipeline lock file
├── params.yaml         # Pipeline and model hyperparameters
├── requirements.txt    # Python dependencies
└── Readme.md
```
Clone the repository and install the dependencies:

```bash
git clone <repository-url>
cd machine_learning_pipeline
pip install -r requirements.txt
```

Place the raw dataset in `data/raw/data.csv` (you can use the Pima Indians Diabetes Dataset from Kaggle).

Edit `params.yaml` to adjust preprocessing or model hyperparameters (e.g., number of estimators, max depth).
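An illustrative `params.yaml`; the key names below are assumptions and must match whatever `src/train.py` actually reads:

```yaml
# Hypothetical layout -- align keys with src/train.py.
train:
  n_estimators: 100
  max_depth: 10
  random_state: 42
```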
Run the pipeline:

```bash
dvc repro
```

This will execute all pipeline stages in order, re-running a stage only if its dependencies have changed.
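For reference, the `dvc stage add` commands shown earlier produce a `dvc.yaml` roughly like this, which is what `dvc repro` executes:

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/data.csv
    outs:
      - data/processed/data.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/data.csv
    outs:
      - models/model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/data.csv
```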
This project uses DagsHub as the remote MLflow tracking server. The MLflow tracking URI and credentials are set in `src/train.py`:

```python
os.environ["MLFLOW_TRACKING_URI"] = "https://dagshub.com/<username>/<repo>.mlflow"
os.environ["MLFLOW_TRACKING_USERNAME"] = "<username>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<token>"
```

Avoid committing a real token to version control; prefer setting these variables in your shell environment instead.

Start the MLflow UI (if running locally):

```bash
mlflow ui
```

Or view your experiments directly on your DagsHub project page.
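As a safer alternative to hardcoding the token in `src/train.py`, you can export the three variables in your shell and have the script check that they are present before training. This helper is a sketch, not part of the current code:

```python
import os

# Export these in your shell instead of hardcoding them:
#   export MLFLOW_TRACKING_URI=https://dagshub.com/<username>/<repo>.mlflow
#   export MLFLOW_TRACKING_USERNAME=<username>
#   export MLFLOW_TRACKING_PASSWORD=<token>

def mlflow_env_configured() -> bool:
    """True when all MLflow tracking variables are set and non-empty."""
    required = ("MLFLOW_TRACKING_URI",
                "MLFLOW_TRACKING_USERNAME",
                "MLFLOW_TRACKING_PASSWORD")
    return all(os.environ.get(k) for k in required)

print(mlflow_env_configured())
```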
DVC is configured to use DagsHub as the remote storage for datasets and models.
To push your data and models to DagsHub:
```bash
dvc remote add origin https://dagshub.com/<username>/<repo>.dvc
dvc remote default origin
dvc push
```

Depending on your DagsHub account settings, you may also need to configure authentication for the remote; see the DagsHub documentation for the current steps.

- Change parameters in `params.yaml` and re-run `dvc repro` to trigger new experiments.
- Track all runs in MLflow (hosted on DagsHub), including parameters, metrics, and model artifacts.
- Compare experiments visually in the MLflow UI or on DagsHub to select the best model.
- Share results and pipeline state with your team using Git, DVC, and DagsHub.
- Add new models: create a new training script (e.g., `train_xgboost.py`), add a new DVC stage, and log results to MLflow.
- Feature engineering: expand `preprocess.py` or add new preprocessing stages.
- Model deployment: integrate deployment scripts or use MLflow's model serving capabilities.
- DVC Documentation
- MLflow Documentation
- DagsHub Documentation
- Scikit-learn Documentation
- Pima Indians Diabetes Dataset
- Data Science Teams: collaborate efficiently with versioned data, code, and experiments.
- Research & Prototyping: rapidly iterate and compare models with full reproducibility.
- Production ML: lay the foundation for robust, production-ready ML workflows.
Keep learning and keep growing
This project is provided for educational and demonstration purposes.