
End-to-end ML project with MLflow for experiment tracking, covering data ingestion through deployment. Includes data validation, transformation, training, evaluation, Docker packaging, AWS EC2 deployment, and CI/CD via GitHub Actions & ECR.


End-to-End Data Science with MLflow

Production-ready MLOps template: experiment tracking with MLflow, modular pipelines, and AWS CI/CD (ECR, EC2, GitHub Actions self-hosted runner).






Project Overview

This repository demonstrates a clean, modular, and reproducible machine-learning pipeline:

  • Config-first design (config.yaml, params.yaml, schema.yaml)
  • Core pipeline stages: Data Ingestion β†’ Validation β†’ Transformation β†’ Model Training β†’ Evaluation
  • MLflow for experiment tracking, params, metrics, and artifacts
  • Docker packaging and AWS deployment via ECR + EC2
  • Optional GitHub Actions (self-hosted runner) for end-to-end CI/CD

πŸ“¦ Pipeline Stages

  • πŸ“ Folder structure creation (scaffold)
  • πŸ“₯ Data Ingestion (download/sync raw data)
  • πŸ§ͺ Data Validation (schema checks, null handling, range validation)
  • βš™οΈ Data Transformation (feature engineering, encoding, splitting, scaling)
  • πŸ€– Model Training (baseline & hyperparameter tuned models)
  • πŸ“Š Model Evaluation (metrics, drift detection hooks)
  • πŸ“Œ Model Tracker: MLflow (parameters, metrics, artifacts, versions)
  • 🐳 Model Packaging (Docker containerization)
  • πŸš€ Model Deployment (AWS EC2 instance)
  • πŸ”„ CI/CD (GitHub Actions β†’ ECR β†’ EC2 via self-hosted runner)

Tech Stack

  • Language: Python 3.8+
  • Core: MLflow, scikit‑learn, pandas, numpy
  • Serving: Flask/FastAPI (via app.py)
  • MLOps: Docker, GitHub Actions, AWS (ECR, EC2, IAM)

πŸš€ Quick Start (Local)

Clone the repository:

```bash
git clone https://github.com/dev618/End-to-end-Machine-Learning-Project-with-MLflow
cd End-to-end-Machine-Learning-Project-with-MLflow
```

Create and activate a Conda environment:

```bash
conda create -n mlproj python=3.8 -y
conda activate mlproj
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the web app:

```bash
python app.py
```

Open your browser at the printed localhost:PORT.

Run the training pipeline:

```bash
python main.py
```

πŸ“Š Experiment Tracking (MLflow)

Local UI

Start the local UI:

```bash
mlflow ui
```

Open the printed URL to explore runs (params, metrics, artifacts). Logging from code: each pipeline stage calls log_params, log_metrics, and log_artifacts.

Remote Tracking on DagsHub

Docs: https://dagshub.com/
Project page: https://dagshub.com/dev618/End-To-End-Data-Science-with-MLFlow.mlflow

Set environment variables (PowerShell example):

```powershell
$env:MLFLOW_TRACKING_URI="https://dagshub.com/dev618/End-To-End-Data-Science-with-MLFlow.mlflow"
$env:MLFLOW_TRACKING_USERNAME="<YOUR_DAGSHUB_USERNAME>"
$env:MLFLOW_TRACKING_PASSWORD="<YOUR_DAGSHUB_TOKEN>"
```

⚠️ Do not commit secrets. Prefer GitHub Encrypted Secrets or local env vars.
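The same variables can be exported from Python before any MLflow call is made (the values below are placeholders, not real credentials):

```python
import os

# Placeholder values — never hard-code real tokens in source control.
os.environ["MLFLOW_TRACKING_URI"] = "https://dagshub.com/<USER>/<REPO>.mlflow"
os.environ["MLFLOW_TRACKING_USERNAME"] = "<YOUR_DAGSHUB_USERNAME>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<YOUR_DAGSHUB_TOKEN>"
```

MLflow reads these at import/connection time, so set them before the first tracking call.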

⚑ CI/CD on AWS

1) AWS Console & IAM
Create an IAM user with programmatic access and attach:

  • AmazonEC2ContainerRegistryFullAccess
  • AmazonEC2FullAccess

2) Create an ECR repository
Example URI: 549328952286.dkr.ecr.us-east-1.amazonaws.com/mlproj

3) Provision an EC2 (Ubuntu) host
Install Docker:

```bash
sudo apt-get update -y && sudo apt-get upgrade -y
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker ubuntu
newgrp docker
```

4) Configure a self-hosted runner
GitHub β†’ Settings β†’ Actions β†’ Runners β†’ New self-hosted runner (Linux), then run the displayed commands on the EC2 host.

5) GitHub Secrets

  • AWS_ACCESS_KEY_ID=
  • AWS_SECRET_ACCESS_KEY=
  • AWS_REGION=us-east-1
  • AWS_ECR_LOGIN_URI=549328952286.dkr.ecr.us-east-1.amazonaws.com
  • ECR_REPOSITORY_NAME=mlproj

6) Sample GitHub Actions workflow (build & push)

```yaml
name: ci-cd

on: [push]

jobs:
  build-and-push:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS creds
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login to ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build, tag, and push image
        run: |
          IMAGE_URI=${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:$(git rev-parse --short HEAD)
          docker build -t $IMAGE_URI .
          docker push $IMAGE_URI

      - name: Deploy on EC2
        run: |
          IMAGE_URI=${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:$(git rev-parse --short HEAD)
          docker pull $IMAGE_URI
          docker stop mlproj || true && docker rm mlproj || true
          docker run -d --name mlproj -p 80:80 \
            -e MLFLOW_TRACKING_URI -e MLFLOW_TRACKING_USERNAME -e MLFLOW_TRACKING_PASSWORD \
            $IMAGE_URI
```

πŸ”— Git Connectivity

```bash
git init
git remote add origin https://github.com/dev618/End-to-end-Machine-Learning-Project-with-MLflow.git
git checkout -b main
git add . && git commit -m "init: project scaffold"
git push -u origin main
```

βš™οΈ Configuration Files

  • config.yaml – paths, URIs, data locations, artifact directories
  • schema.yaml – data contracts used in validation (dtypes, ranges, required columns)
  • params.yaml – hyperparameters (splits, model params, thresholds)

Workflow to update:

1. Update config.yaml, schema.yaml, and params.yaml
2. Update the entities
3. Update the configuration manager (src/config)
4. Update the components
5. Update the pipeline
6. Update main.py
7. Update app.py

πŸ›  Pipelines & Components

Entities & Config Manager

  • Strongly-typed dataclasses for each stage (inputs/outputs/paths)
  • Centralized loader that reads the YAMLs and exposes stage configs
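The entity/config-manager pattern can be sketched as follows. Here a plain dict stands in for the parsed config.yaml, and the class and field names are illustrative, not this repo's exact ones:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataIngestionConfig:
    """Strongly-typed view of the data_ingestion section of config.yaml."""
    root_dir: Path
    source_url: str
    local_data_file: Path


class ConfigurationManager:
    """Reads the raw config mapping and exposes typed per-stage configs."""

    def __init__(self, config: dict):
        self.config = config

    def get_data_ingestion_config(self) -> DataIngestionConfig:
        c = self.config["data_ingestion"]
        return DataIngestionConfig(
            root_dir=Path(c["root_dir"]),
            source_url=c["source_url"],
            local_data_file=Path(c["local_data_file"]),
        )


# In the real project this dict would come from yaml.safe_load on config.yaml.
raw = {
    "data_ingestion": {
        "root_dir": "artifacts/data_ingestion",
        "source_url": "https://example.com/data.zip",
        "local_data_file": "artifacts/data_ingestion/data.zip",
    }
}
cfg = ConfigurationManager(raw).get_data_ingestion_config()
```

Components then receive a typed config object instead of reaching into raw YAML, so a missing or misspelled key fails loudly at startup rather than mid-pipeline.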

Components

  • data_ingestion.py – fetch/copy raw data into artifacts/
  • data_validation.py – validate against schema.yaml (required columns, dtypes, NA)
  • data_transformation.py – split train/test, encode, scale, build features
  • model_trainer.py – train baseline + tuned models, save under artifacts/model/
  • model_evaluation.py – compute metrics; log plots & confusion matrices
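The kind of schema check data_validation.py performs can be sketched without pandas; the schema and sample rows below are illustrative:

```python
def validate_columns(rows, schema):
    """Check that every row has the required columns with the expected types.

    rows: list of dicts (one per record); schema: {column_name: python_type}.
    Returns (ok, errors) — collects all problems instead of failing fast.
    """
    errors = []
    for i, row in enumerate(rows):
        for col, expected in schema.items():
            if col not in row:
                errors.append(f"row {i}: missing column {col!r}")
            elif row[col] is not None and not isinstance(row[col], expected):
                errors.append(f"row {i}: {col!r} expected {expected.__name__}")
    return (not errors, errors)


# Illustrative data contract and records.
schema = {"age": int, "income": float}
good = [{"age": 34, "income": 52000.0}]
bad = [{"age": "34", "income": 52000.0}, {"income": 1.0}]
```

Collecting every violation (rather than raising on the first) lets the validation report list all schema drift in one run.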

πŸ“ˆ MLflow Tracking

  • Logs params, metrics, and artifacts for each component
  • Model registry (optional): promote staging β†’ production

πŸš€ Deployment

  • app.py exposes the prediction API/UI
  • Dockerfile containerizes the app (copies model artifacts + code)
  • GitHub Actions builds and ships the image to ECR; EC2 pulls & runs it
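A minimal shape for app.py, assuming Flask; the route, payload format, and dummy `predict_fn` are illustrative — the real app would load the trained model from artifacts/model/:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_fn(features):
    """Stand-in for the trained model; the real app loads it from disk."""
    return sum(features)


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    features = payload["features"]  # e.g. {"features": [1.0, 2.0]}
    return jsonify({"prediction": predict_fn(features)})


if __name__ == "__main__":
    # Bind to all interfaces so the Docker port mapping (-p 80:80) works.
    app.run(host="0.0.0.0", port=80)
```

Binding to 0.0.0.0 matters inside the container: binding to 127.0.0.1 would make the mapped port unreachable from the EC2 host.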

πŸ“‚ Folder Structure

End-to-End-Data-Science-with-MLFlow/
β”‚
β”œβ”€β”€ config/                # YAML files for config, schema, params
β”œβ”€β”€ src/                   # Core source code
β”‚   β”œβ”€β”€ components/        # Data ingestion, validation, transformation, etc.
β”‚   β”œβ”€β”€ pipeline/          # Training, evaluation pipelines
β”‚   β”œβ”€β”€ config/            # Configuration manager
β”‚   └── entity/            # Data entities
β”‚
β”œβ”€β”€ artifacts/             # Generated artifacts (data, models, logs)
β”œβ”€β”€ main.py                # Orchestration script
β”œβ”€β”€ app.py                 # Flask/FastAPI app for inference
β”œβ”€β”€ requirements.txt       # Project dependencies
β”œβ”€β”€ Dockerfile             # Docker build file
β”œβ”€β”€ .github/workflows/     # GitHub Actions CI/CD pipelines
└── README.md              # Project documentation
---

## Architecture & Workflow
```text
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚                      GitHub                        β”‚
          β”‚  PRs / Commits  ─────────▢  Actions (Runner)       β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚  build & push image
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         AWS ECR (Registry)                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚  pull image
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      AWS EC2 (Compute)                        β”‚
β”‚   Docker run  β–Ά  start API (FastAPI/Flask) + MLflow logging   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     MLflow Tracking Server                    β”‚
β”‚      local mlflow ui  or  remote (DagsHub / self-hosted)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
