
End-to-end ML project with MLflow for experiment tracking, covering data ingestion through deployment. Includes data validation, transformation, training, evaluation, Docker packaging, AWS EC2 deployment, and CI/CD via GitHub Actions & ECR.


End-to-End Data Science with MLflow

Production-ready MLOps template: experiment tracking with MLflow, modular pipelines, and AWS CI/CD (ECR, EC2, GitHub Actions self-hosted runner).






Project Overview

This repository demonstrates a clean, modular, and reproducible machine-learning pipeline:

  • Config-first design (config.yaml, params.yaml, schema.yaml)
  • Core pipeline stages: Data Ingestion β†’ Validation β†’ Transformation β†’ Model Training β†’ Evaluation
  • MLflow for experiment tracking, params, metrics, and artifacts
  • Docker packaging and AWS deployment via ECR + EC2
  • Optional GitHub Actions (self-hosted runner) for end-to-end CI/CD

πŸ“¦ Pipeline Stages

  • πŸ“ Folder structure creation (scaffold)
  • πŸ“₯ Data Ingestion (download/sync raw data)
  • πŸ§ͺ Data Validation (schema checks, null handling, range validation)
  • βš™οΈ Data Transformation (feature engineering, encoding, splitting, scaling)
  • πŸ€– Model Training (baseline & hyperparameter tuned models)
  • πŸ“Š Model Evaluation (metrics, drift detection hooks)
  • πŸ“Œ Model Tracker: MLflow (parameters, metrics, artifacts, versions)
  • 🐳 Model Packaging (Docker containerization)
  • πŸš€ Model Deployment (AWS EC2 instance)
  • πŸ”„ CI/CD (GitHub Actions β†’ ECR β†’ EC2 via self-hosted runner)

Tech Stack

  • Language: Python 3.8+
  • Core: MLflow, scikit‑learn, pandas, numpy
  • Serving: Flask/FastAPI (via app.py)
  • MLOps: Docker, GitHub Actions, AWS (ECR, EC2, IAM)

πŸš€ Quick Start (Local)

Clone the repository:

```bash
git clone https://github.com/dev618/End-to-end-Machine-Learning-Project-with-MLflow
cd End-to-end-Machine-Learning-Project-with-MLflow
```

Create and activate a Conda environment:

```bash
conda create -n mlproj python=3.8 -y
conda activate mlproj
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the web app:

```bash
python app.py
```

Open your browser at the printed localhost:PORT.

Run the training pipeline:

```bash
python main.py
```

πŸ“Š Experiment Tracking (MLflow)

Local UI

Start the local UI:

```bash
mlflow ui
```

Open the printed URL to explore runs (params, metrics, artifacts). Logging from code: each pipeline stage calls log_params, log_metrics, and log_artifacts.

Remote Tracking on DagsHub

Docs: https://dagshub.com/
Project page: https://dagshub.com/dev618/End-To-End-Data-Science-with-MLFlow.mlflow

Set environment variables (PowerShell example):

```powershell
$env:MLFLOW_TRACKING_URI="https://dagshub.com/dev618/End-To-End-Data-Science-with-MLFlow.mlflow"
$env:MLFLOW_TRACKING_USERNAME="<YOUR_DAGSHUB_USERNAME>"
$env:MLFLOW_TRACKING_PASSWORD="<YOUR_DAGSHUB_TOKEN>"
```

⚠️ Do not commit secrets. Prefer GitHub Encrypted Secrets or local env vars.
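The same variables can be exported from Python before any MLflow call is made (the values below are placeholders, not real credentials):

```python
import os

# Placeholder values — never hard-code real tokens in source control.
os.environ["MLFLOW_TRACKING_URI"] = "https://dagshub.com/<USER>/<REPO>.mlflow"
os.environ["MLFLOW_TRACKING_USERNAME"] = "<YOUR_DAGSHUB_USERNAME>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<YOUR_DAGSHUB_TOKEN>"
```

MLflow reads these at import/connection time, so set them before the first tracking call.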

⚑ CI/CD on AWS

1) AWS Console & IAM
Create an IAM user with programmatic access and attach:

  • AmazonEC2ContainerRegistryFullAccess
  • AmazonEC2FullAccess

2) Create an ECR repository
Example URI: 549328952286.dkr.ecr.us-east-1.amazonaws.com/mlproj

3) Provision an EC2 (Ubuntu) host
Install Docker:

```bash
sudo apt-get update -y && sudo apt-get upgrade -y
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker ubuntu
newgrp docker
```

4) Configure a self-hosted runner
GitHub β†’ Settings β†’ Actions β†’ Runners β†’ New self-hosted runner (Linux), then run the displayed commands on the EC2 host.

5) GitHub Secrets

  • AWS_ACCESS_KEY_ID=
  • AWS_SECRET_ACCESS_KEY=
  • AWS_REGION=us-east-1
  • AWS_ECR_LOGIN_URI=549328952286.dkr.ecr.us-east-1.amazonaws.com
  • ECR_REPOSITORY_NAME=mlproj

6) Sample GitHub Actions workflow (build & push)

```yaml
name: ci-cd

on: [push]

jobs:
  build-and-push:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS creds
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login to ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build, tag, and push image
        run: |
          IMAGE_URI=${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:$(git rev-parse --short HEAD)
          docker build -t $IMAGE_URI .
          docker push $IMAGE_URI

      - name: Deploy on EC2
        run: |
          IMAGE_URI=${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:$(git rev-parse --short HEAD)
          docker pull $IMAGE_URI
          docker stop mlproj || true && docker rm mlproj || true
          docker run -d --name mlproj -p 80:80 \
            -e MLFLOW_TRACKING_URI -e MLFLOW_TRACKING_USERNAME -e MLFLOW_TRACKING_PASSWORD \
            $IMAGE_URI
```

πŸ”— Git Connectivity

```bash
git init
git remote add origin https://github.com/dev618/End-to-end-Machine-Learning-Project-with-MLflow.git
git checkout -b main
git add . && git commit -m "init: project scaffold"
git push -u origin main
```

βš™οΈ Configuration Files

  • config.yaml – paths, URIs, data locations, artifact directories
  • schema.yaml – data contracts used in validation (dtypes, ranges, required columns)
  • params.yaml – hyperparameters (splits, model params, thresholds)

Workflow to update:

1. Update config.yaml, schema.yaml, and params.yaml
2. Update the entities
3. Update the configuration manager (src/config)
4. Update the components
5. Update the pipeline
6. Update main.py
7. Update app.py

πŸ›  Pipelines & Components

Entities & Config Manager

  • Strongly-typed dataclasses for each stage (inputs/outputs/paths)
  • Centralized loader that reads the YAMLs and exposes stage configs
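The entity/config-manager pattern can be sketched as follows. Here a plain dict stands in for the parsed config.yaml, and the class and field names are illustrative, not this repo's exact ones:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataIngestionConfig:
    """Strongly-typed view of the data_ingestion section of config.yaml."""
    root_dir: Path
    source_url: str
    local_data_file: Path


class ConfigurationManager:
    """Reads the raw config mapping and exposes typed per-stage configs."""

    def __init__(self, config: dict):
        self.config = config

    def get_data_ingestion_config(self) -> DataIngestionConfig:
        c = self.config["data_ingestion"]
        return DataIngestionConfig(
            root_dir=Path(c["root_dir"]),
            source_url=c["source_url"],
            local_data_file=Path(c["local_data_file"]),
        )


# In the real project this dict would come from yaml.safe_load on config.yaml.
raw = {
    "data_ingestion": {
        "root_dir": "artifacts/data_ingestion",
        "source_url": "https://example.com/data.zip",
        "local_data_file": "artifacts/data_ingestion/data.zip",
    }
}
cfg = ConfigurationManager(raw).get_data_ingestion_config()
```

Components then receive a typed config object instead of reaching into raw YAML, so a missing or misspelled key fails loudly at startup rather than mid-pipeline.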

Components

  • data_ingestion.py – fetch/copy raw data into artifacts/
  • data_validation.py – validate against schema.yaml (required columns, dtypes, NA)
  • data_transformation.py – split train/test, encode, scale, build features
  • model_trainer.py – train baseline + tuned models, save under artifacts/model/
  • model_evaluation.py – compute metrics; log plots & confusion matrices
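The kind of schema check data_validation.py performs can be sketched without pandas; the schema and sample rows below are illustrative:

```python
def validate_columns(rows, schema):
    """Check that every row has the required columns with the expected types.

    rows: list of dicts (one per record); schema: {column_name: python_type}.
    Returns (ok, errors) — collects all problems instead of failing fast.
    """
    errors = []
    for i, row in enumerate(rows):
        for col, expected in schema.items():
            if col not in row:
                errors.append(f"row {i}: missing column {col!r}")
            elif row[col] is not None and not isinstance(row[col], expected):
                errors.append(f"row {i}: {col!r} expected {expected.__name__}")
    return (not errors, errors)


# Illustrative data contract and records.
schema = {"age": int, "income": float}
good = [{"age": 34, "income": 52000.0}]
bad = [{"age": "34", "income": 52000.0}, {"income": 1.0}]
```

Collecting every violation (rather than raising on the first) lets the validation report list all schema drift in one run.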

πŸ“ˆ MLflow Tracking

  • Logs params, metrics, and artifacts for each component
  • Model registry (optional): promote staging β†’ production

πŸš€ Deployment

  • app.py exposes the prediction API/UI
  • Dockerfile containerizes the app (copies model artifacts + code)
  • GitHub Actions builds and ships the image to ECR; EC2 pulls & runs it
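A minimal shape for app.py, assuming Flask; the route, payload format, and dummy `predict_fn` are illustrative — the real app would load the trained model from artifacts/model/:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_fn(features):
    """Stand-in for the trained model; the real app loads it from disk."""
    return sum(features)


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    features = payload["features"]  # e.g. {"features": [1.0, 2.0]}
    return jsonify({"prediction": predict_fn(features)})


if __name__ == "__main__":
    # Bind to all interfaces so the Docker port mapping (-p 80:80) works.
    app.run(host="0.0.0.0", port=80)
```

Binding to 0.0.0.0 matters inside the container: binding to 127.0.0.1 would make the mapped port unreachable from the EC2 host.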

πŸ“‚ Folder Structure

End-to-End-Data-Science-with-MLFlow/
β”‚
β”œβ”€β”€ config/                # YAML files for config, schema, params
β”œβ”€β”€ src/                   # Core source code
β”‚   β”œβ”€β”€ components/        # Data ingestion, validation, transformation, etc.
β”‚   β”œβ”€β”€ pipeline/          # Training, evaluation pipelines
β”‚   β”œβ”€β”€ config/            # Configuration manager
β”‚   └── entity/            # Data entities
β”‚
β”œβ”€β”€ artifacts/             # Generated artifacts (data, models, logs)
β”œβ”€β”€ main.py                # Orchestration script
β”œβ”€β”€ app.py                 # Flask/FastAPI app for inference
β”œβ”€β”€ requirements.txt       # Project dependencies
β”œβ”€β”€ Dockerfile             # Docker build file
β”œβ”€β”€ .github/workflows/     # GitHub Actions CI/CD pipelines
└── README.md              # Project documentation
---

## Architecture & Workflow
```text
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚                      GitHub                        β”‚
          β”‚  PRs / Commits  ─────────▢  Actions (Runner)       β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚  build & push image
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         AWS ECR (Registry)                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚  pull image
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      AWS EC2 (Compute)                        β”‚
β”‚   Docker run  β–Ά  start API (FastAPI/Flask) + MLflow logging   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     MLflow Tracking Server                    β”‚
β”‚      local mlflow ui  or  remote (DagsHub / self-hosted)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
