Predict a student's Mathematics score from demographics and study-related inputs using a productionized ML pipeline
Built with Sklearn | Flask Web UI | Dockerized | Deployed on AWS (Beanstalk & ECR→EC2 via GitHub Actions)
- 🎯 Demo
- ✨ Features
- 📁 Project Structure
- 📊 Data
- 🤖 Model Overview
- 🚀 Quickstart
- ⚙️ Configuration
- 🌐 Routes / API
- 🔧 Training Pipeline
- 🎯 Inference Pipeline
- 📝 Logging & Errors
- 🔄 CI/CD (GitHub Actions → ECR → EC2)
- ☁️ Deployment on AWS Elastic Beanstalk
- 📸 Screenshots
- 🐛 Troubleshooting
- 🔒 Security & Cost Notes
- 🗺️ Roadmap
- 📄 License
- 🙏 Acknowledgements
🔗 Live URL: http://<your-ec2-public-ip>:8080/ (hosted on EC2, port 8080)
💡 See Screenshots for the homepage, form, and prediction result.
| Feature | Description |
|---|---|
| 🎯 Math Score Prediction | Predict from 7 inputs (gender, race/ethnicity, parental education, lunch, test prep, reading score, writing score) |
| 🧰 End-to-End Pipeline | Ingestion → Transform → Model Training → Evaluation → Persisted Artifacts |
| 🌐 Flask Web UI | Simple two-page flow (index & predict) |
| 🐳 Dockerized App | Easy to run locally or in the cloud |
| 🚀 Two AWS Deployments | • Elastic Beanstalk (Python platform) • GitHub Actions → Amazon ECR → EC2 self-hosted runner |
```
.
├─ 📂 .ebextensions/
│ └─ python.config # Beanstalk: WSGI path / platform opts (optional)
├─ 📂 .github/
│ └─ 📂 workflows/
│ └─ main.yaml # CI (build) + CD (push to ECR & run on EC2)
├─ 📂 artifacts/
│ ├─ data.csv # Raw snapshot used locally
│ ├─ train.csv, test.csv # Split datasets
│ ├─ preprocessor.pkl # Saved ColumnTransformer (OneHot + Scale + Impute)
│ └─ model.pkl # Trained best model serialized
├─ 📂 logs/ # (Optional) runtime/training logs
├─ 📂 notebook/ # (Optional) experiments
├─ 📂 src/
│ ├─ 📂 components/
│ │ ├─ data_ingestion.py # Read dataset, write artifacts/{raw,train,test}.csv
│ │ ├─ data_transformation.py # Build & persist sklearn preprocessor
│ │ ├─ model_trainer.py # Train/evaluate models, save best model
│ │ └─ 📂 artifacts/ # (component-specific outputs if any)
│ ├─ 📂 pipeline/
│ │ ├─ train_pipeline.py # (optional) training entrypoint
│ │ └─ predict_pipeline.py # Load artifacts & predict on new data
│ ├─ exception.py # CustomException with context
│ ├─ logger.py # Logging helper
│ └─ utils.py # save_object, evaluate_models, helpers
├─ 📂 templates/
│ ├─ index.html # Landing page
│ └─ home.html # Form + prediction result
├─ app.py # Flask app (Gunicorn entrypoint: `app:application`)
├─ Dockerfile # 3.11-slim base + gunicorn
├─ requirements.txt # Pinned libs compatible with train/infer
├─ setup.py # (optional) packaging
└─ README.md
```

The dataset contains student demographics and study attributes with the target `math_score`.
- `gender`
- `race_ethnicity`
- `parental_level_of_education`
- `lunch`
- `test_preparation_course`
- `reading_score`
- `writing_score`
- `math_score` (target)
The pipeline performs an 80/20 train/test split (random_state=42) and persists train.csv, test.csv for reproducibility.
⚠️ Note: Use your own dataset or ensure you have the right to use and distribute it.
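For a quick local sanity check of the snapshot, a minimal sketch (the path follows `artifacts/data.csv` above; this helper is not part of the repo):

```python
# Hypothetical schema check against artifacts/data.csv
import pandas as pd

EXPECTED = {"gender", "race_ethnicity", "parental_level_of_education", "lunch",
            "test_preparation_course", "reading_score", "writing_score", "math_score"}

df = pd.read_csv("artifacts/data.csv")
missing = EXPECTED - set(df.columns)
assert not missing, f"Missing columns: {missing}"
print(df[["reading_score", "writing_score", "math_score"]].describe())
```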
- Linear Regression, Lasso, Ridge
- K-Nearest Neighbors Regressor
- Decision Tree, Random Forest Regressor
- XGBRegressor (XGBoost)
- CatBoost Regressor
Each model underwent comprehensive hyperparameter tuning using a modular programming approach:
| Model | Hyperparameters Tuned |
|---|---|
| Ridge | alpha: [0.1, 0.5, 1.0, 5.0, 10.0] |
| Lasso | alpha: [0.001, 0.01, 0.1, 1.0] |
| Random Forest | n_estimators, max_depth, min_samples_split, min_samples_leaf |
| XGBoost | learning_rate, n_estimators, max_depth, subsample |
| CatBoost | iterations, learning_rate, depth |
| KNN | n_neighbors: [3, 5, 7, 9, 11] |
| Decision Tree | max_depth, min_samples_split, min_samples_leaf |
Tuning Strategy:
- GridSearchCV with 5-fold cross-validation
- Automated hyperparameter selection in `src/utils.py` (sketched below)
- Modular design allows easy parameter updates
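A minimal sketch of the GridSearchCV-based tuning in `src/utils.py` (the function name `evaluate_models` and its exact signature are assumptions based on the project structure above):

```python
# Hypothetical sketch of the tuning helper in src/utils.py
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

def evaluate_models(X_train, y_train, X_test, y_test, models, params):
    """Tune each model with 5-fold GridSearchCV and report its test R²."""
    report = {}
    for name, model in models.items():
        grid = GridSearchCV(model, params.get(name, {}), cv=5, n_jobs=-1)
        grid.fit(X_train, y_train)
        model.set_params(**grid.best_params_)   # keep the tuned parameters on the model
        model.fit(X_train, y_train)
        report[name] = r2_score(y_test, model.predict(X_test))
    return report
```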
| Model | Test R² | RMSE | Best Parameters |
|---|---|---|---|
| Ridge (Best) | 0.8806 | 5.39 | alpha=1.0 |
| Linear Regression | 0.8803 | - | Default |
| CatBoost | 0.852 | - | depth=6, iterations=100 |
| Random Forest | 0.847 | - | n_estimators=100, max_depth=10 |
| XGBoost | 0.822 | - | learning_rate=0.1, n_estimators=100 |
| KNN | 0.784 | - | n_neighbors=5 |
| Decision Tree | 0.760 | - | max_depth=5 |
- `artifacts/preprocessor.pkl` — OneHotEncoder (categoricals) + StandardScaler (numericals), with imputers
- `artifacts/model.pkl` — Best model by test R² (with optimal hyperparameters)
Prerequisites: Python 3.11 recommended
```bash
# 1) Clone the repository
git clone https://github.com/<you>/Complete_ML_Project.git
cd Complete_ML_Project
# 2) Create virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# 3) Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# 4) Run the application
python app.py
# OR using gunicorn:
# gunicorn -w 2 -k gthread -b 0.0.0.0:8080 app:application
# 5) Open http://localhost:8080
```

```bash
# Build the image
docker build -t student-performance:latest .
# Run the container (container port 8080 → host 8080)
docker run --rm -p 8080:8080 student-performance:latest
# Open http://localhost:8080
```

| Setting | Value | Description |
|---|---|---|
| Port | `8080` | Container binds to `0.0.0.0:8080` (change with `-p HOST:CONTAINER`) |
| Artifacts | `artifacts/` | Expects `preprocessor.pkl` and `model.pkl` at runtime |
| Env Vars | None | No environment variables required for basic usage |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Home / landing page (`templates/index.html`) |
| GET | `/predictdata` | Renders form (`templates/home.html`) |
| POST | `/predictdata` | Accepts form inputs → returns predicted Math score |
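A minimal sketch of how `app.py` could wire these routes (form field names and the `PredictPipeline` interface are assumptions, not the repo's exact code):

```python
# Hypothetical sketch of app.py
import pandas as pd
from flask import Flask, render_template, request

from src.pipeline.predict_pipeline import PredictPipeline

app = Flask(__name__)
application = app  # Gunicorn entrypoint: app:application

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/predictdata", methods=["GET", "POST"])
def predict_datapoint():
    if request.method == "GET":
        return render_template("home.html")
    # Collect the 7 form inputs into a single-row DataFrame
    features = pd.DataFrame([{
        "gender": request.form["gender"],
        "race_ethnicity": request.form["race_ethnicity"],
        "parental_level_of_education": request.form["parental_level_of_education"],
        "lunch": request.form["lunch"],
        "test_preparation_course": request.form["test_preparation_course"],
        "reading_score": float(request.form["reading_score"]),
        "writing_score": float(request.form["writing_score"]),
    }])
    prediction = PredictPipeline().predict(features)
    return render_template("home.html", results=round(float(prediction[0]), 2))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```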
Reads dataset → writes artifacts/data.csv, train.csv, test.csv
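A minimal sketch of this step (function name is an assumption; paths and the 80/20 split with `random_state=42` follow the description above):

```python
# Hypothetical sketch of src/components/data_ingestion.py
import os
import pandas as pd
from sklearn.model_selection import train_test_split

def ingest(source_csv: str) -> tuple[str, str]:
    """Snapshot the raw data and persist the 80/20 split used throughout the project."""
    os.makedirs("artifacts", exist_ok=True)
    df = pd.read_csv(source_csv)
    df.to_csv("artifacts/data.csv", index=False)
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
    train_df.to_csv("artifacts/train.csv", index=False)
    test_df.to_csv("artifacts/test.csv", index=False)
    return "artifacts/train.csv", "artifacts/test.csv"
```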
Numerical Features (reading_score, writing_score):
- SimpleImputer(median)
- StandardScaler
Categorical Features (gender, race_ethnicity, parental_level_of_education, lunch, test_preparation_course):
- SimpleImputer(most_frequent)
- OneHotEncoder
- Scaling
→ Persisted as `artifacts/preprocessor.pkl` (sketched below)
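A minimal sketch of the preprocessor described above (the exact pipeline in `src/components/data_transformation.py` may differ, e.g. in how the one-hot output is scaled):

```python
# Hypothetical sketch of the preprocessor built in data_transformation.py
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["reading_score", "writing_score"]
cat_cols = ["gender", "race_ethnicity", "parental_level_of_education",
            "lunch", "test_preparation_course"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("scaler", StandardScaler(with_mean=False)),  # scale the sparse one-hot output
])

preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])
```

It is then fitted on the training split and saved (e.g. via `save_object` from `src/utils.py`) as `artifacts/preprocessor.pkl`.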
- Trains multiple regression models with hyperparameter tuning
- Uses GridSearchCV for optimal parameter selection
- Evaluates models using cross-validation
- Reports comprehensive metrics on train/test sets
- Implemented in modular fashion in `src/components/model_trainer.py`
- Selects best model based on R² score on test set
- Ridge Regression selected (R² = 0.8806) with optimal hyperparameters
- Model saved with fitted parameters as `artifacts/model.pkl` (see the sketch below)
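A minimal sketch of the selection/persistence step in `model_trainer.py` (`evaluate_models` and `save_object` are the helpers referenced in `src/utils.py`; their exact signatures and the 0.6 threshold are assumptions):

```python
# Hypothetical sketch of the best-model selection in model_trainer.py
from src.utils import evaluate_models, save_object

def train_and_save_best(X_train, y_train, X_test, y_test, models, params):
    """Pick the model with the highest test R², refit it, and persist it."""
    report = evaluate_models(X_train, y_train, X_test, y_test, models, params)
    best_name = max(report, key=report.get)      # e.g. "Ridge" with test R² ≈ 0.88
    if report[best_name] < 0.6:                  # guard against a poor fit
        raise ValueError("No sufficiently accurate model found")
    best_model = models[best_name].fit(X_train, y_train)
    save_object("artifacts/model.pkl", best_model)
    return best_name, report[best_name]
```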
PredictPipeline loads `preprocessor.pkl` & `model.pkl`, applies the same transforms, and returns a numeric Math score prediction (sketched below).
Inputs from form:
`gender`, `race_ethnicity`, `parental_level_of_education`, `lunch`, `test_preparation_course`, `reading_score`, `writing_score`
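A minimal sketch of the inference path (file layout follows the artifacts above; the use of plain `pickle` and the sample values are assumptions):

```python
# Hypothetical sketch of src/pipeline/predict_pipeline.py
import pickle
import pandas as pd

class PredictPipeline:
    def __init__(self,
                 preprocessor_path: str = "artifacts/preprocessor.pkl",
                 model_path: str = "artifacts/model.pkl"):
        with open(preprocessor_path, "rb") as f:
            self.preprocessor = pickle.load(f)
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)

    def predict(self, features: pd.DataFrame):
        """Apply the fitted preprocessor, then return math-score predictions."""
        return self.model.predict(self.preprocessor.transform(features))

# Example: one row of form inputs → predicted math score
sample = pd.DataFrame([{
    "gender": "female", "race_ethnicity": "group B",
    "parental_level_of_education": "bachelor's degree", "lunch": "standard",
    "test_preparation_course": "none", "reading_score": 72, "writing_score": 74,
}])
print(PredictPipeline().predict(sample)[0])
```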
- `src/logger.py` — Standard logging with info statements around key pipeline steps
- `src/exception.py` — CustomException with filename/line/context to ease debugging
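A minimal sketch of the error-context pattern in `src/exception.py` (the exact message format is an assumption):

```python
# Hypothetical sketch of src/exception.py
import sys

class CustomException(Exception):
    """Wraps an exception with the file name and line number where it occurred."""
    def __init__(self, error: Exception, error_detail=sys):
        _, _, tb = error_detail.exc_info()
        self.message = (
            f"Error in {tb.tb_frame.f_code.co_filename} "
            f"at line {tb.tb_lineno}: {error}"
        )
        super().__init__(self.message)

# Typical usage inside a pipeline step:
# try:
#     ...
# except Exception as e:
#     raise CustomException(e, sys)
```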
- CI: Lint/tests (placeholder)
- Build & Push: Docker image → Amazon ECR
- Deploy: Self-hosted runner on EC2 pulls and runs:
```bash
docker rm -f ml_project_container || true
docker run -d --name ml_project_container -p 8080:8080 <image-uri>
```

| Secret | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | AWS secret key |
| `AWS_REGION` | e.g., `us-east-2` |
| `AWS_ACCOUNT_ID` | 12-digit account ID |
| `ECR_REPOSITORY_NAME` | e.g., `studentperformance` |
- Labels: `self-hosted`, `Linux`, `X64`
- Setup: Install Docker (`apt install -y docker.io`), enable the service, add the runner user to the docker group
Alternative to the ECR/EC2 pipeline
- Platform: Python 3.9/3.13 on Amazon Linux 2023
- App Code: Your repo zipped or linked via CodePipeline
.ebextensions/python.config example:
```yaml
option_settings:
  "aws:elasticbeanstalk:container:python":
    WSGIPath: application:application
```

💡 If your entry is `app.py` with `application = Flask(__name__)`, set `WSGIPath: app:application`
| Issue | Solution |
|---|---|
| Invalid reference format during deploy | Ensure AWS_ACCOUNT_ID, AWS_REGION, ECR_REPOSITORY_NAME secrets are set |
| Cannot connect to Docker daemon on runner | Start Docker: sudo systemctl enable --now docker; add runner user to docker group |
| ECR auth errors | Ensure IAM policy includes ecr:GetAuthorizationToken and repo push/pull actions |
| Port not reachable | EC2 Security Group must allow TCP 8080; if ufw active: sudo ufw allow 8080/tcp |
- ✅ Prefer GitHub OIDC + IAM role over long-lived AWS keys
- ✅ Restrict SG ingress (ideally your IP only) or front app with load balancer/HTTPS reverse proxy
- ✅ Watch EC2/ECR costs; prune unused images, stop instances when idle
- Add HTTPS via Nginx/Caddy + Let's Encrypt on EC2
- Versioned image tags (`:sha-<GITHUB_SHA>`) and blue/green deploys
- Add tests + lint checks in CI
- Optional REST API endpoint for programmatic prediction
- Model monitoring and retraining pipeline
- Performance metrics dashboard
This project is licensed under the MIT License - see the LICENSE file for details.
- scikit-learn, XGBoost, CatBoost
- Flask & Jinja
- AWS (ECR/EC2/Beanstalk)
- GitHub Actions
- Docker
Made by Rroopesh Hari
⭐ Star this repo if you find it helpful!