Student Performance Prediction – End-to-End ML Project

This project predicts students’ math performance based on demographic and academic attributes such as gender, parental education, lunch type, test preparation, and reading/writing scores.
The dataset is "Students Performance in Exams" from Kaggle.

Objective

To build a machine learning pipeline that can:

Clean and preprocess the data
Perform feature engineering and transformation
Train multiple ML models and optimize hyperparameters
Save and deploy the best-performing model using a scalable and reusable pipeline

Project Structure

ML_Project_1/
│
├── artifacts/                     # Stores serialized models and processed data
│   ├── preprocessor.pkl
│   ├── train.csv
│   ├── test.csv
│   ├── raw.csv
│
├── logs/                          # Application logs
│
├── notebook/                      # Jupyter notebooks for EDA & experimentation
│   ├── 1. EDA STUDENT PERFORMANCE.ipynb
│   ├── 2. MODEL TRAINING.ipynb
│   └── data/data.csv
│
├── src/
│   ├── components/                # Core ML components
│   │   ├── data_ingestion.py
│   │   ├── data_transformation.py
│   │   ├── model_trainer.py             (to be added)
│   │   └── model_hyperparameter_tuning.py (to be added)
│   │
│   ├── pipeline/                  # Training & prediction pipelines
│   │   ├── train_pipeline.py
│   │   ├── predict_pipeline.py
│   │   └── __init__.py
│   │
│   ├── logs/                      # Logging config files
│   ├── exception.py               # Custom exception handling
│   ├── logger.py                  # Logging utilities
│   ├── utils.py                   # Helper functions (e.g., model saving/loading)
│   └── __init__.py
│
├── venv/                          # Virtual environment
│
├── .gitignore
├── requirements.txt               # Required dependencies
├── setup.py                       # Package configuration
└── README.md

Tech Stack

Language: Python 3.10+
Libraries: numpy, pandas, scikit-learn, matplotlib, seaborn, joblib
Frameworks: Flask (for deployment)
Cloud: AWS EC2, Azure Container Instances (planned)
Version Control: Git + GitHub
Environment: Virtual Environment / Conda

Key Modules

1️⃣ Data Ingestion

Loads raw data from notebook/data/data.csv
Splits data into train/test sets
Saves the raw, train, and test sets as CSV files in artifacts/
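
The ingestion steps above can be sketched roughly as follows. This is an illustration, not the contents of data_ingestion.py: the helper name `ingest`, the 80/20 split, and `random_state=42` are assumptions.

```python
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split


def ingest(source_csv, artifacts_dir="artifacts", test_size=0.2):
    """Read the raw CSV, split it, and write raw/train/test copies (hypothetical helper)."""
    out = Path(artifacts_dir)
    out.mkdir(parents=True, exist_ok=True)

    df = pd.read_csv(source_csv)
    df.to_csv(out / "raw.csv", index=False)

    # The split ratio and seed are assumed values, fixed only for reproducibility.
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=42)
    train_df.to_csv(out / "train.csv", index=False)
    test_df.to_csv(out / "test.csv", index=False)
    return out / "train.csv", out / "test.csv"
```

Calling `ingest("notebook/data/data.csv")` from the project root would then write the three CSV files listed under artifacts/ in the project structure.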

2️⃣ Data Transformation

Handles missing values with SimpleImputer
Encodes categorical features using OneHotEncoder
Scales features using StandardScaler
Saves preprocessor object (preprocessor.pkl)
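
One common way to combine these three steps in scikit-learn is a ColumnTransformer with separate numeric and categorical pipelines. The sketch below assumes that layout; the function name `build_preprocessor`, the imputation strategies, and the column names in the toy frame are illustrative and may differ from data_transformation.py.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Impute and scale numeric columns; impute and one-hot encode categorical ones."""
    numeric = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    categorical = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])


# Toy rows with illustrative column names; one numeric value is missing on purpose.
sample = pd.DataFrame({
    "reading_score": [72.0, 90.0, None, 57.0],
    "writing_score": [74.0, 88.0, 93.0, 44.0],
    "gender": ["female", "female", "male", "male"],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
})

preprocessor = build_preprocessor(["reading_score", "writing_score"], ["gender", "lunch"])
features = preprocessor.fit_transform(sample)
```

The fitted `preprocessor` is the kind of object that would be serialized to preprocessor.pkl, for example with `joblib.dump`.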

3️⃣ Model Trainer (upcoming)

Trains multiple ML models (e.g., Linear Regression, RandomForest, XGBoost)
Evaluates metrics (R², RMSE, MAE)
Saves the best-performing model
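
Since this module is still upcoming, the following is only a sketch of how the compare-and-select loop might look. It uses synthetic data rather than the student dataset, leaves out XGBoost to avoid an extra dependency, and the helper name `evaluate_models` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and return {name: {"r2", "rmse", "mae"}} on the test split."""
    report = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        report[name] = {
            "r2": r2_score(y_test, pred),
            "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
            "mae": mean_absolute_error(y_test, pred),
        }
    return report


# Synthetic stand-in data: the target is a noisy linear mix of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "LinearRegression": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}
report = evaluate_models(candidates, X_train, y_train, X_test, y_test)

# Pick the model with the highest test R²; other selection criteria are possible.
best_name = max(report, key=lambda name: report[name]["r2"])
```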

4️⃣ Hyperparameter Tuning (upcoming)

Uses GridSearchCV or RandomizedSearchCV for model optimization
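
As a rough illustration of the planned tuning step, the sketch below runs GridSearchCV over a small grid on synthetic data. The estimator, the grid values, and the 3-fold setting are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the transformed student features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)

# Every combination in the grid is cross-validated: 2 x 2 = 4 candidates.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_model = search.best_estimator_  # refit on all of X, y with best_params
```

RandomizedSearchCV follows the same pattern but takes `param_distributions` and an `n_iter` budget instead of an exhaustive grid, which tends to scale better when the search space is large.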

5️⃣ Prediction Pipeline (upcoming)

Loads saved preprocessor and model to predict unseen data
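
A minimal sketch of that load-and-predict flow is shown below, assuming both artifacts were saved with joblib. The function name `predict`, the file name model.pkl, and the stand-in scaler and model are hypothetical; they exist only so the example runs end to end.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def predict(features, preprocessor_path, model_path):
    """Load the saved preprocessor and model, then score new rows."""
    preprocessor = joblib.load(preprocessor_path)
    model = joblib.load(model_path)
    return model.predict(preprocessor.transform(features))


# Stand-in artifacts written to a temp directory; real ones would come from training.
train = pd.DataFrame({
    "reading_score": [50.0, 60.0, 70.0, 80.0],
    "writing_score": [52.0, 58.0, 71.0, 79.0],
})
target = [48.0, 61.0, 69.0, 82.0]

artifacts = Path(tempfile.mkdtemp())
scaler = StandardScaler().fit(train)
joblib.dump(scaler, artifacts / "preprocessor.pkl")
joblib.dump(LinearRegression().fit(scaler.transform(train), target), artifacts / "model.pkl")

new_rows = pd.DataFrame({"reading_score": [65.0], "writing_score": [66.0]})
predictions = predict(new_rows, artifacts / "preprocessor.pkl", artifacts / "model.pkl")
```

Keeping the preprocessor and model as separate files means the same transformation fitted at training time is reapplied at prediction time, which avoids train/serve skew.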

Training the Pipeline

# Step 1: Activate environment (Windows; on macOS/Linux use: source venv/bin/activate)
venv\Scripts\activate

# Step 2: Install dependencies
pip install -r requirements.txt

# Step 3: Run Data Ingestion
python src/components/data_ingestion.py

# Step 4: Run Data Transformation
python src/components/data_transformation.py

# Step 5: Run Model Training (when implemented)
python src/components/model_trainer.py

Deployment (Planned)

AWS EC2

Containerize the application using Docker
Deploy the Flask app and trained ML model on an EC2 instance
Serve the production app with Gunicorn, optionally behind NGINX as a reverse proxy

Azure Container Instances

Deploy using Azure CLI or the Azure Portal
Build Docker image and push it to Azure Container Registry (ACR)
Run and scale the containerized app directly on Azure

Results

Model              R² Score  RMSE  MAE
Linear Regression  0.88      5.40  4.22
Lasso              0.83      6.52  5.16

Utilities

Custom Logging: Provides detailed tracking of every step in the ML workflow
Custom Exception Handling: Ensures robust and clean error management
Reusable Pipelines: Modularized preprocessing and model training pipelines for flexibility
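
To make the logging and exception utilities concrete, here is a small standard-library sketch. It is not the code in logger.py or exception.py: the class name `PipelineError`, the logger name, the log format, and the toy `safe_divide` step are all assumptions.

```python
import logging
import sys


class PipelineError(Exception):
    """Wraps an exception with the file and line where it was raised (hypothetical class)."""

    def __init__(self, original):
        tb = original.__traceback__
        # Walk to the innermost frame, where the error actually occurred.
        while tb is not None and tb.tb_next is not None:
            tb = tb.tb_next
        location = "unknown location"
        if tb is not None:
            location = f"{tb.tb_frame.f_code.co_filename}:{tb.tb_lineno}"
        super().__init__(f"{type(original).__name__} at {location}: {original}")


logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="[%(asctime)s] %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("ml_project")


def safe_divide(a, b):
    """Toy step showing how a component might log and re-raise failures."""
    try:
        return a / b
    except Exception as exc:
        logger.error("division failed")
        raise PipelineError(exc) from exc
```

Wrapping low-level errors this way keeps the original exception chained (via `from exc`) while putting the failing file and line directly in the message that reaches the logs.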

Author

Mayank Meghwal
Data Scientist | Machine Learning Engineer

Email: mayankmeg207@gmail.com
GitHub: itz-Mayank

Future Enhancements

Implement CI/CD pipeline with GitHub Actions
Automate deployment using Docker and Kubernetes
Integrate model monitoring and automated retraining system
Add support for multi-cloud deployment (AWS + Azure + GCP)

License

This project is open-source and available under the MIT License.
