Predict final student grades (G3) from demographic, school, and historical grade features.
- About
- Features
- Repository Structure
- Dataset
- Quickstart
- Preprocessing & Modeling
- Evaluation
- API
- Demo / Example Requests
- Deployment
- Roadmap & Improvements
- Contributing
- License
- Contact
This project trains and ships a production-ready machine learning pipeline to predict final student grades (G3) using the UCI Student Performance dataset. The delivered artifact is a saved sklearn Pipeline (preprocessing + model) and a FastAPI application for inference.
Key goals:
- Reproducible preprocessing with `ColumnTransformer` and `Pipeline`
- Robust baseline and ensemble models (Linear Regression, Random Forest)
- Clear evaluation and model persistence
- Lightweight FastAPI inference endpoint
Features:
- End-to-end pipeline: data → preprocessing → training → evaluation → model saving
- Preprocessing implemented with `SimpleImputer`, `StandardScaler`, and `OneHotEncoder`
- Baseline and Random Forest regression models
- Saved sklearn pipeline, so inference applies exactly the same preprocessing as training
- FastAPI server with Pydantic input validation
student-performance-prediction/
├── data/
│ └── raw/ # Raw data files
│
├── models/ # Saved models
│ └── student_performance_rf.pkl
├── notebooks/ # EDA and experiments
│ └── 01_eda.ipynb
├── src/ # Source code modules
│ ├── __init__.py
│ ├── evaluate.py
│ ├── predict.py
│ ├── preprocessing.py
│ └── train.py
├── main.py # FastAPI application entry point
├── pyproject.toml # Project configuration
├── requirements.txt # Dependencies
├── uv.lock # Dependency lock file
├── .gitignore
├── .python-version
└── README.md
- Source: UCI Student Performance dataset (Kaggle mirror recommended)
- Files: `student-mat.csv` (semicolon-separated)
- Target column: `G3` (final grade, range 0–20)
Note: keep raw data under data/raw/ and never commit sensitive/raw files to public repos.
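Because the file is semicolon-separated, it needs an explicit separator when loaded. The snippet below is a minimal sketch of that step, assuming pandas is used; `load_student_data` is a hypothetical helper name (not taken from the repository), and the in-memory sample only stands in for `data/raw/student-mat.csv`.

```python
import io

import pandas as pd

TARGET = "G3"  # final grade, the prediction target


def load_student_data(source):
    """Read the semicolon-separated CSV and split it into features X and target y."""
    df = pd.read_csv(source, sep=";")
    X = df.drop(columns=[TARGET])
    y = df[TARGET]
    return X, y


# Tiny in-memory stand-in for data/raw/student-mat.csv (illustrative values only).
sample = io.StringIO(
    'school;sex;age;G1;G2;G3\n'
    '"GP";"F";18;5;6;6\n'
    '"GP";"M";17;14;15;15\n'
)
X, y = load_student_data(sample)
print(X.shape, y.tolist())  # (2, 5) [6, 15]
```

With the real file, the call would be `load_student_data("data/raw/student-mat.csv")`.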
- Create and activate a virtual environment
python -m venv .venv
# mac / linux
source .venv/bin/activate
# windows (powershell)
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
- Place `student-mat.csv` in `data/raw/`
- Run training (example)
python src/train.py --data_path data/raw/student-mat.csv --output models/student_performance_rf.pkl
- Start the API
fastapi dev main.py
# open http://127.0.0.1:8000/docs
Preprocessing & Modeling:
- Numerical features: median imputation + `StandardScaler`
- Categorical features: most-frequent imputation + `OneHotEncoder(handle_unknown='ignore')`
- Preprocessing implemented via `build_preprocessor(cat_cols, num_cols)` in `src/preprocessing.py`
- Models available in `src/train.py` (Linear Regression baseline, Random Forest)
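The preprocessing steps listed above can be sketched roughly as follows. This is an illustration of the described technique, not the actual contents of `src/preprocessing.py`: only the `build_preprocessor(cat_cols, num_cols)` signature comes from this README, while the step names, the toy data, and the Random Forest settings are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(cat_cols, num_cols):
    """Median-impute and scale numeric columns; mode-impute and one-hot encode categoricals."""
    numeric = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    categorical = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, num_cols),
        ("cat", categorical, cat_cols),
    ])


# Toy data with one missing numeric value (illustrative only, not from the dataset).
X = pd.DataFrame({
    "school": ["GP", "MS", "GP", "MS"],
    "studytime": [2, 1, None, 3],
    "G2": [10, 5, 15, 12],
})
y = [11, 4, 16, 12]

model = Pipeline([
    ("preprocess", build_preprocessor(cat_cols=["school"], num_cols=["studytime", "G2"])),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])
model.fit(X, y)

# A category never seen in training is encoded as all zeros instead of raising an error.
unseen = pd.DataFrame({"school": ["XX"], "studytime": [2], "G2": [14]})
pred = model.predict(unseen)
print(pred.shape)
```

Keeping the preprocessor and the regressor in one `Pipeline` means a single object can be serialized and reused for inference.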
Model evaluation scripts (src/evaluate.py) produce MAE, RMSE and R² metrics.
Example baseline results (expected range):
- MAE ≈ 1.2–1.8
- RMSE ≈ 2.0–2.5
- R² ≈ 0.6–0.85
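For reference, the three metrics can be computed with scikit-learn as in the sketch below. `regression_report` is a hypothetical helper name and the numbers are made-up toy values, so this does not reproduce `src/evaluate.py` or the ranges quoted above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and R² for a set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    # RMSE is the square root of the mean squared error.
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = r2_score(y_true, y_pred)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Toy grades (illustrative only): absolute errors are 1, 0, 2, 1.
y_true = [10, 12, 8, 15]
y_pred = [11, 12, 6, 14]
report = regression_report(y_true, y_pred)
print({name: round(value, 3) for name, value in report.items()})
```

MAE and RMSE are in grade points on the 0–20 scale, so an MAE of 1.5 means predictions are off by about one and a half points on average.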
The FastAPI application loads the serialized sklearn Pipeline and exposes a /predict endpoint.
Input: JSON matching app/schema.py (Pydantic model)
Output: `{ "predicted_G3": float }`
Example request payload (replace with realistic values):
- The High-Achiever (Positive Test)
output: 18 - 20
{
"school": "GP",
"sex": "M",
"age": 17,
"address": "U",
"famsize": "LE3",
"Pstatus": "T",
"Medu": 4,
"Fedu": 4,
"Mjob": "health",
"Fjob": "services",
"reason": "reputation",
"guardian": "mother",
"traveltime": 1,
"studytime": 4,
"failures": 0,
"schoolsup": "no",
"famsup": "yes",
"paid": "yes",
"activities": "yes",
"nursery": "yes",
"higher": "yes",
"internet": "yes",
"romantic": "no",
"famrel": 5,
"freetime": 2,
"goout": 2,
"Dalc": 1,
"Walc": 1,
"health": 5,
"absences": 0,
"G1": 18,
"G2": 19
}
- The At-Risk Student (Negative Test)
output: 0 - 6
{
"school": "MS",
"sex": "M",
"age": 19,
"address": "R",
"famsize": "GT3",
"Pstatus": "T",
"Medu": 1,
"Fedu": 1,
"Mjob": "other",
"Fjob": "other",
"reason": "course",
"guardian": "other",
"traveltime": 3,
"studytime": 1,
"failures": 3,
"schoolsup": "no",
"famsup": "no",
"paid": "no",
"activities": "no",
"nursery": "no",
"higher": "no",
"internet": "no",
"romantic": "yes",
"famrel": 2,
"freetime": 4,
"goout": 5,
"Dalc": 3,
"Walc": 4,
"health": 2,
"absences": 20,
"G1": 5,
"G2": 4
}
- The "Average" Student (Boundary Test)
output: 10 - 12
{
"school": "GP",
"sex": "F",
"age": 16,
"address": "U",
"famsize": "GT3",
"Pstatus": "T",
"Medu": 2,
"Fedu": 2,
"Mjob": "services",
"Fjob": "other",
"reason": "home",
"guardian": "father",
"traveltime": 1,
"studytime": 2,
"failures": 0,
"schoolsup": "yes",
"famsup": "yes",
"paid": "no",
"activities": "yes",
"nursery": "yes",
"higher": "yes",
"internet": "yes",
"romantic": "no",
"famrel": 4,
"freetime": 3,
"goout": 3,
"Dalc": 1,
"Walc": 2,
"health": 4,
"absences": 6,
"G1": 11,
"G2": 10
}
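Payloads like the ones above can also be sent from a script rather than the `/docs` page. The sketch below uses only the standard library and assumes the API is running locally at the address from the Quickstart; `build_predict_request` and `request_prediction` are hypothetical helper names.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/predict"  # local dev server address (assumed)

# The "Average" Student payload from the examples above.
average_student = {
    "school": "GP", "sex": "F", "age": 16, "address": "U", "famsize": "GT3",
    "Pstatus": "T", "Medu": 2, "Fedu": 2, "Mjob": "services", "Fjob": "other",
    "reason": "home", "guardian": "father", "traveltime": 1, "studytime": 2,
    "failures": 0, "schoolsup": "yes", "famsup": "yes", "paid": "no",
    "activities": "yes", "nursery": "yes", "higher": "yes", "internet": "yes",
    "romantic": "no", "famrel": 4, "freetime": 3, "goout": 3, "Dalc": 1,
    "Walc": 2, "health": 4, "absences": 6, "G1": 11, "G2": 10,
}


def build_predict_request(payload, url=API_URL):
    """Build a JSON POST request for the /predict endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_prediction(payload, url=API_URL):
    """Send the request and return the predicted grade (requires a running server)."""
    with urllib.request.urlopen(build_predict_request(payload, url), timeout=10) as response:
        return json.load(response)["predicted_G3"]


request = build_predict_request(average_student)
print(request.get_method(), request.full_url)
# With the API running: print(request_prediction(average_student))
```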
- Hyperparameter tuning (Grid / Random / Bayesian)
- Cross-validation & CI checks
- Model explainability (SHAP) and fairness checks
- Monitoring: latency, error-rate, prediction drift
- Add unit & integration tests for API
Contributions are welcome. Please open an issue or PR. Follow the code style and add tests for new functionality.
MIT License — see LICENSE file.
Ali Sulman — https://github.com/alisulmanpro