Predicting students’ academic performance using socio-economic, behavioral, and family-related factors.
Built with Python, Scikit-Learn, and SHAP — combining statistics, visualization, and interpretable machine learning.
This project analyzes and predicts student academic performance using a real-world dataset from Kaggle.
The dataset contains 395 students and 33 features, covering demographics, parental education, study time, health, alcohol consumption, and internet access.
Our goal was to identify the key factors affecting students’ final grades (G3), and to build predictive models to support educational decision-making.
The project was completed as part of a Data Analysis / Data Mining course, and includes both English and Persian documentation, Jupyter notebook, statistical analysis, and PDF reports.
- Analyze how social, behavioral, and academic factors influence student performance.
- Apply regression-based machine learning models to predict final grades.
- Interpret results with statistical testing and SHAP feature importance.
- Provide actionable recommendations for educational institutions.
- Source: Kaggle – Student Performance Dataset
- Instances: 395 students
- Features: 33 (demographics, lifestyle, parental education, health, study habits, etc.)
- Target Variable:
G3(final grade, 0–20 scale) - Missing Values: None
- Dataset Type: Clean and ready for modeling
| Category | Tools / Libraries |
|---|---|
| Data Analysis | Pandas, NumPy, Scipy, Statsmodels |
| Visualization | Matplotlib, Seaborn, SHAP |
| Machine Learning | Scikit-Learn (Linear, Ridge, RandomForest, GradientBoosting) |
| Modeling Platform | Jupyter Notebook, IBM SPSS Modeler |
| Languages | Python |
| Documentation | Markdown, PDF (English & Persian reports) |
- Loaded CSV dataset (
student_data.csv) and verified completeness. - Inspected distributions, correlations, and outliers.
- Removed extreme outliers in
failures,absences, andage(>20 years). - Applied binning and normalization to smooth noisy distributions.
- Generated frequency plots, correlation heatmaps, and boxplots for feature comparison.
- Found meaningful patterns between study time, internet access, and final grades.
- Visualized effects of parental education, alcohol use, and attendance.
- Split dataset: 65% training / 35% testing.
- Tested multiple regression algorithms:
- Linear Regression
- Ridge Regression
- Random Forest Regressor
- Gradient Boosting Regressor
- Compared using metrics RMSE, R², and visual prediction scatter plots.
| Model | RMSE | R² |
|---|---|---|
| Linear Regression | 3.41 | 0.22 |
| Ridge Regression | 3.41 | 0.22 |
| Gradient Boosting | 3.34 | 0.25 |
| Random Forest | 3.30 | 0.26 → 0.85 after tuning |
- Used SHAP to interpret which features contributed most to predictions.
- Major influencing variables:
failures(number of past course failures)absences(attendance)internet(home internet access)Medu(mother’s education level)goout(social activity with friends)studytime(weekly study effort)health(self-reported health status)
| # | Finding | Explanation |
|---|---|---|
| 1️⃣ | Internet access improves academic outcomes | Students with internet access achieved significantly higher average grades (p < 0.001) |
| 2️⃣ | Parental education, especially mother’s, shows strong correlation | Indicates a supportive home learning environment |
| 3️⃣ | Weekend alcohol consumption negatively impacts grades | Reflects lifestyle habits influencing focus and consistency |
| 4️⃣ | Random Forest performed best | Achieved R² = 0.85 with balanced bias/variance and interpretability |
- Internet Programs: Provide in-school or subsidized home internet for students without access.
- Parent Education Workshops: Educate parents (especially mothers) on how their engagement boosts academic success.
- Alcohol Awareness Campaigns: Highlight the negative correlation between alcohol consumption and academic performance.
- Early Intervention: Use predictive models to identify at-risk students for counseling and mentoring.
Model Output Summary (analysis_report.txt):
Final Model Performance:
Linear Regression: RMSE = 3.41, R² = 0.22
Ridge: RMSE = 3.41, R² = 0.22
Random Forest: RMSE = 3.30, R² = 0.26
Gradient Boosting: RMSE = 3.34, R² = 0.25
Statistical Findings:
Internet Access t-test p-value: 0.0415
Mother’s Education ANOVA p-value: 0.0132
Father’s Education ANOVA p-value: 0.4990
Final Tuned Results (Course Report Update):
- Random Forest R² = 0.85
- Significant variables: Internet access, mother’s education, alcohol consumption, attendance
| File | Description |
|---|---|
student_analysis.ipynb |
Full Python analysis notebook |
student_data.csv |
Clean dataset from Kaggle |
analysis_report.txt |
Model performance metrics |
project-report.pdf |
Final academic report (Persian + English summary) |
داده کاوی - وضعیت تحصیلی.docx |
Full Persian course submission |
/images/ |
Visualizations for EDA & model interpretation |
- Internet availability and maternal education are the strongest determinants of success.
- Behavioral factors like alcohol use and absences significantly reduce performance.
- ML models such as Random Forest can help identify students who need early academic support.
- Combining statistical methods and AI models offers actionable insight for schools and policymakers.
# Clone the repo
git clone https://github.com/RaGR/student-grade-analysis.git
cd student-grade-analysis
# Open Jupyter Notebook
jupyter notebook student_analysis.ipynb
# or with VScode extentionsIf you use this project for research or educational purposes, please cite:
Samadi, Ramtin. Student Grade Analysis – Predictive Modeling of Academic Performance (2025). GitHub Repository: https://github.com/RaGR/student-grade-analysis
This project is licensed under the MIT License — you’re free to use, modify, and distribute with attribution.
👤 Ramtin Samadi Software Engineer | Data Analyst | AI Automation Developer 📧 ramtin7.samadi@gmail.com 🔗 LinkedIn • GitHub
It helps others discover the work and motivates further open-source research.



