This project conducts a comprehensive analysis of student performance using a dataset extracted from a MySQL database. It adheres to the complete data science lifecycle, encompassing data ingestion, preprocessing, model development, evaluation, and deployment. Advanced tools such as MLflow and DVC are utilized for experiment tracking and data versioning, ensuring a reproducible and efficient workflow. Version control is maintained via Git and GitHub, providing a clear audit trail of project progress through iterative commits.
The objective is to leverage machine learning algorithms to forecast academic performance, enabling proactive interventions and data-driven educational strategies.
![]() |
![]() |
![]() |
MLflow for experiment tracking | DVC for data version control | Pandas for data manipulation |
![]() |
![]() |
![]() |
NumPy for numerical computations | Matplotlib for data visualization | MySQL for data storage and retrieval |
- Introduction
- Technologies Used
- Data Ingestion
- Data Transformation
- Exploratory Data Analysis (EDA)
- Model Training
- Results
- DagsHub Experiments
- MLflow Tracking
- Conclusion
This project aims to implement predictive analytics to model student performance using machine learning. By adhering to a structured data science workflow, we systematically approach data handling, model development, and evaluation. The project serves as a practical application of machine learning methodologies in educational data mining.
Data was ingested from a MySQL database into a Pandas DataFrame. This process involved querying the database, handling data types, and ensuring consistency in the data structure for further analysis.
The transformation phase included rigorous data preprocessing. Tasks such as handling One Hot Encodng, normalizing data, and performing feature engineering were executed. This step is critical for enhancing model performance and ensuring data quality.
EDA was conducted using Matplotlib and Seaborn, focusing on statistical summaries and visualizations. Insights were drawn regarding data distribution, correlations, and potential anomalies, which guided the feature selection and model development process.
Various machine learning models were trained, including Linear Regression, Decision Trees, XGBregessor, Random Forest Regressor, AdaBoost, and CatBoost. A GridSearchCV was applied to all models for hyperparameter tuning to identify the best configuration for each. After evaluating the performance of all models, Linear Regression emerged as the best performer, delivering the highest accuracy and lowest error metrics among the tested algorithms.
Performance metrics of the Linear Regression model include:
- RMSE: 5.39
- R² Score: 0.88
- MAE: 4.21
These metrics underscore the model's predictive accuracy and robustness in handling the dataset.
DagsHub was utilized for experiment tracking and collaborative development. This platform enabled efficient version control and seamless collaboration among team members.
The project employed MLflow for comprehensive experiment tracking. Below are the configuration details for MLflow:
- MLFLOW_TRACKING_URI:
https://dagshub.com/38832/mlproject.mlflow
- MLFLOW_TRACKING_USERNAME:
38832
- MLFLOW_TRACKING_PASSWORD:
ed5a6942f3480d84b1bbd6bfccba8e3c5fbc9195
MLflow ensured a streamlined tracking process, capturing all model parameters, metrics, and artifacts, thereby facilitating reproducibility and transparency.
The project provided a detailed analysis of factors influencing student performance and demonstrated the applicability of machine learning in educational settings. The insights gained and the models developed can be leveraged for targeted interventions and strategic decision-making in educational institutions. Future work will focus on expanding the dataset and integrating additional predictive features to further enhance model accuracy.