The data used for this analysis is sourced from the Student_performance_data_.csv file, originally referenced in the Kaggle notebook: https://www.kaggle.com/code/annastasy/predicting-students-grades/input
This dataset is a widely available anonymized compilation of 2,392 records (Student ID from 1001 to 3392), often associated with secondary school student data (potentially sourced from a U.S. school context). It contains key attributes related to student performance, including demographics, social factors, study habits, and final grades.
This section details all 15 columns, including the necessary code mappings for the analysis.
| Feature | Description | Range / Mapping |
|---|---|---|
| StudentID | A unique identifier assigned to each student. | 1001 to 3392 |
| Age | The age of the students. | 15 to 18 years |
| Gender | Gender of the students (Binary). | 0: Male, 1: Female |
| Ethnicity | The ethnicity of the students (Categorical). | 0: Caucasian, 1: African American, 2: Asian, 3: Other |
| Feature | Description | Mapping |
|---|---|---|
| ParentalEducation | The education level of the parents (Ordinal). | 0: None, 1: High School, 2: Some College, 3: Bachelor's, 4: Higher |
| ParentalSupport | The level of parental support (Ordinal). | 0: None, 1: Low, 2: Moderate, 3: High, 4: Very High |
| Feature | Description | Range / Mapping |
|---|---|---|
| StudyTimeWeekly | Weekly study time in hours (Continuous). | 0 to 20 hours |
| Absences | Number of absences during the school year. | 0 to 30 |
| Tutoring | Tutoring status (Binary). | 0: No, 1: Yes |
| Extracurricular | Participation in extracurricular activities (Binary). | 0: No, 1: Yes |
| Sports | Participation in sports (Binary). | 0: No, 1: Yes |
| Music | Participation in music activities (Binary). | 0: No, 1: Yes |
| Volunteering | Participation in volunteering (Binary). | 0: No, 1: Yes |
| Feature | Description | Mapping |
|---|---|---|
| GPA | Grade Point Average on a continuous scale. | 0.0 to 4.0 |
| GradeClass | Target Variable: Classification of final grades based on GPA (Ordinal). | 0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'F' (assigned by GPA band) |
The raw CSV file presented issues upon initial loading due to the US/UK format (dot . as decimal separator) conflicting with local Excel settings. This caused the numerical columns (StudyTimeWeekly and GPA) to display incorrect, long values.
The data was corrected using the "Text to Columns" feature, specifying the dot as the decimal separator, and the file was saved as Student_performance_data_visual.xlsx for visual inspection.
Figure 1: Initial load of the raw CSV in Excel, showing formatting issues in numerical columns before correction.
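The separator problem is specific to Excel's locale handling. As a hedged illustration, the sketch below shows that pandas parses dot-decimal values directly. The two-row sample is invented for demonstration and only borrows the column names from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same dot-decimal format as the raw CSV.
sample_csv = io.StringIO(
    "StudentID,StudyTimeWeekly,GPA\n"
    "1001,19.83,2.93\n"
    "1002,15.41,3.04\n"
)

# decimal="." is the pandas default; it is spelled out here to make the intent explicit.
sample = pd.read_csv(sample_csv, decimal=".")

print(sample.dtypes)
```

Both numeric columns should come back as floating-point values, with no manual "Text to Columns" step.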
For maximum readability and initial analysis in Excel, the numeric codes (e.g., 0, 1, 2) in columns like Ethnicity, ParentalEducation, Tutoring, and GradeClass were manually converted to their corresponding text labels (Caucasian, High School, Yes, A, etc.).
This conversion was performed using IF or IFS formulas (e.g., =PIÙ.SE(...), the Italian-localized name of IFS) directly in the spreadsheet columns. The result was saved as Student_performance_data_visual_text.xlsx.
Figure 2: Data view after manual decoding of categorical variables in Excel, providing human-readable labels.
⚠️ Note on Process & Next Steps: Executing these numerous conversions using Excel formulas is a manual, time-consuming, and inflexible process. This approach is not suitable for automation or large-scale data analysis. Therefore, for the upcoming analysis in Pandas, we will switch back to the original `Student_performance_data.csv` file. We will implement the same decoding and transformation logic programmatically using Python's Pandas library (`.map()`, `.replace()`, etc.). This method is faster, scalable, and fully reproducible.
To ensure the analysis is scalable and fully reproducible, all subsequent steps use the original Student_performance_data.csv file and the Python library Pandas.
The numeric codes (e.g., 0, 1, 2) from the raw data are transformed into human-readable text labels using Pandas' .map() function and Conversion Dictionaries. This process creates five new decoded columns while preserving the original numeric data.
| Original Column | Conversion Type | New Decoded Column |
|---|---|---|
| Gender | {0: 'Male', 1: 'Female'} | Gender_Decoded |
| Ethnicity | {0: 'Caucasian', ...} | Ethnicity_Decoded |
| ParentalEducation | {0: 'None', ...} | Education_Decoded |
| Tutoring | {0: 'No', 1: 'Yes'} | Tutoring_Decoded |
| GradeClass | {0: 'A', 4: 'F'} | GradeClass_Decoded |
The Python code below defines the conversion logic and applies it to the DataFrame:
```python
import pandas as pd

# Load CSV
df = pd.read_csv("Student_performance_data.csv")

# Conversion dictionaries (Ethnicity shown; the others follow the same pattern)
ethnicity_conversion = {0: 'Caucasian', 1: 'African American', 2: 'Asian', 3: 'Other'}

# Apply conversions
df['Ethnicity_Decoded'] = df['Ethnicity'].map(ethnicity_conversion)
# ... other conversions applied ...
```

The following table summarizes the raw count of students for each ethnicity category, based on the 'Ethnicity_Decoded' column:
| Ethnicity Category | Number of Students (Count) |
|---|---|
| Caucasian | 1207 |
| African American | 493 |
| Asian | 470 |
| Other | 222 |
| Total Students | 2392 |
The analysis shows a clear imbalance (skew) in the dataset, with the Caucasian group representing over half of the student population.
Figure 3: Distribution of student population by ethnicity. The chart visually confirms that the majority of students are in the Caucasian category, highlighting a demographic bias in the dataset.
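The counts in the table are the kind of result `value_counts()` returns on the decoded column. The sketch below is illustrative only: it rebuilds a stand-in column from the published counts rather than from the CSV, then derives both absolute counts and percentage shares.

```python
import pandas as pd

# Published counts from the table above, used to rebuild a stand-in column.
published_counts = {"Caucasian": 1207, "African American": 493, "Asian": 470, "Other": 222}
ethnicity_decoded = pd.Series(
    [label for label, n in published_counts.items() for _ in range(n)],
    name="Ethnicity_Decoded",
)

counts = ethnicity_decoded.value_counts()                # absolute counts, largest first
shares = ethnicity_decoded.value_counts(normalize=True)  # fractions of the total

print(counts)
print((shares * 100).round(1))
```

On the real DataFrame the same two calls would be applied to `df['Ethnicity_Decoded']`.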
To gain a deeper understanding of the dataset's composition, the student population count is segmented by both Ethnicity and Gender. This analysis reveals the distribution of males and females within each ethnic group.
The table below shows the exact breakdown of the student population across the two demographic dimensions:
| Ethnicity Category | Male (Count) | Female (Count) | Total |
|---|---|---|---|
| Caucasian | 598 | 609 | 1207 |
| African American | 240 | 253 | 493 |
| Asian | 238 | 232 | 470 |
| Other | 115 | 107 | 222 |
| Total Students | 1191 | 1201 | 2392 |
The total number of male (1191) and female (1201) students is nearly equal, indicating gender balance within the overall dataset.
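A two-way breakdown like this can be produced with `pd.crosstab`. As a sketch under stated assumptions, the example below expands the published cell counts into a stand-in DataFrame (it does not read the CSV) and reproduces the row and column totals.

```python
import pandas as pd

# Published cell counts from the table above: (ethnicity, gender) -> students.
cells = {
    ("Caucasian", "Male"): 598, ("Caucasian", "Female"): 609,
    ("African American", "Male"): 240, ("African American", "Female"): 253,
    ("Asian", "Male"): 238, ("Asian", "Female"): 232,
    ("Other", "Male"): 115, ("Other", "Female"): 107,
}

# Expand the counts into one row per student, mimicking the decoded columns.
rows = [pair for pair, n in cells.items() for _ in range(n)]
demo = pd.DataFrame(rows, columns=["Ethnicity_Decoded", "Gender_Decoded"])

# Cross-tabulate with row and column totals.
table = pd.crosstab(
    demo["Ethnicity_Decoded"], demo["Gender_Decoded"],
    margins=True, margins_name="Total",
)
print(table)
```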
The chart is generated using the corrected Python code, which first enforces the desired X-axis order and then applies custom labels and colors to the gender segments.
Figure 4: Distribution of the student population segmented by Ethnicity and Gender. The chart confirms that the gender ratio is relatively balanced across all ethnic categories.
This section details the initial analysis of the student Grade Point Average (GPA) and its distribution across gender and ethnic groups.
The Python code first calculates the overall GPA statistics and then segments the minimum, maximum, and mean GPA by gender.
Overall GPA:
- Min GPA: 0.0
- Max GPA: 4.0
- Mean GPA: 1.906
Segmented GPA Statistics:
| Statistic | Female | Male |
|---|---|---|
| Min GPA | 0.0 | 0.0 |
| Max GPA | 4.0 | 4.0 |
| Mean GPA | 1.894 | 1.919 |
Summary: While the range is identical, the Male group exhibits a slightly higher mean GPA (1.919) compared to the Female group (1.894). The overall mean GPA suggests the data set leans towards a lower academic performance range.
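Statistics of this shape can be obtained with `agg` and `groupby`. The following is a minimal sketch on a hypothetical six-row sample; the GPA values are invented and are not the dataset's results.

```python
import pandas as pd

# Hypothetical mini-sample; the values are invented for illustration.
sample = pd.DataFrame({
    "Gender_Decoded": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "GPA": [0.5, 1.0, 2.5, 3.0, 3.0, 3.5],
})

# Overall min, max, and mean in a single call.
overall = sample["GPA"].agg(["min", "max", "mean"])

# The same three statistics, one row per gender.
by_gender = sample.groupby("Gender_Decoded")["GPA"].agg(["min", "max", "mean"])

print(overall)
print(by_gender)
```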
To understand variations, the average GPA was segmented by combining both Ethnicity and Gender.
| Ethnicity | Gender | Average GPA |
|---|---|---|
| African American | Female | 1.8915 |
| African American | Male | 1.9994 |
| Asian | Female | 1.9467 |
| Asian | Male | 1.8954 |
| Caucasian | Female | 1.8697 |
| Caucasian | Male | 1.8823 |
| Other | Female | 1.9181 |
| Other | Male | 1.9825 |
Key Finding: Males achieved a higher average GPA across almost all ethnic groups, with the Asian group being the sole exception where Females outperformed Males.
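The key finding can be checked directly from the published subgroup means. The sketch below re-enters those eight averages by hand (it does not recompute them from the CSV) and derives the male-minus-female gap per ethnic group. The column name `Gap (M - F)` is introduced here for illustration.

```python
import pandas as pd

# Published subgroup means from the table above.
avg_gpa = pd.DataFrame(
    {
        "Female": [1.8915, 1.9467, 1.8697, 1.9181],
        "Male": [1.9994, 1.8954, 1.8823, 1.9825],
    },
    index=["African American", "Asian", "Caucasian", "Other"],
)

# Positive gap: males lead; negative gap: females lead.
avg_gpa["Gap (M - F)"] = avg_gpa["Male"] - avg_gpa["Female"]
female_lead = avg_gpa.index[avg_gpa["Gap (M - F)"] < 0].tolist()

print(avg_gpa.round(4))
print(female_lead)
```

On the full DataFrame, a table of this shape would typically come from `df.groupby(['Ethnicity_Decoded', 'Gender_Decoded'])['GPA'].mean().unstack()`.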
The segmented data is visualized using a grouped bar chart.
Figure 5: Average GPA segmented by Ethnicity and Gender. This visualization highlights the performance differences across subgroups.
This section visualizes the distribution of student performance across the five GradeClass categories ('A' to 'F'). The values are presented as relative frequencies (percentages), providing a clear overview of the performance breakdown in the entire dataset.
Comment: The distribution is heavily concentrated in the lower performance categories. Grade F is the largest segment, accounting for 50.7% of the student body, followed by Grade D at 17.2% and Grade C at 16.3%.
The highest performance categories are significantly smaller, with Grade B at 11.2% and Grade A representing only 4.6% of the population. This initial view highlights a critical performance gap that requires deeper analysis into correlating factors such as weekly study time and parental education.
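Relative frequencies of this kind can be computed with `value_counts(normalize=True)`. The example is a sketch on a hypothetical ten-student sample, so the resulting percentages are invented and differ from the figures reported above.

```python
import pandas as pd

# Hypothetical ten-student sample; the labels are invented for illustration.
grades = pd.Series(
    ["F", "F", "F", "F", "F", "D", "D", "C", "B", "A"],
    name="GradeClass_Decoded",
)

# Convert counts to percentages and present them in grade order.
order = ["A", "B", "C", "D", "F"]
pct = (grades.value_counts(normalize=True) * 100).reindex(order)

print(pct)
```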
This section analyzes the relationship between the continuous variable Weekly Study Time (StudyTimeWeekly) and the primary academic performance metric, the Grade Point Average (GPA).
Figure 7: Scatter plot showing the distribution of student GPA scores relative to their reported weekly study time.
The analysis of the scatter plot reveals a crucial insight regarding academic performance in this dataset:
There is no clear positive correlation between the amount of time a student spends studying weekly and their final Grade Point Average. The dispersion of the data points is high, indicating that even students who study many hours can obtain low GPAs.
The fact that the scatter plot does not show a distinct upward trend (and shows several students with high study time and low GPA) strongly suggests that weekly study time is not the single, determining factor of academic performance in this specific dataset.
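A visual impression from a scatter plot can be complemented by a Pearson correlation coefficient. The sketch below only demonstrates the call: the data is synthetic, with study time and GPA drawn independently, so the near-zero result is by construction and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: study time and GPA are drawn independently,
# so the coefficient is close to zero by construction.
sample = pd.DataFrame({
    "StudyTimeWeekly": rng.uniform(0, 20, size=500),
    "GPA": rng.uniform(0, 4, size=500),
})

# Pearson correlation coefficient, between -1 and 1.
r = sample["StudyTimeWeekly"].corr(sample["GPA"])
print(f"Pearson r = {r:.3f}")
```

Running the same call on `df['StudyTimeWeekly']` and `df['GPA']` would put a number on the strength of the relationship described here.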
The observation that students invest significant time in studying yet yield poor results suggests that other variables are playing a more influential role. Here are the possible contributing factors:
- Quality of Study vs. Quantity: The `StudyTimeWeekly` feature measures the quantity of hours dedicated to studying, not the effectiveness or quality of that time. A student might study 20 hours a week inefficiently, while another studies 10 hours with targeted focus.
- Systemic Factors and Support (Reverse Causality):
  - Tutoring (`Tutoring_Decoded`): A high study time may be an effect, not a cause. The student might be devoting many hours to study because they are struggling academically and are desperately trying to catch up. The low GPA is, in this case, the reason for the high study time, not the reverse.
  - Parental Education (`Education_Decoded`): The level of support, guidance, or preparation that parents can provide can influence the effectiveness of the study efforts.
- The Weight of the 'F' Grade (50.7%): As previously observed in Section 5, over half of the student body falls into the 'F' grade class. It is likely that a large portion of these struggling students is dedicating substantial hours in a desperate attempt to pass their courses, but without success due to the systemic or quality-related issues listed above.
This analysis tests the hypothesis that receiving external support, specifically Tutoring (Tutoring_Decoded), is a stronger predictor of academic performance than simply the hours spent studying (as suggested by the weak correlation found in Section 6).
| Tutoring Status | Average GPA |
|---|---|
| No | 1.819 |
| Yes | 2.108 |
Students who received tutoring achieved an average GPA of 2.108, which is significantly higher (approximately +16%) than students who did not receive tutoring (average GPA of 1.819).
Figure 8: Bar plot comparing the average GPA of students who receive tutoring versus those who do not.
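The "approximately +16%" figure follows from the two published group means. As a quick arithmetic check:

```python
# Published group means from the table above.
gpa_no_tutoring = 1.819
gpa_tutoring = 2.108

absolute_gain = gpa_tutoring - gpa_no_tutoring
relative_gain_pct = absolute_gain / gpa_no_tutoring * 100

print(f"{absolute_gain:.3f} GPA points, {relative_gain_pct:.1f}%")
# prints: 0.289 GPA points, 15.9%
```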
The results provide a strong contrast to the findings from the previous section (GPA vs. Weekly Study Time):
- Tutoring as a Positive Predictor: Unlike raw study hours, the Tutoring Status shows a clear and positive relationship with the GPA. This strongly suggests that the quality or targeted nature of study (often provided by tutoring) is a far more influential factor than the mere quantity of time dedicated to studying.
- Addressing the 'F' Grade Problem: Given that the dataset is heavily skewed towards low performance (50.7% 'F' grades), the tutored group's average GPA rises above the 2.0 threshold (the F/D border), indicating that this support helps students overcome significant performance deficits.
- Hypothesis Validation: The initial hypothesis is validated: the presence of support (`Tutoring`) is a better predictor of improved performance than the quantitative metric of `StudyTimeWeekly`.
This section analyzes the relationship between the level of Parental Education (Education_Decoded) and the student's Average GPA. This factor is often considered a strong socioeconomic predictor of academic success.
The results show a counter-intuitive distribution of average GPA scores, contrasting general academic expectations:
| Education Level | Average GPA |
|---|---|
| High School | 1.944 |
| Some College | 1.930 |
| None | 1.893 |
| Higher | 1.816 |
| Bachelor's | 1.809 |
The highest average GPAs are found among students whose parents have a High School or Some College education, while the lowest GPAs are associated with the highest parental education levels (Bachelor's and Higher degrees).
Figure 9: Bar plot comparing the average GPA segmented by the five levels of parental education.
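The table above is sorted by GPA, which hides the ordinal order of the education levels. The sketch below re-enters the published means (it does not recompute them from the CSV) and re-orders them by the codes 0 to 4, so the absence of a monotonic trend can be checked explicitly.

```python
import pandas as pd

# Ordinal order of the ParentalEducation codes 0-4.
ordinal_levels = ["None", "High School", "Some College", "Bachelor's", "Higher"]

# Published means from the table above, re-ordered by education level.
avg_gpa = pd.Series(
    {
        "High School": 1.944,
        "Some College": 1.930,
        "None": 1.893,
        "Higher": 1.816,
        "Bachelor's": 1.809,
    }
).reindex(ordinal_levels)

print(avg_gpa)
print(avg_gpa.is_monotonic_increasing)
print(avg_gpa.idxmax(), avg_gpa.idxmin())
```

In ordinal order the means rise from None to High School and then fall, which is the non-linear pattern described in this section.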
The lack of a direct positive correlation (where higher education leads to higher GPA) suggests the influence of confounding factors specific to this dataset:
- Socioeconomic Heterogeneity: The 'Parental Education' category alone may not capture the full socioeconomic status or the family's direct involvement. Families with higher degrees might belong to demographics (e.g., specific ethnic groups or income levels) that, in this particular dataset, have lower overall academic scores for other systemic reasons.
- Reverse Causality in Support: Parents with higher degrees might be less involved in daily school work, assuming the student is self-sufficient, or the pressure to perform might be higher, leading to stress that negatively affects the GPA.
- Tutoring Effect: As confirmed in Section 7, Tutoring is a key predictor. It is possible that students from families with lower formal education levels are more likely to seek or receive practical, direct tutoring support (which was effective), while students from higher-education families may rely solely on their own resources, or the tutoring factor may not be evenly distributed across these educational tiers.
- Data Skew: Given the overall low mean GPA (around 1.9) and the high proportion of 'F' grades, the dataset's performance issues are deeply rooted and transcend simple parental education categories.
Conclusion: Parental education is not a simple linear predictor in this dataset, and its influence is likely masked or complicated by other variables such as ethnicity, direct support strategies, or underlying data distribution issues.
This section documents the exploration of Parental Support as a predictor of academic performance (GPA), revealing it to be the most strongly and linearly correlated factor in the dataset.
The analysis calculated the mean GPA for each of the five Parental Support categories (None, Low, Moderate, High, Very High) to determine the relationship between the level of support and academic outcome.
The analysis confirms a strong, positive, and progressive correlation between the level of parental support and the student's average GPA. This factor demonstrated the clearest linear trend among all predictors explored.
The results are summarized as follows:
| Parental Support Level | Average GPA |
|---|---|
| None | |
| Low | |
| Moderate | |
| High | |
| Very High | |
Figure 10: Average GPA By Parental Support Level. This visualization clearly demonstrates that as the level of parental support increases, the average GPA consistently rises, highlighting the factor's significance in student academic outcomes.
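The per-level means can be computed with a grouped average on the numeric `ParentalSupport` code. This is a minimal sketch, assuming a hypothetical ten-row sample: the GPA values are invented and are not the dataset's results, and the label dictionary `support_labels` is introduced here for illustration.

```python
import pandas as pd

# Assumed label mapping for the ordinal codes 0-4.
support_labels = {0: "None", 1: "Low", 2: "Moderate", 3: "High", 4: "Very High"}

# Hypothetical mini-sample; GPA values are invented and do not come from the dataset.
sample = pd.DataFrame({
    "ParentalSupport": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    "GPA": [1.2, 1.6, 1.5, 1.9, 1.8, 2.0, 2.0, 2.2, 2.1, 2.5],
})

# Group on the numeric code so the result keeps the ordinal order, then relabel.
mean_by_support = (
    sample.groupby("ParentalSupport")["GPA"].mean().rename(index=support_labels)
)

print(mean_by_support)
print(mean_by_support.is_monotonic_increasing)
```

Grouping on the numeric code rather than on a text label keeps the levels in their natural order without an explicit sort.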
The comparative analysis of the student performance data revealed several crucial and sometimes counter-intuitive insights regarding the factors that truly influence academic outcomes in this specific dataset.
| Factor | Result | Implication |
|---|---|---|
| Parental Support Level (Sec. 9) | Strongest linear correlation with GPA found. | Every incremental increase in parental support corresponds to a consistent and measurable rise in average student GPA, making it the most reliable predictor. |
| Tutoring Status (Sec. 7) | Strong positive correlation with GPA. | Targeted intervention (Tutoring) is highly effective. Students with tutoring achieved a significantly higher average GPA. |
| Study Time (Sec. 6) | No clear positive correlation with GPA. | Quantity of study is not the determining factor; quality is more important. |
| Parental Education (Sec. 8) | Counter-intuitive distribution. | Higher parental degrees (Bachelor's/Higher) correlated with lower student GPA, suggesting confounding factors (e.g., family support, stress, or other unmeasured variables) are at play. |
| Demographics (Sec. 4, 7 & 8) | Gender is balanced; Ethnicity is skewed. | While the overall dataset is dominated by the Caucasian group, the distribution of the final GradeClass (A–F) is similar across all ethnic groups (analysis skipped for redundancy). |
The dataset exhibits a severe performance gap, with over 50% of students receiving a final Grade 'F' (Section 5). The analysis strongly indicates that Parental Support is the most potent environmental factor for improving performance, closely followed by Targeted Intervention (Tutoring). These two factors are the most effective strategies for mitigating this gap, significantly outperforming both general study time and the presumed advantage of highly educated parents.
To move beyond comparative statistics and isolate the true impact of each variable, the project must transition into predictive modeling:
- Hypothesis Testing: Utilize regression analysis to quantify the strength of the relationship between variables like `ParentalSupport`, `Tutoring`, `StudyTimeWeekly`, and `GPA`.
- Predictive Model: Develop a classification model (e.g., Logistic Regression or Random Forest) to accurately predict the likelihood of a student achieving a low grade (`GradeClass` = 'F'), using all available features (including the numeric codes). This will provide actionable insights for intervention strategies.
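As a rough idea of what the regression step could look like, the sketch below fits an ordinary least squares model with NumPy. Everything in it is an assumption made for illustration: the data is synthetic, the generating coefficients are invented, and the project may well choose a dedicated statistics or machine-learning library instead.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-ins for three predictors; the generating coefficients are invented.
parental_support = rng.integers(0, 5, size=n)  # ordinal codes 0-4
tutoring = rng.integers(0, 2, size=n)          # binary 0/1
study_time = rng.uniform(0, 20, size=n)        # hours per week
gpa = (
    1.0
    + 0.15 * parental_support
    + 0.25 * tutoring
    + 0.03 * study_time
    + rng.normal(0, 0.1, size=n)
)

# Ordinary least squares: intercept column plus the three predictors.
X = np.column_stack([np.ones(n), parental_support, tutoring, study_time])
coef, *_ = np.linalg.lstsq(X, gpa, rcond=None)

names = ["intercept", "ParentalSupport", "Tutoring", "StudyTimeWeekly"]
for name, value in zip(names, coef):
    print(f"{name}: {value:.3f}")
```

Because each coefficient is estimated while holding the other predictors fixed, a fit of this kind separates the contribution of each variable, which the group-wise averages used so far cannot do.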









