The aim is to predict whether a student drops out of university based on socioeconomic factors, whether they hold a scholarship, and whether they are up to date with paying their tuition fees.
The Support Vector Machine (SVM) is a widely used supervised machine learning algorithm for classification and regression tasks. It works especially well on complex datasets with distinct class boundaries; since the dataset here contains three binary variables, SVM was judged a suitable choice. Fundamentally, SVM seeks an optimal hyperplane, the decision boundary, that maximally separates the classes of data points in the feature space. The support vectors, the training points closest to this boundary, define the hyperplane and determine how the model generalises to new, unseen data points.
The SVM method can be applied to both linearly separable and non-linearly separable datasets; in this case a kernel technique was used to handle non-linearly separable data. The kernel trick maps the data into a higher-dimensional feature space in which it becomes linearly separable. There are three important parameters to consider and optimise when using SVM (a tuning sketch follows the list below):
- C: controls the trade-off between achieving a low training error and a large-margin hyperplane. A larger C value shrinks the margin so that more training points are classified correctly, while a smaller C value widens the margin at the cost of some misclassified training data.
- Kernel: each kernel type is suited to different kinds of data. The kernels considered were RBF, polynomial, linear and sigmoid.
- Gamma: describes how much influence a single training example has on the shape of the decision boundary. Higher gamma values produce more complex decision boundaries, which can result in overfitting.
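The snippet below is a minimal sketch, assuming scikit-learn, of how these three parameters can be tuned together with a cross-validated grid search. The synthetic stand-in data, the candidate values in the grid, and the variable names are illustrative assumptions, not the values used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the balanced training set (features + binary Target_0 labels).
X_train, y_train = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

param_grid = {
    "C": [0.1, 1, 10, 100],            # smaller C -> wider margin; larger C -> fewer training errors
    "gamma": ["scale", 0.01, 0.1, 1],  # higher gamma -> more complex boundary, risk of overfitting
    "kernel": ["rbf", "poly", "linear", "sigmoid"],
}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```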
Filtering, balancing and splitting data

After narrowing the dataset down to socioeconomic factors, a forward selection technique was employed to choose the best features for the model, optimising for fewer features and reducing the risk of overfitting. A model with no features was built initially. Through an iterative procedure, the feature with the next-highest correlation to Target_0 was then added: first 'Tuition Fees Up to Date', giving an accuracy of 0.498, then 'Mother's qualification', bringing the accuracy up to 0.769. A threshold of 0.05 was applied, so a new feature was only accepted if it increased the model's validation accuracy by at least 0.05 (5%). This produced a more accurate model than manually choosing the features, despite the large dataset, and the 0.05 inclusion condition helps avoid overfitting.
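As a rough reconstruction of this procedure (not the report's exact code), the sketch below ranks candidate features by their correlation with the binary target and keeps each one only if it clears the 0.05 validation-accuracy threshold. It assumes pandas DataFrames for the feature sets and an RBF-kernel SVM as the scoring model; the function name and these details are assumptions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

def forward_select(X_train, y_train, X_val, y_val, threshold=0.05):
    """Add features in order of correlation with the target, keeping each one only
    if it raises validation accuracy by at least `threshold` (0.05 as above)."""
    # Rank candidate features by absolute correlation with the binary target.
    ranked = X_train.corrwith(pd.Series(y_train, index=X_train.index)).abs()
    ranked = ranked.sort_values(ascending=False).index.tolist()

    selected, best_acc = [], 0.0
    for feature in ranked:
        trial = selected + [feature]
        model = SVC(kernel="rbf").fit(X_train[trial], y_train)
        acc = accuracy_score(y_val, model.predict(X_val[trial]))
        # Keep the feature only if it clears the improvement threshold.
        if acc >= best_acc + threshold:
            selected.append(feature)
            best_acc = acc
    return selected, best_acc
```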
The binary target variable, Target_0 (Dropout), was derived from Target, which was first encoded as an integer (0-2) and then split into three columns using pd.get_dummies. The confusion matrices of the initial forward selection model showed that balancing was necessary to reduce the discrepancy between the non-dropout cases (3003) and the dropout cases (1421), which was done via random sampling. Because the binary target variable has an imbalanced distribution, with significantly more negative instances than positive instances, undersampling the majority class helps prevent it from dominating the model's training. The data points were randomly split 60/20/20 into training, validation and testing sets, and the model was fit on the undersampled training set, which was already balanced.
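The following is a minimal sketch, assuming a pandas DataFrame df with a categorical Target column, of the target encoding, 60/20/20 split and random undersampling described above. The function name, the exact ordering of balancing relative to the split (applied here to the training set only), and the random_state are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare_target_and_split(df, random_state=0):
    """Build the binary Target_0 column, split 60/20/20, and undersample the
    majority (non-dropout) class in the training set."""
    # Encode the categorical Target as an integer (0-2), then expand it into three
    # indicator columns with pd.get_dummies; Target_0 corresponds to Dropout.
    df = df.copy()
    df["Target"] = df["Target"].astype("category").cat.codes
    df = pd.concat([df, pd.get_dummies(df["Target"], prefix="Target")], axis=1)

    # Random 60/20/20 split into training, validation and test sets.
    train, temp = train_test_split(df, test_size=0.4, random_state=random_state)
    val, test = train_test_split(temp, test_size=0.5, random_state=random_state)

    # Randomly undersample non-dropout rows so both classes are equally represented
    # in the set the model is fitted on.
    dropout = train[train["Target_0"] == 1]
    non_dropout = train[train["Target_0"] == 0].sample(n=len(dropout), random_state=random_state)
    train_balanced = pd.concat([dropout, non_dropout]).sample(frac=1, random_state=random_state)
    return train_balanced, val, test
```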