diff --git a/Statistical_Learning.html b/Statistical_Learning.html
index e736ce4..dd33fa8 100644
--- a/Statistical_Learning.html
+++ b/Statistical_Learning.html
@@ -3649,8 +3649,8 @@

This report presents a statistical analysis of the wine quality dataset released by the Portuguese firm Vinho Verde, with the aim of classifying sample quality based on physico-chemical properties. To this end, various models and approaches are compared in order to identify the best one in terms of out-of-sample predictions.

The original data are split by wine typology, with 4898 white wine samples and 1599 red ones. In this report, however, the two sets are considered jointly, and a variable type is added, with value 1 for white wines and 0 otherwise. The final dataset is therefore composed of \(n=6497\) statistical units, with 11 variables expressing physico-chemical properties plus the additional feature type. The target variable, quality, measures sample quality as an ordinal categorical variable on a scale from 1 to 10. Each of its values is obtained as the median of at least 3 scores given by expert sommeliers.




As we can see, the actual range of the target is the set of integers in \([3,9]\). Moreover, categories 5 and 6 are the most numerous and alone account for 76.6% of the whole sample. This may be a problem during training, since the model could learn to predict only the most represented classes. For this reason, we try to balance the target variable by considering a binary version obtained as follows:

\[ \begin{equation}
@@ -3674,13 +3674,13 @@

  • Alcohol (alcohol)
  As a first insight, the following plots illustrate the distributions of each variable in both subsamples (red and yellow curves) and in the whole dataset (blue curve):





    The features are mostly bell-shaped, except for total.sulfur.dioxide, citric.acid and chlorides, which show some signs of multimodality. Moreover, the variables have rather different ranges of variation, suggesting some pre-processing (e.g. standardisation). For this reason, we performed the analysis both with and without preliminary transformations of the original data, and we found no significant differences between the two approaches. Hence, for convenience, only the results obtained without pre-processing are shown in what follows.

    Although correlation does not imply causation, it makes sense to explore the dependency structure between the target variable and each of the predictors.

    For this purpose, the correlation matrix computed on the whole dataset is reported below:




    The target quality (in its original format) is only weakly correlated with the predictors, and in just two cases (alcohol and density) do we observe a linear correlation coefficient above 0.3. The predictors, on the other hand, appear highly related to one another. In particular, the strongest relationships we can observe are:

    However, given the categorical nature of the variable Quality, it would be more appropriate to study its correlation structure based on the Spearman coefficient rather than the Pearson one:




    The Spearman coefficient, \(\rho\), measures the degree of association between two ordinal variables. Unlike the Pearson coefficient, Spearman's \(\rho\) only assumes a monotonic relationship between the variables (not necessarily a linear one).
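
As a rough illustration of the difference between the two measures, both coefficients can be computed directly with cor(). This is a minimal sketch, assuming a data frame wine whose first 11 columns are the physico-chemical predictors and which contains the ordinal target quality:

```r
# Pearson vs Spearman correlation of each predictor with the target
pearson  <- sapply(wine[, 1:11], cor, y = wine$quality, method = "pearson")
spearman <- sapply(wine[, 1:11], cor, y = wine$quality, method = "spearman")
round(cbind(Pearson = pearson, Spearman = spearman), 3)
```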

    For more details on the dataset check:

    @@ -3720,16 +3720,16 @@

    Hence, the sum of the predicted scores is 1 for all observations; in fact, \(\hat{Y}_1+ \dots + \hat{Y}_G = 1\) by construction.

    For this reason, although it is not possible to interpret the predicted scores as probabilities, they share this property.

    Moving on to the application, we fitted such a model using all of the initial regressors. As anticipated, it is easy to verify that, in general, the scores produced by such models are not guaranteed to lie between 0 and 1:
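
A minimal sketch of this check, assuming the binarised target is stored in a column quality_bin of an illustrative training frame train_df (names not taken from the report's code):

```r
# Linear probability model: plain OLS on the 0/1 target, all regressors included
lpm <- lm(quality_bin ~ ., data = train_df)

scores <- predict(lpm, newdata = train_df)
summary(scores)                  # fitted scores can fall outside [0, 1]
mean(scores < 0 | scores > 1)    # share of out-of-range scores
```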




    Notice that almost all of the coefficients are statistically significant at any reasonable significance level \(\alpha\). The only exceptions are chlorides and citric.acid, with p-values of 0.37 and 0.08 respectively.




    Focusing on the coefficient estimates, we can notice that the most relevant variable for the prediction seems to be density. However, the large value of this coefficient (-36.12) appears to be balanced by the intercept, both in terms of point estimate and variability. This behaviour is due to the feature having a highly concentrated distribution that is strongly correlated with the target: each slight variation of its value is therefore associated with a significant change in the predicted outcome. This conjecture is confirmed when the analysis is performed with standardised variables. In that case, the coefficient for density is much more in line with the other estimates, and the most relevant features become volatile.acidity, alcohol and type (-0.15, 0.14 and -0.13 respectively). No changes in the significance pattern are observed, though.




    Once a score has been obtained, it is then necessary to fix a cutoff so as to classify each observation into one of the two categories. This choice is not obvious, as the outcome of the model is not bounded in \([0,1]\). For this reason, the optimal threshold is chosen with a grid search that minimises the misclassification error. In particular, using cutoff values between 0 and 1 in steps of 0.01, the optimal choice for the threshold is 0.54.
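
A minimal sketch of this grid search, reusing the scores from the sketch above and an illustrative vector y_obs of observed 0/1 labels:

```r
# Pick the cutoff that minimises the misclassification error
cutoffs <- seq(0, 1, by = 0.01)
errors  <- sapply(cutoffs, function(t) mean(as.integer(scores > t) != y_obs))
cutoffs[which.min(errors)]   # optimal threshold
```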

    Having used the test set to choose the optimal cutoff, it is no longer possible to use it to evaluate the performance of the model without overestimating it due to overfitting. For this reason, the prediction error is estimated via Leave-One-Out cross-validation. In particular, for a linear regression with quadratic loss this measure can be computed by fitting the model just once instead of \(n\) times:
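
The shortcut relies on the identity \(e_i^{(-i)} = \frac{y_i-\hat{y}_i}{1-h_{ii}}\), where the \(h_{ii}\) are the leverages of the fitted linear model. A minimal sketch, assuming the lpm object from the earlier sketch:

```r
# LOOCV error of a linear model without refitting it n times
h       <- hatvalues(lpm)              # leverages h_ii
loo_res <- residuals(lpm) / (1 - h)    # leave-one-out residuals
mean(loo_res^2)                        # LOOCV estimate of the quadratic loss
```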

    @@ -3745,15 +3745,15 @@

    Linear discriminant analysis (LDA) provides an alternative way of building a discriminant function. Indeed, rewriting \(P(Y=g|X)\) as \(\frac{f_g(x)\pi_g}{\sum_{l=1}^{G}f_l(x)\pi_l}\), it is possible to estimate the probability of each class based on its density function (and its prior probability). Starting from this formulation, linear discriminant analysis assumes that the \(f_g(x)\) are \((p+1)\)-dimensional Gaussian densities with a common covariance matrix.

    Once the \(f_g(x)\) are estimated from the sample data, it can be shown that the discriminant function associated with each category is: \[\delta_g(x)=x^{T}\Sigma^{-1}\mu_g -\frac{1}{2}\mu_g^T\Sigma^{-1}\mu_g+\log{\pi_g}\] Finally, in order to classify an observation into one of the classes we simply pick \(g_*=argmax_g\,\delta_g(x)\).
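
A minimal sketch of these discriminant scores computed from sample estimates (empirical priors, class means and pooled covariance); X and y are illustrative names for the predictor matrix and the binary target:

```r
# Discriminant scores delta_g(x) for each class, from sample estimates
lda_scores <- function(X, y) {
  X <- as.matrix(X)
  classes <- sort(unique(y))
  n <- nrow(X); G <- length(classes)
  # Pooled within-class covariance matrix (divisor n - G)
  S <- Reduce(`+`, lapply(classes, function(g) {
    Xg <- X[y == g, , drop = FALSE]
    (nrow(Xg) - 1) * cov(Xg)
  })) / (n - G)
  Sinv <- solve(S)
  sapply(classes, function(g) {
    mu   <- colMeans(X[y == g, , drop = FALSE])
    pi_g <- mean(y == g)
    drop(X %*% Sinv %*% mu) - 0.5 * drop(t(mu) %*% Sinv %*% mu) + log(pi_g)
  })
}

# Predicted class: the one with the largest discriminant score
# pred <- sort(unique(y))[max.col(lda_scores(X, y))]
```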

    In our application, the model reaches an accuracy of 0.7438 and an AUC of 0.8095.




    Alternatively, the LDA approach can be rephrased as finding the linear combination of the regressors that best separates the categories of the target variable. In other words, we have to find \(a^Tx\) such that the groups are internally as homogeneous as possible and as far apart from one another as possible. This means we want low within-class variability and high between-class variance: \[\phi_*=argmax_{\textbf{a}}\frac{\textbf{a}^TB\textbf{a}}{\textbf{a}^TW\textbf{a}}\] where \(B\) is the between-groups covariance matrix and \(W\) the within-groups one.

    Solving this optimisation problem we get: \[(W^{-1}B-\phi I_{G \times G})\textbf{a}=0\]

    This is a homogeneous system of linear equations, so it admits non-trivial solutions if and only if \(\det(W^{-1}B-\phi I)=0\), i.e. if \(\phi\) is an eigenvalue of \(W^{-1}B\) and \(a\) is the corresponding eigenvector. Algebraically, this implies that we have at most \(G-1\) solutions, precisely as many as the rank of \(W^{-1}B\). As a consequence, there are no more than \(G-1\) discriminant directions. Since \(\phi\) is exactly the quantity we want to maximise, the best discriminant function is given by the eigenvector corresponding to the largest eigenvalue of \(W^{-1}B\). Namely, it defines the surface that best separates one class from the others.
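
A minimal sketch of this eigen-decomposition, with the scatter matrices built from scratch (X and y as in the previous sketch):

```r
# Fisher discriminant directions from the eigenvectors of W^{-1} B
fisher_directions <- function(X, y) {
  X <- as.matrix(X)
  classes <- sort(unique(y))
  G <- length(classes)
  grand_mean <- colMeans(X)
  W <- B <- matrix(0, ncol(X), ncol(X))
  for (g in classes) {
    Xg <- X[y == g, , drop = FALSE]
    W  <- W + (nrow(Xg) - 1) * cov(Xg)                          # within-groups scatter
    B  <- B + nrow(Xg) * tcrossprod(colMeans(Xg) - grand_mean)  # between-groups scatter
  }
  e <- eigen(solve(W) %*% B)
  # At most G - 1 non-null eigenvalues: keep the corresponding directions
  Re(e$vectors[, seq_len(G - 1), drop = FALSE])
}

a <- fisher_directions(X, y)[, 1]   # single canonical direction when G = 2
```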

    In the case of two categories, we have a single non-zero eigenvalue and, hence, a single discriminant direction. To better understand the results of this approach, the following plot displays the observations with respect to the linear combination \(a^Tx\), also called the canonical coordinate, with the linear decision boundary superimposed.




    Notice that the two classes do not appear well separated. The reasonably good performance of the model is probably due to the dataset being unbalanced and to the model correctly classifying the most represented category, i.e. observations of class 1.

    Comparing the results of the two LDA formulations, we observe that the vector \(a\) is unexpectedly different from the Coefficients of linear discriminants computed by the R routine. However, this is merely a matter of parameterisation. In fact, R uses a normalisation that makes the within-covariance matrix spherical, i.e.:
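
A minimal sketch of this reparameterisation, reusing a, X and y from the sketch above; up to sign, rescaling a so that the within-groups variance of \(a^Tx\) equals 1 should reproduce R's Coefficients of linear discriminants:

```r
# Pooled within-groups covariance (divisor n - G, here G = 2) and rescaling
Sw  <- Reduce(`+`, lapply(split(as.data.frame(X), y),
                          function(d) (nrow(d) - 1) * cov(d))) / (nrow(X) - 2)
a_R <- a / sqrt(drop(t(a) %*% Sw %*% a))   # within-groups variance of a^T x is now 1

# Compare (up to sign) with MASS::lda(X, grouping = factor(y))$scaling
```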

    @@ -3791,15 +3791,15 @@

    \[\min_{(\beta_0,\,\beta)}{\; -\biggl[ \frac{1}{n} \sum_{i=1}^{n}{y_i(\beta_0+x_i^T\beta)-\log{(1+\exp{(\beta_0+x_i^T\beta)})}} \biggr] + \lambda\bigl[ \alpha ||\beta||_1 + (1-\alpha)||\beta||_2^2 \bigr]}\]

    The optimal values for \(\lambda\) and \(\alpha\) are obtained by 10-fold cross-validation.
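
A minimal sketch of this tuning with the glmnet package, assuming a numeric model matrix x and the binary target y (the \(\alpha\) grid and the fold assignment are illustrative):

```r
library(glmnet)

set.seed(1)
alphas <- seq(0, 1, by = 0.05)
foldid <- sample(rep(1:10, length.out = length(y)))   # same folds for every alpha

cv_fits <- lapply(alphas, function(a)
  cv.glmnet(x, y, family = "binomial", alpha = a,
            foldid = foldid, type.measure = "class"))

best <- which.min(sapply(cv_fits, function(f) min(f$cvm)))
c(alpha = alphas[best], lambda = cv_fits[[best]]$lambda.min)
```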




    Analysing the results, we can observe that ridge regression reaches the highest accuracy. However, all three methods perform worse than the plain logistic regression, i.e. without penalisation. This suggests that no regularisation is in fact needed for the problem at hand. A further clue in favour of this thesis is given by the estimates of the shrinkage parameters, which are all close to zero. In particular, it is interesting to notice that the value picked for \(\alpha\) in the elastic-net approach is 0.05. This means that the two penalties are generally not very effective and that the one related to the ridge solution is the most relevant, in line with the separate experiments. Accuracy-wise the results are very similar, with values of 0.7305, 0.7345 and 0.7195 respectively. The same considerations hold for the ROC curves, with AUC values of 0.80, 0.7995 and 0.7897.

    Focusing on the coefficient estimates, we find confirmation that ridge regression retains all of the parameters, as expected. On the contrary, the lasso performs feature selection through the coefficients it sets to 0, namely fixed.acidity, chlorides, density and type.




    Likewise, the elastic net also performs parameter selection, despite its very low penalisation coefficient. This time, however, only fixed.acidity and type are removed from the analysis.

    In light of these considerations, and given the small difference in performance, we may prefer the lasso approach, since it reaches similar results with a simpler, more parsimonious model.

    @@ -3810,8 +3810,8 @@

    K-Nearest Neighbours (KNN) is a classification algorithm whose only parameter is K. It is based on two key principles: the Euclidean distance between observations and the “majority vote” rule for classification.

    As for the choice of the neighbourhood size, we adopt 10-fold cross-validation to select the optimal K. Input data are standardised beforehand, so as to avoid giving more importance to variables with larger scales. The optimal value obtained is \(K=23\).

    This model is also in line with the previous results, with an accuracy of 0.7572 and an AUC of 0.8192.
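
A minimal sketch of this tuning step with the caret package; the data frame train_df and the factor target quality_bin are illustrative names:

```r
library(caret)

set.seed(1)
knn_fit <- train(quality_bin ~ ., data = train_df,
                 method     = "knn",
                 preProcess = c("center", "scale"),               # standardise inputs
                 tuneGrid   = data.frame(k = seq(1, 51, by = 2)),
                 trControl  = trainControl(method = "cv", number = 10))

knn_fit$bestTune   # optimal K selected by 10-fold cross-validation
```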




    2.7. Generalised Additive Model
    @@ -3819,8 +3819,8 @@

    GAMs are additive models that make it possible to overcome the linearity constraint through the introduction of more flexible functions of the regressors. In the case of binary classification, we can write such a model as follows:

    \[\log{ \biggl( \frac{P(Y=1|X)}{1-P(Y=1|X) } \biggr)}=\alpha + \sum_{k=1}^p{f_k(X)}\]

    As far as our application is concerned, we consider smoothing splines as transformations of the input features. The optimal degrees of freedom for these functions are selected by 10-fold cross-validation, with a grid search over the range \([1,10]\). The best model turns out to be the one with 7 degrees of freedom, achieving an accuracy of 0.7735 and an AUC of 0.8305.
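
A minimal sketch of one candidate in this grid, using the gam package's smoothing splines; only a few smooth terms are written out, and the data frame and variable names are illustrative:

```r
library(gam)

# One point of the grid: smoothing splines with df = 7 on the numeric regressors
gam_fit <- gam(quality_bin ~ s(alcohol, df = 7) + s(volatile.acidity, df = 7) +
                 s(density, df = 7) + type,   # ...remaining smooth terms omitted
               family = binomial, data = train_df)

prob <- predict(gam_fit, newdata = test_df, type = "response")
pred <- as.integer(prob > 0.5)                # illustrative cutoff
```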




    Nonetheless, the improvement introduced by this greater complexity is minimal, so it may be worth simplifying the model in favour of a more parsimonious one. For this reason, one may wish to select \(df=4\) since it corresponds to a natural cubic spline and it is subject to lower variability.



    @@ -3830,17 +3830,17 @@

    In this report, we have conducted a classification analysis and compared many different approaches to this end, with a perspective typical of the data science field: selecting the model that guarantees the best out-of-sample predictions. Nonetheless, theoretical insights into the statistical properties of several models have also been discussed.

    In summary, the comparison takes into account the adoption of:

    The table below shows the optimal classification threshold for each of the models explored, together with the corresponding test-set accuracy, sensitivity, specificity and AUC.




    Looking at the accuracy metric, the GAM performs best (0.7735), followed by KNN (0.7512) and all of the linear models. Notice that, excluding the regularised approaches, QDA is the worst, notwithstanding its more complex and flexible separation surface. On the other hand, it is interesting to see the LPM on the last step of the podium (0.7488), despite its simplicity and its theoretical inadequacy in representing a probability through an output not bounded in \([0,1]\). In general, this can be explained by observing that, in a classification problem, the critical task is not to predict the probability of each category precisely, but rather to separate the classes properly.

    Considering the AUC instead, the relative ranking of the various models remains unchanged. The only exception is KNN, which falls from second to last position, thus shifting up all of the intermediate models.




    In conclusion, the GAM is the best model in terms of both accuracy and AUC, closely followed by KNN and the LPM.

    However, the performances of the various models are quite similar, as we can see from the ROC curves above, which are all very close. That being considered, one may decide to prefer a more parsimonious model in order to gain in simplicity and interpretability at the expense of a minimal loss in predictive performance.

diff --git a/Statistical_Learning_files/figure-html/unnamed-chunk-13-1.png b/Statistical_Learning_files/figure-html/unnamed-chunk-13-1.png
new file mode 100644
index 0000000..2cd3de4
Binary files /dev/null and b/Statistical_Learning_files/figure-html/unnamed-chunk-13-1.png differ
diff --git a/Statistical_Learning_files/figure-html/unnamed-chunk-18-1.png b/Statistical_Learning_files/figure-html/unnamed-chunk-18-1.png
new file mode 100644
index 0000000..2cd3de4
Binary files /dev/null and b/Statistical_Learning_files/figure-html/unnamed-chunk-18-1.png differ
diff --git a/code/030_Exploratory.R b/code/030_Exploratory.R
index 91c7b89..4aef1ff 100644
--- a/code/030_Exploratory.R
+++ b/code/030_Exploratory.R
@@ -132,7 +132,7 @@ corrplot = ggplotly(
   ggcorrplot(corr_matrix, hc.order = TRUE, outline.col = "white",
              #ggtheme = ggplot2::theme_gray,
              colors = c("#6D9EC1", "white", "#E46726")) +
-    ggtitle("Matrice di correlazione"))
+    ggtitle("Correlation matrix"))
 
 # ********** Saving a file ******************* #
 file_name = paste0( folder, "/corrplot.Rdata")
 save( corrplot, file = file_name)
@@ -168,7 +168,7 @@ data_corr = data.frame( Variables = colnames( wine[ , 1:11]),
 corr_Y = ggplot(data_corr, aes( Variables, Spearman, fill = Variables,
                                 text = paste('Kendall:', Kendall, "\n", 'Pearson:', Pearson, "\n" ))) +
   geom_bar( stat = "identity", position='stack') +
-  ggtitle( "Correlazione della variabile Qualità" ) + guides( fill = FALSE ) +
+  ggtitle( "Quality variable correlation" ) + guides( fill = FALSE ) +
   theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylab("Spearman's correlation")
 # theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1))
 
@@ -189,7 +189,7 @@ df$Relative_Freq = paste( round( df$Frequencies/sum(df$Frequencies)*100, 1), '%'
 quality_distr = ggplot(data=df, aes(x=Quality, y=Frequencies, fill = Quality)) +
   geom_bar(stat="identity") +
   geom_text(aes(label=Relative_Freq), vjust=-5, color="black") + #
-  ggtitle( 'Qualità - Distribuzione di frequenza')
+  ggtitle( 'Quality - Frequency Distribution')
 
 quality_distr = ggplotly( quality_distr )
 
@@ -201,4 +201,3 @@ save( quality_distr, file = file_name)
 
 rm(list= ls())
 
-cat('\n\n SCRIPT ESEGUITO CORRETTAMENTE!! \n\n')
diff --git a/code/076_Modeling_Classification.R b/code/076_Modeling_Classification.R
index d209351..b405c09 100644
--- a/code/076_Modeling_Classification.R
+++ b/code/076_Modeling_Classification.R
@@ -327,11 +327,11 @@ elnet_validation = ggplot(data = elnet_data, aes( x = lambda, group = alpha,
   #                       "Precisione nominale:", Precision) +
   geom_line( aes( y = Accuracy, color = alpha )) +
   geom_point( aes( y = Accuracy, color = alpha ), show.legend = F) +
-  ggtitle( "Validazione parametri Elastic Net" ) +
+  ggtitle( "Elastic Net validation" ) +
   theme(plot.title = element_text(size = 15, face = "bold"))
 
 ply_val_elnet = ggplotly( elnet_validation ) %>%
-  layout(title = "Validazione parametri Elastic Net",
+  layout(title = "Elastic Net validation",
          legend = list(orientation = "v")) # , y = 0, x = 0))
 
 save_plot( ply_val_elnet, type = "CLASSIFICATION")
@@ -392,12 +392,12 @@ gam_validation = ggplot(data = gam_data, aes( x = df)) +
   geom_line( aes( y = Accuracy, color = 'red' )) +
   geom_point( aes( y = Accuracy, color = 'red' ), show.legend = F) +
   geom_ribbon(aes(ymin=Accuracy-AccuracySD, ymax=Accuracy+AccuracySD), linetype=2, alpha=0.1) +
-  ggtitle( "Validazione parametri GAM" ) +
+  ggtitle( "GAM validation" ) +
   theme(plot.title = element_text(size = 15, face = "bold")) +
   scale_x_continuous(breaks=1:10)
 
 ply_val_gam = ggplotly( gam_validation ) %>%
-  layout(title = "Validazione parametri GAM", showlegend = F)
+  layout(title = "GAM validation", showlegend = F)
 # legend = list(orientation = "v")) # , y = 0, x = 0))
 
 save_plot( ply_val_gam, type = "CLASSIFICATION")
@@ -467,7 +467,7 @@ knn_kappa = ggplot(data = cv_df, aes(x = k)) +
 knn_kappa = ggplotly( knn_kappa )
 
 knn_cv_plot = subplot( knn_acc, knn_kappa ) %>%
-  layout(title = "Validazione parametri knn - Accuratezza vs Kappa", legend = list(orientation = "v"))
+  layout(title = "knn validation - Accuracy vs Kappa", legend = list(orientation = "v"))
 
 save_plot( knn_cv_plot, type = 'CLASSIFICATION')
diff --git a/markdown/100_main.Rmd b/markdown/100_main.Rmd
index 379a6ce..77c31f8 100644
--- a/markdown/100_main.Rmd
+++ b/markdown/100_main.Rmd
@@ -60,7 +60,7 @@ library('DT')
 ```
 
-```{r child = '250_REGURALISED_METHODS.Rmd'}
+```{r child = '250_REGULARISED_METHODS.Rmd'}
 ```
 
diff --git a/markdown/150_Introduction.Rmd b/markdown/150_Introduction.Rmd
index a866bcc..ea4afa3 100644
--- a/markdown/150_Introduction.Rmd
+++ b/markdown/150_Introduction.Rmd
@@ -82,7 +82,7 @@ For this purpose, the correlation matrix computed on the whole dataset is report
 
 ```{r, fig.width=9.5, fig.height=6, echo=FALSE}
-file = "results/EXPLORATORY_ANALYSES/corrplot.Rdata"
+file = "../results/EXPLORATORY_ANALYSES/corrplot.Rdata"
 load( file )
 corrplot
 
diff --git a/markdown/220_LDA.Rmd b/markdown/220_LDA.Rmd
index d35c3b2..a994bbb 100644
--- a/markdown/220_LDA.Rmd
+++ b/markdown/220_LDA.Rmd
@@ -38,7 +38,7 @@ load( file )
 
 file = "../results/MODELING/CLASSIFICATION/canonical_variable2.Rdata"
 load( file )
-subplot( canonical_variable, canonical_variable2) %>% layout( title = 'Variabile Canonica')
+subplot( canonical_variable, canonical_variable2) %>% layout( title = 'Canonical variable')
 ```