<h2><a name="21_LPM" style="pointer-events: none; cursor: default;" ><font color="black"> 2.1. Linear Probability Model</font></a>
</h2>
The regression model defines the linear discriminant function, $\delta(x)$, as the expected value of the target conditional on the observed data, $E[Y|X]$. In the case of a dichotomous outcome, this is equivalent to estimating the probability that an observation belongs to category 1, $P(Y=1|X)$, hence the name linear probability model.
However, $E[Y|X]$ is by construction a real number, so it lends itself poorly to estimating a quantity that lies between 0 and 1. Nonetheless, it is possible to show that the estimated probabilities of belonging to the single classes sum to 1.
To see why this is true, consider the general case of a multinomial classification problem (G categories) based on *p* regressors.
The LPM approach then consists of fitting G binary regression models simultaneously, one for each class. The task reduces to estimating the probability of belonging to each category and assigning the observation to the class with the highest predicted score.
In matrix form:
$$
\hat{Y}_{n \times G} = X_{n \times (p+1)} ({X}^{T} X)_{(p+1) \times (p+1)}^{-1} {X}_{(p+1) \times n}^{T} Y_{n \times G}$$
Post-multiplying by a G-dimensional column vector of ones, we can sum the elements of each row of $\hat{Y}$:
$$\hat{Y}_{n \times G} \textbf{1}_{G \times 1} = X_{n \times (p+1)} ({X}^{T} X)_{(p+1) \times (p+1)}^{-1} {X}_{(p+1) \times n}^{T} Y_{n \times G} \textbf{1}_{G \times 1}$$
Now, each row of the indicator matrix $Y$ contains a single 1 (the observed class) and zeros elsewhere, so $Y_{n \times G} \textbf{1}_{G \times 1} = \textbf{1}_{n \times 1}$. Moreover, since $X$ includes the intercept column $\textbf{1}_{n \times 1}$, the hat matrix $H = X ({X}^{T} X)^{-1} {X}^{T}$ projects onto the column space of $X$, which contains $\textbf{1}_{n \times 1}$, hence $H \textbf{1}_{n \times 1} = \textbf{1}_{n \times 1}$. It follows that:
$$\hat{Y}_{n \times G} \textbf{1}_{G \times 1} = H \, Y_{n \times G} \textbf{1}_{G \times 1} = H \, \textbf{1}_{n \times 1} = \textbf{1}_{n \times 1} = Y_{n \times G} \textbf{1}_{G \times 1}$$
Hence the sum of the predicted scores is 1 for every observation: $\hat{Y}_1 + ... + \hat{Y}_G = 1$ by construction.
For this reason, although the individual predicted scores cannot be interpreted as probabilities, they do share this summing-to-one property with them.
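This row-sum property is easy to check numerically. The sketch below, which uses simulated data and hypothetical names (none of it is taken from the project code), fits the indicator-matrix regression with `lm`, verifies that each row of fitted scores sums to one, and classifies by the largest score.
```{r, eval=FALSE}
# illustrative sketch on simulated data: multinomial LPM via an indicator matrix
set.seed(1)
n <- 150; p <- 2; G <- 3
X  <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, c("x1", "x2")))
cl <- sample(1:G, n, replace = TRUE)            # true classes
Y  <- model.matrix(~ factor(cl) - 1)            # n x G indicator matrix

fit  <- lm(Y ~ X)                               # the G regressions are fitted at once
Yhat <- fitted(fit)                             # n x G matrix of predicted scores

range(rowSums(Yhat))                            # all rows sum to 1 (up to rounding)
pred <- max.col(Yhat)                           # classify by the highest score
mean(pred == cl)                                # in-sample accuracy of the sketch
```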
Turning to the application, we fitted such a model using all of the initial regressors. It is easy to verify that, as anticipated, the scores produced by this kind of model are in general not guaranteed to lie between 0 and 1:
```{r, fig.width=9.5, fig.height=4, echo=FALSE}
# load and display the pre-computed plot of the LPM fitted scores
file = "results/MODELING/CLASSIFICATION/lpm_probs.Rdata"
load( file )
plot
```
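As a rough indication of how such a figure could be produced, the hedged sketch below simply plots the fitted scores of the model against the $[0,1]$ band; the object name `lpm_fit` is an assumption, not the one used in the project code.
```{r, eval=FALSE}
# hypothetical sketch: visualise LPM fitted scores that fall outside [0, 1]
# `lpm_fit` is assumed to be the linear model fitted on the binary target
scores <- fitted(lpm_fit)
plot(scores, pch = 20, cex = 0.5,
     xlab = "observation", ylab = "fitted score",
     main = "LPM fitted scores and the [0, 1] band")
abline(h = c(0, 1), lty = 2, col = "red")       # admissible probability range
```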
<br></br>
<br></br>
Looking at the output, we can notice that almost all the variables turn out to be significant at any reasonable threshold $\alpha$. The only exceptions are *chlorides* and *citric.acid*, with p-values of 0.37 and 0.08 respectively.
```{r, fig.width=9.5, fig.height=4, echo=FALSE}
file = "results/MODELING/CLASSIFICATION/lpm_summary.Rdata"
load( file )
datatable(df,
          options = list(pageLength = 13),
          rownames = F,
          class = "display")
```
<br></br>
<br></br>
Focusing on the coefficient estimates, it is possible to notice that the most relevant variable for the prediction seems to be *density*. However, the large magnitude of its coefficient (-36.12) appears to be balanced by the intercept, both in terms of point estimate and variability.
This behaviour is due to the feature having a very concentrated distribution (its values vary over a narrow range) while being strongly correlated with the target. Thus, even a slight variation of its value is associated with a sizeable change in the predicted outcome.
This conjecture is confirmed when the analysis is repeated with standardised variables. This time the coefficient for *density* is much more in line with the other estimates, and the most relevant features become *volatile.acidity*, *alcohol* and *type* (respectively -0.15, 0.14 and -0.13). No changes in the patterns of significance are observed, though.
```{r, fig.width=9.5, fig.height=4, echo=FALSE}
file = "results/MODELING/CLASSIFICATION/lpm_summary_std.Rdata"
load( file )
datatable(df,
          options = list(pageLength = 13),
          rownames = F,
          class = "display")
```
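For reference, the standardisation step can be sketched as follows; `X_train` (an all-numeric design matrix, with categorical variables such as *type* already coded as 0/1 dummies) and `y_train` are hypothetical names, not the ones used in the project.
```{r, eval=FALSE}
# hypothetical sketch: refit the LPM on standardised regressors so that the
# coefficients are directly comparable across variables
X_std   <- as.data.frame(scale(X_train))   # centre and scale each column
fit_std <- lm(y_train ~ ., data = X_std)
summary(fit_std)                           # coefficients now on a common scale
```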
<br></br>
<br></br>
Once a score has been obtained, it is then necessary to fix a cutoff so as to classify each observation into one of the two categories. This choice is not obvious, as the outcome of the model is not bounded in $[0,1]$. For this reason, the optimal threshold is chosen through a grid search that minimises the misclassification error. In particular, trying cutoff values between 0 and 1 in steps of 0.01, the optimal choice for the threshold is 0.54.
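A minimal sketch of such a grid search is given below, assuming a vector of test-set scores `scores_test` and the corresponding binary labels `y_test` (both names are hypothetical).
```{r, eval=FALSE}
# hypothetical sketch: pick the cutoff that minimises the misclassification error
cutoffs <- seq(0, 1, by = 0.01)
errors  <- sapply(cutoffs,
                  function(c) mean(as.numeric(scores_test > c) != y_test))
best_cutoff <- cutoffs[which.min(errors)]
best_cutoff                                # 0.54 in the analysis reported above
```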
Having used the test set to choose the optimal cutoff, it is no longer possible to use it to evaluate the performance of the model without obtaining an over-optimistic estimate. For this reason, the prediction error is estimated through Leave-One-Out cross-validation. In particular, for a linear regression with quadratic loss this measure can be computed by fitting the model just once instead of $n$ times, using the Generalised Cross-Validation (GCV) approximation, which replaces the individual leverages $h_{ii}$ with their average $trace(S)/n$:
$$ GCV = \frac{1}{n} \sum_{i=1}^{n} { \biggl[ \frac{ y_i - \hat{f}(x_i) }
{1- \frac{trace(S)}{n}} \biggr]^2 } $$
where $trace(S)$ represents the effective degrees of freedom of the model (in our case $p+1=13$).
Notice that, although the misclassification error is not a quadratic loss, the two coincide here: since both $y_i$ and the thresholded prediction $\hat{f}(x_i)$ take values in $\{0,1\}$, their difference is either 0 or $\pm 1$, and its square is exactly the 0-1 loss. Hence, the approach described above is justified.
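A sketch of how this shortcut can be computed from a single fit is given below, assuming a fitted model `lpm_fit`, its binary response `y` and the cutoff chosen above (the object names are hypothetical).
```{r, eval=FALSE}
# hypothetical sketch: GCV estimate of the misclassification error from one fit
n     <- length(y)
tr_S  <- sum(hatvalues(lpm_fit))              # trace(S): effective degrees of freedom
y_hat <- as.numeric(fitted(lpm_fit) > 0.54)   # thresholded predictions (0/1)
gcv   <- mean(((y - y_hat) / (1 - tr_S / n))^2)
1 - gcv                                       # accuracy implied by the GCV error
```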
In conclusion, the accuracy obtained through Generalised Cross-Validation is 0.7402 (as opposed to the 0.7488 computed on the test set, which, as expected, is an optimistic estimate). In terms of the Receiver Operating Characteristic (ROC) curve, the model achieves an Area Under the Curve (AUC) of 0.8095.
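The ROC curve and the AUC can be obtained, for instance, with the `pROC` package; the sketch below reuses the hypothetical `scores_test` and `y_test` objects and is not necessarily the procedure used in the project.
```{r, eval=FALSE}
# hypothetical sketch: ROC curve and AUC of the LPM scores
library(pROC)
roc_obj <- roc(response = y_test, predictor = scores_test)
plot(roc_obj)                                 # ROC curve
auc(roc_obj)                                  # area under the curve
```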
<br></br>
<br></br>