# Variable Selection
Imagine you're a detective standing before a pinboard covered in clues---some are glaringly obvious, while others might be red herrings. Your mission? To pick which pieces of evidence will crack the case. This is the essence of variable selection in statistics: deciding which variables best uncover the story behind your data. Far from a mechanical chore, it's a high-stakes balancing act blending analytical goals, domain insights, data realities, and computational feasibility.
Why Does Variable Selection Matter?
- **Focus and Clarity**: Models cluttered with unnecessary variables can obscure the real relationships or patterns in your data. By identifying the variables that truly drive your results, you sharpen your model's focus and interpretability.
- **Efficiency and Performance**: Too many variables can lead to overfitting---fitting the quirks of a single dataset rather than underlying trends. Streamlined models often run faster and generalize better.
- **Practical Constraints**: In many real-world scenarios, data collection or processing costs money, time, and effort. Prioritizing the most meaningful variables becomes not just a statistical concern, but a strategic one.
**Key Influences on Variable Selection**
1. **Objectives or Goals**
- *Prediction vs. Inference*: Are you trying to forecast future outcomes or explain why certain events happen? Prediction-focused models might include as many relevant features as possible for accuracy, whereas inference-driven models often strive for parsimony and clearer relationships.
- *Balance*: Some analyses blend both objectives, requiring careful negotiation between complexity (to maximize predictive ability) and simplicity (to maintain interpretability).
2. **Previously Acquired Expertise**
- *Domain Knowledge*: Whether you're analyzing financial trends or studying medical records, familiarity with the subject can reveal which variables are naturally linked to the phenomenon.
- *Subtle Clues*: Experts can uncover hidden confounders---variables that outwardly seem irrelevant yet dramatically influence results.
3. **Availability and Quality of Data**
- *Completeness*: Missing data or sparse measurements can force you to discard or transform variables. Sometimes the ideal variable simply isn't present in your dataset.
- *Reliability*: A variable riddled with measurement errors or inconsistencies may do more harm than good.
4. **Computational Resources and Software**
- *Toolset Capabilities*: Some statistical techniques or advanced machine learning methods thrive on large sets of variables, while others become unwieldy.
- *Time and Memory Constraints*: Even the most sophisticated algorithms can choke on too much data if hardware resources are limited.
Overview of various variable selection criteria and algorithms
1. Information Criteria for Model Selection
1. [Akaike Information Criterion (AIC)](#akaike-information-criterion-aic)
2. [Bayesian Information Criterion (BIC)](#bayesian-information-criterion-bic)
3. [Mallows's C Statistic](#mallowss-c-statistic)
4. [Hannan-Quinn Criterion (HQC)](#hannan-quinn-criterion-hqc)
5. Adjusted $R^2$
6. [Minimum Description Length (MDL)](#minimum-description-length-mdl)
2. Search-Based Selection Methods
1. Exhaustive Search (Best Subsets Algorithm)
2. [Best Subsets Algorithm]
3. [Stepwise Selection Methods]
1. [Forward Selection]
2. [Backward Elimination]
3. [Stepwise (Both Directions) Selection](#stepwise-both-directions-selection)
4. [Branch-and-Bound Algorithm]
5. [Recursive Feature Elimination (RFE)](#recursive-feature-elimination-rfe)
6. [Genetic Algorithms]
3. Regularization Methods (Embedded Variable Selection)
1. [Lasso Regression] (L1 Regularization)
2. [Ridge Regression] (L2 Regularization)
3. [Elastic Net] (Combination of L1 and L2)
4. Other Criteria and Methods
1. [Prediction Error Sum of Squares (PRESS)](#prediction-error-sum-of-squares-press)
2. Bayesian Model Selection
Throughout this chapter, let $P - 1$ denote the number of potential predictor variables ($X_1, X_2, \dots, X_{P-1}$), so that the full model contains $P$ parameters, including the intercept.
------------------------------------------------------------------------
## Mallows's C Statistic {#mallowss-c-statistic}
The $C_p$ statistic (Mallows, 1973, *Technometrics*, 15, 661-675) [@mallows1995more] is a criterion used to evaluate the predictive ability of a fitted model. It balances model complexity and goodness-of-fit.
For a model with $p$ parameters, let $\hat{Y}_{ip}$ be the predicted value of $Y_i$. The **total standardized mean square error of prediction** is:
$$
\Gamma_p = \frac{\sum_{i=1}^n E(\hat{Y}_{ip} - E(Y_i))^2}{\sigma^2}
$$
Expanding $\Gamma_p$:
$$
\Gamma_p = \frac{\sum_{i=1}^n [E(\hat{Y}_{ip}) - E(Y_i)]^2 + \sum_{i=1}^n \text{Var}(\hat{Y}_{ip})}{\sigma^2}
$$
- The **first term** in the numerator represents the squared bias.
- The **second term** represents the prediction variance.
**Key Insights**
1. **Bias-Variance Tradeoff**:
- The **bias** decreases as more variables are added to the model.
- If the **full model** ($p = P$) is assumed to be the true model, $E(\hat{Y}_{ip}) - E(Y_i) = 0$, implying no bias.
- The **prediction variance** increases as more variables are added: $\sum \text{Var}(\hat{Y}_{ip}) = p \sigma^2$.
- Therefore, the optimal model balances bias and variance by minimizing $\Gamma_p$.
2. **Estimating** $\Gamma_p$: Since $\Gamma_p$ depends on unknown parameters (e.g., $\beta$), we use an estimate:
$$
C_p = \frac{SSE_p}{\hat{\sigma}^2} - (n - 2p)
$$
- $SSE_p$: Sum of squared errors for the model with $p$ predictors.
- $\hat{\sigma}^2$: Mean squared error (MSE) of the full model with all $P-1$ predictors.
3. **Properties of** $C_p$:
- As more variables are added, $SSE_p$ decreases, but the penalty term $2p$ increases.
- When there is no bias, $E(C_p) \approx p$. Hence, good models have $C_p$ values close to $p$.
4. **Model Selection Criteria**:
- **Prediction-focused models**: Consider models with $C_p \leq p$.
- **Parameter estimation-focused models**: Consider models with $C_p \leq 2p - (P - 1)$ to avoid excess bias.
------------------------------------------------------------------------
```{r}
# Simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 5 + 3*x1 - 2*x2 + rnorm(n, sd=2)
# Full model and candidate models
full_model <- lm(y ~ x1 + x2 + x3)
model_1 <- lm(y ~ x1)
model_2 <- lm(y ~ x1 + x2)
# Extract SSE and calculate Cp
calculate_cp <- function(model, full_model_mse, n) {
  sse <- sum(residuals(model)^2)
  p <- length(coefficients(model))  # number of parameters, including the intercept
  cp <- (sse / full_model_mse) - (n - 2 * p)
  return(cp)
}
# Full-model estimate of sigma^2: MSE = SSE / residual degrees of freedom
full_model_mse <- sum(residuals(full_model)^2) / df.residual(full_model)
# Cp values for each model
cp_1 <- calculate_cp(model_1, full_model_mse, n)
cp_2 <- calculate_cp(model_2, full_model_mse, n)
# Display results
cat("C_p values:\n")
cat("Model 1 (y ~ x1):", round(cp_1, 2), "\n")
cat("Model 2 (y ~ x1 + x2):", round(cp_2, 2), "\n")
```
For Mallows's $C_p$ criterion, **lower values** are preferred. Specifically:
1. **Ideal Value**: When the model fits well and contains the right terms, $C_p$ should be close to $p$, the number of parameters in the model (the predictors plus the intercept).
2. **Model Comparison**: Among competing models, prefer the one with the smallest $C_p$, provided it remains close to $p$.
3. **Overfitting Indicator**: A $C_p$ well below $p$ usually reflects sampling variability or an over-parameterized model rather than a genuinely better fit.
4. **Underfitting Indicator**: A $C_p$ much larger than $p$ indicates substantial bias: the model is underfitting and likely missing important predictors.
## Akaike Information Criterion (AIC) {#akaike-information-criterion-aic}
The **Akaike Information Criterion (AIC)** is a widely used model selection metric that evaluates the tradeoff between model fit and complexity. It was introduced by Hirotugu Akaike and is rooted in information theory, measuring the relative quality of statistical models for a given dataset.
For a model with $p$ parameters, the AIC is given by:
$$
AIC = n \ln\left(\frac{SSE_p}{n}\right) + 2p
$$
Where:
- $n$ is the number of observations.
- $SSE_p$ is the sum of squared errors for a model with $p$ parameters.
**Key Insights**
1. **Components of AIC**:
- The **first term** ($n \ln(SSE_p / n)$): Reflects the goodness-of-fit of the model. It decreases as $SSE_p$ decreases, meaning the model better explains the data.
- The **second term** ($2p$): Represents a penalty for model complexity. It increases with the number of parameters to discourage overfitting.
2. **Model Selection Principle**:
- Smaller AIC values indicate a better balance between fit and complexity.
- Adding parameters generally reduces $SSE_p$, but increases the penalty term ($2p$). If AIC increases when a parameter is added, that parameter is likely unnecessary.
3. **Tradeoff**:
- AIC emphasizes the tradeoff between:
- **Precision of fit**: Reducing the error in explaining the data.
- **Parsimony**: Avoiding unnecessary parameters to maintain model simplicity.
4. **Comparative Criterion**:
- AIC does not provide an absolute measure of model quality; instead, it compares relative performance. The model with the lowest AIC is preferred.
------------------------------------------------------------------------
```{r}
# Simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 5 + 3*x1 - 2*x2 + rnorm(n, sd=2)
# Candidate models
model_1 <- lm(y ~ x1)
model_2 <- lm(y ~ x1 + x2)
model_3 <- lm(y ~ x1 + x2 + x3)
# Function to manually compute AIC (omits additive constants, so the values differ
# from stats::AIC() by a fixed amount; model rankings are unchanged)
calculate_aic <- function(model, n) {
sse <- sum(residuals(model)^2)
p <- length(coefficients(model))
aic <- n * log(sse / n) + 2 * p
return(aic)
}
# Calculate AIC for each model
aic_1 <- calculate_aic(model_1, n)
aic_2 <- calculate_aic(model_2, n)
aic_3 <- calculate_aic(model_3, n)
# Display results
cat("AIC values:\n")
cat("Model 1 (y ~ x1):", round(aic_1, 2), "\n")
cat("Model 2 (y ~ x1 + x2):", round(aic_2, 2), "\n")
cat("Model 3 (y ~ x1 + x2 + x3):", round(aic_3, 2), "\n")
```
Interpretation
- Compare the AIC values across models:
- A smaller AIC indicates a model with a better balance between fit and complexity.
- If the AIC increases when moving from one model to another (e.g., from Model 2 to Model 3), the additional parameter(s) in the larger model may not be justified.
Advantages:
- Simple to compute and widely applicable.
- Penalizes overfitting, encouraging parsimonious models.
Limitations:
- Assumes the model errors are normally distributed and independent.
- Does not evaluate absolute model fit, only relative performance.
- Sensitive to sample size; for smaller samples, consider using the corrected version, AICc.
**Corrected AIC (AICc)**
For small sample sizes ($n / p \leq 40$), the corrected AIC, $AICc$, adjusts for the sample size:
$$
AICc = AIC + \frac{2p(p+1)}{n-p-1}
$$
This adjustment strengthens the complexity penalty when $n$ is small, counteracting AIC's tendency to select over-parameterized models in small samples. As $n$ grows, the correction term vanishes and $AICc$ converges to $AIC$.
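A minimal helper for this correction, reusing `calculate_aic()` and the models fitted in the chunk above (an assumption of this sketch), could look as follows:
```{r}
# Small-sample corrected AIC, building on calculate_aic() defined earlier
calculate_aicc <- function(model, n) {
  p <- length(coefficients(model))               # parameters, including the intercept
  aic <- calculate_aic(model, n)
  aicc <- aic + (2 * p * (p + 1)) / (n - p - 1)  # AICc correction term
  return(aicc)
}
cat("AICc values:\n")
cat("Model 1 (y ~ x1):", round(calculate_aicc(model_1, n), 2), "\n")
cat("Model 2 (y ~ x1 + x2):", round(calculate_aicc(model_2, n), 2), "\n")
cat("Model 3 (y ~ x1 + x2 + x3):", round(calculate_aicc(model_3, n), 2), "\n")
```
With $n = 100$ and at most four parameters, the correction term is small here; it matters most when $n / p$ is low.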
## Bayesian Information Criterion (BIC) {#bayesian-information-criterion-bic}
The **Bayesian Information Criterion (BIC)**, also known as Schwarz Criterion, is another popular metric for model selection. It extends the concept of AIC by introducing a stronger penalty for model complexity, particularly when the number of observations is large. BIC is grounded in Bayesian probability theory and provides a framework for selecting the most plausible model among a set of candidates.
For a model with $p$ parameters, the BIC is defined as:
$$
BIC = n \ln\left(\frac{SSE_p}{n}\right) + p \ln(n)
$$
Where:
- $n$ is the number of observations.
- $SSE_p$ is the sum of squared errors for the model with $p$ parameters.
- $p$ is the number of parameters, including the intercept.
**Key Insights**
1. **Components of BIC**:
- The **first term** ($n \ln(SSE_p / n)$): Measures the goodness-of-fit, similar to AIC. It decreases as $SSE_p$ decreases, indicating a better fit to the data.
- The **second term** ($p \ln(n)$): Penalizes model complexity. Unlike AIC's penalty ($2p$), the penalty in BIC increases with $\ln(n)$, making it more sensitive to the number of observations.
2. **Model Selection Principle**:
- Smaller BIC values indicate a better model.
- Adding parameters reduces $SSE_p$, but the penalty term $p \ln(n)$ grows more rapidly than AIC's $2p$ for large $n$. This makes BIC more conservative than AIC in selecting models with additional parameters.
3. **Tradeoff**:
- Like AIC, BIC balances:
- **Precision of fit**: Capturing the underlying structure in the data.
- **Parsimony**: Avoiding overfitting by discouraging unnecessary parameters.
- BIC tends to favor simpler models compared to AIC, particularly when $n$ is large.
4. **Comparative Criterion**:
- BIC, like AIC, is used to compare models. The model with the smallest BIC is preferred.
------------------------------------------------------------------------
```{r}
# Function to manually compute BIC
calculate_bic <- function(model, n) {
sse <- sum(residuals(model)^2)
p <- length(coefficients(model))
bic <- n * log(sse / n) + p * log(n)
return(bic)
}
# Calculate BIC for each model
bic_1 <- calculate_bic(model_1, n)
bic_2 <- calculate_bic(model_2, n)
bic_3 <- calculate_bic(model_3, n)
# Display results
cat("BIC values:\n")
cat("Model 1 (y ~ x1):", round(bic_1, 2), "\n")
cat("Model 2 (y ~ x1 + x2):", round(bic_2, 2), "\n")
cat("Model 3 (y ~ x1 + x2 + x3):", round(bic_3, 2), "\n")
```
Interpretation
- Compare the BIC values across models:
- Smaller BIC values suggest a better model.
- If BIC increases when moving to a larger model, the added complexity may not justify the reduction in $SSE_p$.
**Comparison with AIC**
| Criterion | Penalty Term | Sensitivity to Sample Size | Preferred Model Selection |
|-----------|--------------|----------------------------|---------------------------|
| **AIC** | $2p$ | Less sensitive | More parameters |
| **BIC** | $p \ln(n)$ | More sensitive | Simpler models |
- BIC generally prefers simpler models than AIC, especially when $n$ is large.
- In small datasets, AIC may perform better because BIC's penalty grows significantly with $\ln(n)$.
Advantages:
- Strong penalty for complexity makes it robust against overfitting.
- Incorporates sample size explicitly, favoring simpler models as $n$ grows.
- Easy to compute and interpret.
Limitations:
- Assumes model errors are normally distributed and independent.
- May underfit in smaller datasets where the penalty term dominates.
- Like AIC, BIC is not an absolute measure of model quality but a relative one.
## Hannan-Quinn Criterion (HQC) {#hannan-quinn-criterion-hqc}
The **Hannan-Quinn Criterion (HQC)** is a statistical metric for model selection, similar to AIC and BIC. It evaluates the tradeoff between model fit and complexity, offering a middle ground between the conservative penalty of BIC and the less stringent penalty of AIC. HQC is especially useful in time-series modeling and situations where large datasets are involved.
The HQC for a model with $p$ parameters is defined as:
$$
HQC = n \ln\left(\frac{SSE_p}{n}\right) + 2p \ln(\ln(n))
$$
Where:
- $n$: Number of observations.
- $SSE_p$: Sum of Squared Errors for the model with $p$ predictors.
- $p$: Number of parameters, including the intercept.
------------------------------------------------------------------------
**Key Insights**
1. **Components**:
- The **first term** ($n \ln(SSE_p / n)$): Measures the goodness-of-fit, similar to AIC and BIC. Smaller SSE indicates a better fit.
- The **second term** ($2p \ln(\ln(n))$): Penalizes model complexity. The penalty grows logarithmically with the sample size, similar to BIC but less severe.
2. **Model Selection Principle**:
- Smaller HQC values indicate a better balance between model fit and complexity.
- Models with lower HQC are preferred.
3. **Penalty Comparison**:
- HQC's penalty lies between that of AIC and BIC:
- AIC: $2p$ (less conservative, favors complex models).
- BIC: $p \ln(n)$ (more conservative, favors simpler models).
- HQC: $2p \ln(\ln(n))$ (balances AIC and BIC).
4. **Use Case**:
- HQC is particularly suited for large datasets or time-series models where overfitting is a concern, but BIC may overly penalize model complexity.
------------------------------------------------------------------------
```{r}
# Simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
y <- 5 + 3*x1 - 2*x2 + x3 + rnorm(n, sd=2)
# Prepare models
data <- data.frame(y, x1, x2, x3, x4)
model_1 <- lm(y ~ x1, data = data)
model_2 <- lm(y ~ x1 + x2, data = data)
model_3 <- lm(y ~ x1 + x2 + x3, data = data)
# Function to calculate HQC
calculate_hqc <- function(model, n) {
sse <- sum(residuals(model)^2)
p <- length(coefficients(model))
hqc <- n * log(sse / n) + 2 * p * log(log(n))
return(hqc)
}
# Calculate HQC for each model
hqc_1 <- calculate_hqc(model_1, n)
hqc_2 <- calculate_hqc(model_2, n)
hqc_3 <- calculate_hqc(model_3, n)
# Display results
cat("HQC values:\n")
cat("Model 1 (y ~ x1):", round(hqc_1, 2), "\n")
cat("Model 2 (y ~ x1 + x2):", round(hqc_2, 2), "\n")
cat("Model 3 (y ~ x1 + x2 + x3):", round(hqc_3, 2), "\n")
```
Interpretation
1. **Comparing HQC Values**:
- Smaller HQC values indicate a better balance between goodness-of-fit and parsimony.
- Select the model with the smallest HQC.
2. **Tradeoffs**:
- HQC balances fit and complexity more conservatively than AIC but less so than BIC.
- It is particularly useful when overfitting is a concern but avoiding overly simplistic models is also important.
Comparison with Other Criteria
| Criterion | Penalty Term | Sensitivity to Sample Size | Preferred Model Selection |
|-----------|------------------|----------------------------|----------------------------|
| **AIC** | $2p$ | Low | Favors more complex models |
| **BIC** | $p \ln(n)$ | High | Favors simpler models |
| **HQC** | $2p \ln(\ln(n))$ | Moderate | Balances fit and parsimony |
Advantages:
- Less sensitive to sample size than BIC, avoiding excessive penalization for large datasets.
- Provides a balanced approach to model selection, reducing the risk of overfitting while avoiding overly simplistic models.
- Particularly useful in time-series analysis.
Limitations:
- Like AIC and BIC, assumes model errors are normally distributed and independent.
- HQC is not as widely implemented in statistical software as AIC and BIC, so it often requires manual calculation.
Practical Considerations
- **When to use HQC?**
- When both AIC and BIC provide conflicting recommendations.
- For large datasets or time-series models where BIC may overly penalize complexity.
- **When to use AIC or BIC?**
- AIC for smaller datasets or when the goal is prediction.
- BIC for large datasets or when parsimony is critical.
## Minimum Description Length (MDL) {#minimum-description-length-mdl}
The **Minimum Description Length (MDL)** principle is a model selection framework rooted in information theory. It balances model complexity and goodness-of-fit by seeking the model that minimizes the total length of encoding the data and the model itself. MDL is a generalization of other model selection criteria like AIC and BIC but offers a more theoretical foundation.
------------------------------------------------------------------------
**Theoretical Foundation**
MDL is based on the idea that the best model is the one that compresses the data most efficiently. It represents a tradeoff between:
1. **Model Complexity**: The cost of describing the model, including the number of parameters.
2. **Data Fit**: The cost of describing the data given the model.
The total description length is expressed as:
$$
L(M, D) = L(M) + L(D | M)
$$
Where:
- $L(M)$: The length of encoding the model (complexity of the model).
- $L(D | M)$: The length of encoding the data given the model (fit to the data).
**Key Insights**
1. **Model Complexity (**$L(M)$):
- More complex models require longer descriptions, as they involve more parameters.
- Simpler models are favored unless the added complexity significantly improves the fit.
2. **Data Fit (**$L(D | M)$):
- Measures how well the model explains the data.
- Poorly fitting models require more bits to describe the residual error.
3. **Tradeoff**:
- MDL balances these two components, selecting the model that minimizes the total description length.
------------------------------------------------------------------------
Connection to Other Criteria
- MDL is closely related to BIC. In fact, the BIC criterion can be derived as an approximation of MDL for certain statistical models:
$$
BIC = n \ln(SSE_p / n) + p \ln(n)
$$
- However, MDL is more flexible and does not rely on specific assumptions about the error distribution.
------------------------------------------------------------------------
Practical Use Cases
- **Time-Series Modeling**: MDL is particularly effective for selecting models in time-series data, where overfitting is common.
- **Machine Learning**: MDL is used in regularization techniques and decision tree pruning to prevent overfitting.
- **Signal Processing**: In applications such as compression and coding, MDL directly guides optimal model selection.
------------------------------------------------------------------------
```{r}
# Simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 5 + 3*x1 - 2*x2 + x3 + rnorm(n, sd=2)
# Prepare models
data <- data.frame(y, x1, x2, x3)
model_1 <- lm(y ~ x1, data = data)
model_2 <- lm(y ~ x1 + x2, data = data)
model_3 <- lm(y ~ x1 + x2 + x3, data = data)
# Function to calculate a two-part MDL score: model cost p*log(n) plus data cost
# n*log(SSE/n), the same form as BIC
calculate_mdl <- function(model, n) {
sse <- sum(residuals(model)^2)
p <- length(coefficients(model))
mdl <- p * log(n) + n * log(sse / n)
return(mdl)
}
# Calculate MDL for each model
mdl_1 <- calculate_mdl(model_1, n)
mdl_2 <- calculate_mdl(model_2, n)
mdl_3 <- calculate_mdl(model_3, n)
# Display results
cat("MDL values:\n")
cat("Model 1 (y ~ x1):", round(mdl_1, 2), "\n")
cat("Model 2 (y ~ x1 + x2):", round(mdl_2, 2), "\n")
cat("Model 3 (y ~ x1 + x2 + x3):", round(mdl_3, 2), "\n")
```
Interpretation
1. **Choosing the Best Model**:
- The model with the smallest MDL value is preferred, as it achieves the best tradeoff between fit and complexity.
2. **Practical Implications**:
- MDL discourages overfitting by penalizing complex models that do not significantly improve data fit.
Advantages:
- Theoretically grounded in information theory.
- Offers a natural framework for balancing complexity and fit.
- Flexible and can be applied across various modeling frameworks.
Limitations:
- Computationally intensive, especially for non-linear models.
- Requires careful formulation of $L(M)$ and $L(D \mid M)$ for non-standard models.
- Less common in standard statistical software compared to AIC or BIC.
## Prediction Error Sum of Squares (PRESS) {#prediction-error-sum-of-squares-press}
The **Prediction Error Sum of Squares (PRESS)** statistic measures the predictive ability of a model by evaluating how well it performs on data not used in fitting the model. PRESS is particularly useful for assessing model validity and identifying overfitting.
The PRESS statistic for a model with $p$ parameters is defined as:
$$
PRESS_p = \sum_{i=1}^{n} (Y_i - \hat{Y}_{i(i)})^2
$$
Where:
- $\hat{Y}_{i(i)}$ is the prediction of the $i$-th response when the $i$-th observation is **omitted** during model fitting.
- $Y_i$ is the observed response for the $i$-th observation.
------------------------------------------------------------------------
**Key Insights**
1. **Leave-One-Out Cross-Validation (LOOCV)**:
- PRESS is computed by excluding each observation one at a time and predicting its response using the remaining data.
- This process evaluates the model's generalizability and reduces overfitting.
2. **Model Selection Principle**:
- Smaller values of $PRESS_p$ indicate better predictive performance.
- A small $PRESS_p$ suggests that the model captures the underlying structure of the data without overfitting.
3. **Computational Complexity**:
- Computing $\hat{Y}_{i(i)}$ for each observation can be computationally intensive for models with large $p$ or datasets with many observations.
- Alternative approximations (e.g., using leverage values) can simplify computations.
------------------------------------------------------------------------
```{r}
# Function to compute PRESS
calculate_press <- function(model) {
residuals <- residuals(model)
h <- lm.influence(model)$hat # leverage values
press <- sum((residuals / (1 - h))^2) # PRESS formula using leverage
return(press)
}
# Calculate PRESS for each model
press_1 <- calculate_press(model_1)
press_2 <- calculate_press(model_2)
press_3 <- calculate_press(model_3)
# Display results
cat("PRESS values:\n")
cat("Model 1 (y ~ x1):", round(press_1, 2), "\n")
cat("Model 2 (y ~ x1 + x2):", round(press_2, 2), "\n")
cat("Model 3 (y ~ x1 + x2 + x3):", round(press_3, 2), "\n")
```
Interpretation
- Compare the PRESS values across models:
- Models with smaller PRESS values are preferred as they exhibit better predictive ability.
- A large PRESS value indicates potential overfitting or poor model generalizability.
Advantages:
- Provides an unbiased measure of predictive performance.
- Helps identify overfitting by simulating the model's performance on unseen data.
Limitations:
- Computationally intensive for large datasets or models with many predictors.
- Sensitive to influential observations; high-leverage points can disproportionately affect results.
**Alternative Approaches**
To address the computational challenges of PRESS, alternative methods can be employed:
- **Approximation using leverage values**: As shown in the example, leverage values simplify the calculation of $\hat{Y}_{i(i)}$.
- **K-Fold Cross-Validation**: Dividing the dataset into $k$ folds reduces the computational burden compared to LOOCV while still providing robust estimates (a short sketch follows below).
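To illustrate the K-fold alternative, the sketch below estimates out-of-sample error for the three candidate models with 5-fold cross-validation in base R; the fold count and the reuse of the simulated `data` frame from the chunks above are assumptions made for illustration.
```{r}
# A minimal 5-fold cross-validation sketch (base R only), assuming the simulated
# `data` frame (y, x1, x2, x3) from the earlier chunks is available
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(data)))   # random fold assignment

cv_sse <- function(formula, data, folds, k) {
  sse <- 0
  for (fold in 1:k) {
    train <- data[folds != fold, ]
    test  <- data[folds == fold, ]
    fit   <- lm(formula, data = train)               # fit on the other k - 1 folds
    pred  <- predict(fit, newdata = test)            # predict the held-out fold
    sse   <- sse + sum((test$y - pred)^2)            # accumulate squared prediction errors
  }
  sse
}

cat("5-fold CV sum of squared prediction errors:\n")
cat("Model 1 (y ~ x1):", round(cv_sse(y ~ x1, data, folds, k), 2), "\n")
cat("Model 2 (y ~ x1 + x2):", round(cv_sse(y ~ x1 + x2, data, folds, k), 2), "\n")
cat("Model 3 (y ~ x1 + x2 + x3):", round(cv_sse(y ~ x1 + x2 + x3, data, folds, k), 2), "\n")
```
Like PRESS, these values are sums of squared prediction errors on held-out observations, so smaller is better.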
## Best Subsets Algorithm
The **Best Subsets Algorithm** is a systematic method for selecting the best combination of predictors. Unlike exhaustive search, which evaluates all possible subsets of predictors, this algorithm efficiently narrows the search space while guaranteeing the identification of the best subset for each size.
- The algorithm is based on the **"leaps and bounds"** method introduced by [@furnival2000regressions].
- It combines:
- **Comparison of SSE**: Evaluates models by their Sum of Squared Errors.
- **Control over sequence**: Optimizes the order in which subset models are computed.
- **Guarantees**: Finds the best $m$ subset models within each subset size while reducing computational burden compared to evaluating all possible subsets.
------------------------------------------------------------------------
**Key Features**
1. **Subset Comparison**:
- The algorithm ranks subsets based on a criterion such as $R^2$, adjusted $R^2$, AIC, or BIC.
- It evaluates models of varying sizes, starting from 1 predictor to $p$ predictors.
2. **Efficiency**:
- By leveraging "leaps and bounds," the algorithm avoids evaluating subsets unlikely to yield the best results.
- This reduces the computational cost significantly compared to evaluating $2^p$ subsets in exhaustive search.
3. **Output**:
- Produces the best subsets for each model size, which can be compared using criteria like AIC, BIC, or PRESS.
------------------------------------------------------------------------
```{r}
# Load the leaps package
library("leaps")
# Simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
y <- 5 + 3*x1 - 2*x2 + x3 + rnorm(n, sd=2)
# Prepare data for best subsets model
data <- data.frame(y, x1, x2, x3, x4)
# Perform best subsets model
best_subsets <- regsubsets(y ~ ., data = data, nvmax = 4)
# Summarize results
best_summary <- summary(best_subsets)
# Display model selection metrics
cat("Best Subsets Summary:\n")
cat("Adjusted R^2:\n", best_summary$adjr2, "\n")
cat("Cp:\n", best_summary$cp, "\n")
cat("BIC:\n", best_summary$bic, "\n")
# Visualize results
plot(best_subsets, scale = "adjr2")
title("Best Subsets: Adjusted R^2")
```
Interpretation
1. **Model Size**:
- Examine the metrics for models with 1, 2, 3, and 4 predictors.
- Choose the model size that optimizes your preferred metric (e.g., maximized adjusted $R^2$, minimized BIC).
2. **Model Comparison**:
- For small datasets, adjusted $R^2$ is often a reliable criterion.
- For larger datasets, use BIC or AIC to avoid overfitting.
3. **Efficiency**:
- The algorithm evaluates far fewer models than an exhaustive search while still guaranteeing optimal results for each subset size.
Advantages:
- Computationally efficient compared to evaluating all possible subsets.
- Guarantees the best subsets for each model size.
- Flexibility to use different selection criteria (e.g., $R^2$, AIC, BIC).
Limitations:
- May become computationally intensive for very large $p$ (e.g., hundreds of predictors).
- Assumes linear relationships among predictors and the outcome.
Practical Considerations
- **When to use Best Subsets?**
- For datasets with a moderate number of predictors ($p \leq 20$).
- When you need an optimal solution for each subset size.
- **Alternatives**:
- **Stepwise Selection**: Faster but less reliable.
- **Regularization Techniques**: LASSO and Ridge regression handle large $p$ and collinearity effectively.
## Stepwise Selection Methods
Stepwise selection procedures are iterative methods for selecting predictor variables. These techniques balance model simplicity and predictive accuracy by systematically adding or removing variables based on predefined criteria.
------------------------------------------------------------------------
**Notes:**
- Computer implementations often replace exact F-values with "significance" levels:
- **SLE**: Significance level to enter.
- **SLS**: Significance level to stay.
- These thresholds serve as guides rather than strict tests of significance.
Balancing SLE and SLS:
- Large **SLE** values: May include too many variables, risking overfitting.
- Small **SLE** values: May exclude important variables, leading to underfitting and overestimation of $\sigma^2$.
- A reasonable range for SLE is **0.05 to 0.5**.
- Practical advice:
- If **SLE \> SLS**, cycling may occur (adding and removing the same variable repeatedly). To fix this, set $SLS = SLE / 2$ [@bendel1977comparison].
- If **SLE \< SLS**, the procedure becomes conservative, retaining variables with minimal contribution. (A hand-rolled illustration of threshold-based entry follows these notes.)
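To make the role of an entry threshold concrete, here is a minimal forward-selection loop that admits variables while their F-test p-value stays below SLE. It uses only base R (`add1()`), reuses the simulated `data` frame (y, x1 through x4) created in the Best Subsets chunk above, and the SLE value of 0.15 is an arbitrary illustrative choice; a backward pass against SLS would work analogously with `drop1()`.
```{r}
# Hand-rolled forward selection with a significance level to enter (SLE);
# assumes the simulated `data` frame with columns y, x1, x2, x3, x4
sle <- 0.15                                    # illustrative entry threshold
candidates <- c("x1", "x2", "x3", "x4")
current <- lm(y ~ 1, data = data)              # start from the intercept-only model

repeat {
  if (length(candidates) == 0) break
  # F-tests for adding each remaining candidate to the current model
  adds  <- add1(current, scope = reformulate(candidates), test = "F")
  pvals <- setNames(adds[["Pr(>F)"]], rownames(adds))
  pvals <- pvals[!is.na(pvals)]                # drop the "<none>" row
  if (min(pvals) > sle) break                  # no candidate clears the SLE threshold
  best <- names(pvals)[which.min(pvals)]
  current <- update(current, as.formula(paste(". ~ . +", best)))
  candidates <- setdiff(candidates, best)
}
summary(current)
```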
### Forward Selection
**Forward Selection** starts with an empty model (only the intercept) and sequentially adds predictors. At each step, the variable that most improves the model fit (based on criteria like $R^2$, AIC, or F-statistic) is added. The process stops when no variable improves the model significantly.
Steps
1. Begin with the null model: $Y = \beta_0$.
2. Evaluate each predictor's contribution to the model (e.g., using F-statistic or AIC).
3. Add the predictor with the most significant improvement to the model.
4. Repeat until no remaining variable significantly improves the model.
### Backward Elimination
**Backward Elimination** starts with the full model, containing all predictors, and sequentially removes the least significant predictor (based on criteria like p-value or F-statistic). The process stops when all remaining predictors meet the significance threshold.
Steps
1. Begin with the full model: $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$.
2. Identify the predictor with the smallest contribution to the model (e.g., highest p-value).
3. Remove the least significant predictor.
4. Repeat until all remaining predictors are statistically significant.
### Stepwise (Both Directions) Selection {#stepwise-both-directions-selection}
**Stepwise Selection** combines forward selection and backward elimination. At each step, it evaluates whether to add or remove predictors based on predefined criteria. This iterative process ensures that variables added in earlier steps can be removed later if they no longer contribute significantly.
Steps
1. Start with the null model or a user-specified initial model.
2. Evaluate all predictors not in the model for inclusion (forward step).
3. Evaluate all predictors currently in the model for removal (backward step).
4. Repeat steps 2 and 3 until no further addition or removal improves the model.
### Comparison of Methods
| Method | Starting Point | Adds Predictors | Removes Predictors | Pros | Cons |
|--------------------------|------------------------|-----------------|--------------------|---------------------------------------|---------------------------------------|
| **Forward Selection** | Null model (no terms) | Yes | No | Simple, fast, useful for small models | Cannot remove irrelevant predictors |
| **Backward Elimination** | Full model (all terms) | No | Yes | Removes redundant variables | May exclude important variables early |
| **Stepwise Selection** | User-defined or null | Yes | Yes | Combines flexibility of both methods | More computationally intensive |
```{r}
# Simulated Data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
y <- 5 + 3 * x1 - 2 * x2 + x3 + rnorm(n, sd = 2)
# Prepare data
data <- data.frame(y, x1, x2, x3, x4)
# Null and Full Models
null_model <- lm(y ~ 1, data = data)
full_model <- lm(y ~ ., data = data)
# Forward Selection
forward_model <- step(
null_model,
scope = list(lower = null_model, upper = full_model),
direction = "forward"
)
# Backward Elimination
backward_model <- step(full_model, direction = "backward")
# Stepwise Selection (Both Directions)
stepwise_model <- step(
null_model,
scope = list(lower = null_model, upper = full_model),
direction = "both"
)
# Summarize Results
cat("Forward Selection:\n")
print(summary(forward_model))
cat("\nBackward Elimination:\n")
print(summary(backward_model))
cat("\nStepwise Selection:\n")
print(summary(stepwise_model))
```
Interpretation of Results
1. **Forward Selection**:
- Starts with no predictors and sequentially adds variables.
- Variables are chosen based on their contribution to improving model fit (e.g., reducing SSE, increasing $R^2$).
2. **Backward Elimination**:
- Begins with all predictors and removes the least significant one at each step.
- Stops when all remaining predictors meet the significance criterion.
3. **Stepwise Selection**:
- Combines forward selection and backward elimination.
- At each step, evaluates whether to add or remove variables based on predefined criteria.
Practical Considerations
- **SLE and SLS Tuning**:
- Choose thresholds carefully to balance model simplicity and predictive performance.
- For most applications, set $SLS = SLE / 2$ to prevent cycling.
- **Order of Entry**:
- Stepwise selection does not depend on the order in which candidate variables are listed; results depend only on the data and the significance criteria.
- **Automated Procedures**:
- Forward selection is simpler but less robust than forward stepwise.
- Backward elimination works well when starting with all predictors.
Advantages:
- Automates variable selection, reducing manual effort.
- Balances model fit and parsimony using predefined criteria.
- Flexible: Works with different metrics (e.g., SSE, $R^2$, AIC, BIC).
Limitations:
- Can be sensitive to significance thresholds (SLE, SLS).
- Risk of excluding important variables in datasets with multicollinearity.
- May overfit or underfit if SLE/SLS thresholds are poorly chosen.
Practical Use Cases
- **Forward Stepwise**:
- When starting with minimal knowledge of predictor importance.
- **Backward Elimination**:
- When starting with many predictors and needing to reduce model complexity.
- **Stepwise (Both Directions)**:
- For a balanced approach that adapts as variables are added or removed.
## Branch-and-Bound Algorithm
The **Branch-and-Bound Algorithm** is a systematic optimization method used for solving subset selection problems [@furnival2000regressions]. It identifies the best subsets of predictors by exploring the solution space efficiently, avoiding the need to evaluate all possible subsets.
The algorithm is particularly suited for problems with a large number of potential predictors. It systematically evaluates subsets of predictors, using bounds to prune the search space and reduce computational effort.
- **Branching**: Divides the solution space into smaller subsets (branches).
- **Bounding**: Calculates bounds on the best possible solution within each branch to decide whether to explore further or discard.
------------------------------------------------------------------------
**Key Features**
1. **Subset Selection**:
- Used to identify the best subset of predictors.
- Evaluates subsets based on criteria like $R^2$, adjusted $R^2$, AIC, BIC, or Mallows's $C_p$.
2. **Efficiency**:
- Avoids exhaustive search, which evaluates $2^p$ subsets for $p$ predictors.
- Reduces the computational burden by eliminating branches that cannot contain the optimal solution.
3. **Guarantee**:
- Finds the globally optimal subset for the specified criterion (see the sketch below).
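As a concrete connection, the exhaustive search in the leaps package used earlier is based on the same "leaps and bounds" branch-and-bound idea [@furnival2000regressions]. The sketch below, which reuses the simulated `data` frame with y and x1 through x4 (an assumption of this example), asks `regsubsets()` to retain the two best subsets of each size, showing that the pruned search can still return several ranked candidates.
```{r}
# Branch-and-bound subset search via leaps::regsubsets(); nbest = 2 keeps the
# two best subsets of each size (assumes `data` with columns y, x1, x2, x3, x4)
library(leaps)
bb_fit <- regsubsets(y ~ ., data = data, nbest = 2, nvmax = 4,
                     method = "exhaustive")
bb_summary <- summary(bb_fit)
# Which variables enter each retained subset, alongside its BIC
cbind(bb_summary$which, BIC = round(bb_summary$bic, 2))
```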
------------------------------------------------------------------------
**Algorithm Steps**
1. **Initialization**:
- Start with the full set of predictors.
- Define the criterion for evaluation (e.g., adjusted $R^2$, AIC, BIC).
2. **Branching**:
- Divide the predictors into smaller subsets (branches) systematically.
3. **Bounding**:
- Compute bounds for the criterion in each branch.