-
Notifications
You must be signed in to change notification settings - Fork 0
/
05binary.Rmd
167 lines (129 loc) · 3.91 KB
/
05binary.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
# Prediction from binary outcome
```{block, type='rmdcomment'}
In this chapter, we will talk about Regression that deals with prediction of binary outcomes. We will use logistic regression to build the first prediction model.
```
- Watch the video describing this chapter [![video](https://img.youtube.com/vi/YPOP82Wxfy8/maxresdefault.jpg)](https://youtu.be/YPOP82Wxfy8)
```{r setup05, include=FALSE}
require(Publish)
```
## Read previously saved data
```{r}
ObsData <- readRDS(file = "data/rhcAnalytic.RDS")
```
## Outcome levels (factor)
- Label
- Possible values of outcome
```{r}
levels(ObsData$Death)=c("No","Yes") # this is useful for caret
# ref: https://tinyurl.com/caretbin
class(ObsData$Death)
table(ObsData$Death)
```
## Measuring prediction error
```{block, type='rmdcomment'}
- Brier score
- Brier score 0 means perfect prediction, and
- close to zero means better prediction,
- 1 being the worst prediction.
- Less accurate forecasts get higher score in Brier score.
- AUC
- The area under a ROC curve is called as a c statistics.
- c being 0.5 means random prediction and
- 1 indicates perfect prediction
```
### Prediction for death
```{block, type='rmdcomment'}
In this section, we show the regression fitting when outcome is binary (death).
```
## Variables
```{r vars, cache=TRUE, echo = TRUE}
baselinevars <- names(dplyr::select(ObsData,
!c(Length.of.Stay,Death)))
baselinevars
```
## Model
```{r reg3, cache=TRUE, echo = TRUE, results='hide'}
# adjust covariates
out.formula2 <- as.formula(paste("Death~ ",
paste(baselinevars,
collapse = "+")))
saveRDS(out.formula2, file = "data/form2.RDS")
fit2 <- glm(out.formula2, data = ObsData,
family = binomial(link = "logit"))
require(Publish)
adj.fit2 <- publish(fit2, digits=1)$regressionTable
```
```{r reg3a, cache=TRUE, echo = TRUE}
out.formula2
adj.fit2
```
## Measuring prediction error
### AUC
```{r}
require(pROC)
obs.y2<-ObsData$Death
pred.y2 <- predict(fit2, type = "response")
rocobj <- roc(obs.y2, pred.y2)
rocobj
plot(rocobj)
auc(rocobj)
```
### Brier Score
```{r}
require(DescTools)
BrierScore(fit2)
```
## Cross-validation using caret
### Basic setup
```{r cvbin, cache= TRUE}
# Using Caret package
set.seed(504)
# make a 5-fold CV
require(caret)
ctrl<-trainControl(method = "cv", number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# fit the model with formula = out.formula2
# use training method glm (have to specify family)
fit.cv.bin<-train(out.formula2, trControl = ctrl,
data = ObsData, method = "glm",
family = binomial(),
metric="ROC")
fit.cv.bin
```
### Extract results from each test data
```{r cvbin2, cache= TRUE}
summary.res <- fit.cv.bin$resample
summary.res
mean(fit.cv.bin$resample$ROC)
sd(fit.cv.bin$resample$ROC)
```
### More options
```{r cvbincs, cache= TRUE}
ctrl<-trainControl(method = "cv", number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary)
fit.cv.bin<-train(out.formula2, trControl = ctrl,
data = ObsData, method = "glm",
family = binomial(),
metric="ROC",
preProc = c("center", "scale"))
fit.cv.bin
```
## Variable selection
We can also use stepwise regression that uses AIC as a criterion.
```{r cvbinaic, cache= TRUE, results = 'hide', warning=FALSE, message=FALSE, error=FALSE}
set.seed(504)
ctrl<-trainControl(method = "cv", number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary)
fit.cv.bin.aic<-train(out.formula2, trControl = ctrl,
data = ObsData, method = "glmStepAIC",
direction ="backward",
family = binomial(),
metric="ROC")
```
```{r cvbinaic2, cache= TRUE}
fit.cv.bin.aic
summary(fit.cv.bin.aic)
```