-
Notifications
You must be signed in to change notification settings - Fork 23
/
Copy pathabd17-lion.qmd
230 lines (178 loc) · 8.47 KB
/
abd17-lion.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
---
title: "Predicting Age from Lion Nose"
author:
- name: Frank Harrell
affiliation: Department of Biostatistics<br>Vanderbilt University School of Medicine<br>MSCI Biostatistics II
date: last-modified
format:
html:
self-contained: true
number-sections: true
number-depth: 3
anchor-sections: true
smooth-scroll: true
theme: journal
toc: true
toc-depth: 3
toc-title: Contents
toc-location: left
code-link: false
code-tools: true
code-fold: show
code-block-bg: "#f1f3f5"
code-block-border-left: "#31BAE9"
reference-location: margin
fig-cap-location: margin
execute:
warning: false
message: false
major: regression modeling; ABD
minor: ordinary least squares
---
```{r setup}
require(rms) # also loads Hmisc package
options(prType='html') # makes nicer output from rms functions
getRs('abd.r')
require(ggplot2)
```
# Problem
* The lion's nose, ABD example 17.1, page 541 2nd, 546 3rd
* Dataset: 17-e-1
* N observations: 32 male lions
* Response: Age (years)
* Predictor: Proportion of black pigmentation on nose
# Data Import
```{r import}
d <- getabd('17-e-1')
# Give variables short names and create labels and units
# The .q function in Hmisc does away with need to quote
# words that are legal R names
d <- upData(d,
rename= .q(ageInYears = age,
proportionBlack = prop),
labels= .q(age = Age,
prop = 'Proportion Black Nose'),
units = .q(age = years))
contents(d) # data dictionary for imported dataset
```
# Descriptive Statistics
```{r desc}
describe(d)
```
::: {.column-margin}
When using the `html` function to translate the output to html there is a small bug in `Quarto` that keeps the name and label of the first variable from printing.
:::
# Why would understanding the Age-Nose relationship be helpful?
# Construct a scatter plot of the response and explanatory variable
A scatterplot is an excellent way to present the bivariate relationship when there are not many ties on x-y combinations. Ties would create coincident points that would not be visible on the plot. Create a function that counts the number of coincident points to within the resolution we expect the plot symbols to have (e.g., 300 points in both x- and y-directions). We'll add the number of coincident points as a subtitle on the graph.
```{r}
# with(d, ...) evaluates ... using variables inside d
# nCoincident is in the Hmisc package
nc <- with(d, nCoincident(prop, age))
nc
```
Make the plot first without capitalizing on labels/units.
```{r}
ggplot(d, aes(x=prop, y=age)) + geom_point() +
xlab('Proportion Black Nose') + ylab('Age, years') +
labs(title = 'Lion Nose Blackness as Predictor of Age',
caption = paste(nc, 'coincident points'))
```
[See [this](http://hbiostat.org/rflow/graphics.html#ggplot2) for more about plotting labels.]{.aside}
```{r}
ggplot(d, aes(x=prop, y=age)) + geom_point() +
xlab(hlab(prop)) + ylab(hlab(age)) +
labs(caption=paste(nc, 'coincident points'))
```
# How would you describe the relationship in the plot?
# Write down the linear regression model of proportion vs. age
$$E(\text{age} | \text{prop}) = \beta_{0} + \beta_{1}\times \text{prop}$$
$$\text{age} = \beta_{0} + \beta_{1}\times \text{prop} + \epsilon$$
* Constant (intercept) $\beta_0$ = $E(\text{age} | \text{prop}=0)$
* Slope $\beta_1$: change in mean age for prop going from 0 to 1 (not realistic)
* Residual SD $\sigma$: SD of residuals (errors $\epsilon$); $\sigma^2$ is the variance of age not explained by prop
* How would the model change if the predictor were on the percent scale?
* How would the model change if age were on the decade scale?
* What scale is $\sigma^2$?
# Fit a linear regression model
We use the `rms` package `ols` (ordinary least squares) function, which enhances the built-in R function `lm`. First, save distributional summaries of all the variables so that predictor ranges can be set automatically for plotting.
```{r}
dd <- datadist(d); options(datadist='dd')
f <- ols(age ~ prop, data=d, x=TRUE) # x=TRUE needed for studentized residuals later
f
# Create LaTeX math notation for fitted model
# Quarto / RMarkdown will automatically typeset this
latex(f)
```
* What are the parameter estimates? $\hat{\beta}_{0} = 0.88, \hat{\beta}_{1} = 10.65, \hat{\sigma} = 1.67$
* What is the uncertainty of the model parameters? Uncertainties of individual parameters may be assessed by standard errors. For example if the model is correctly specified the root mean squared error in estimating the slope is estimated to be 1.51. Uncertainty in $\hat{\sigma}$ is not commonly estimated.
# Plot the linear regression model with confidence bands for the mean
```{r}
# Predict(f, prop) will get predicted values from the 10th smallest
# to 10th largest observed prop
# Let's override that and plot over the whole observed range
# Add raw data to plot
# Type ?Predict or ?ggplot.Predict to get function documentation
propr <- with(d, range(prop)) # creates vector of length 2 with min&max
# Compute vector of 100 equally spaced proportions from min to max
pseq <- seq(propr[1], propr[2], length=100)
# Omit geom_point(...) to omit raw data points
ggplot(Predict(f, prop=pseq)) + geom_point(aes(prop, age), d)
```
* Assess the model fit
# Make a scatter plot of raw & studentized residuals against prop
Studentized residuals are normalized for the standard deviation of the estimated mean and leave out the current observation for computing the regression coefficient estimates that are used to compute residuals.
First plot the two types of residuals against each other to help in understanding.
```{r}
# Type ?residuals.ols to get documentation for `resid` which is
# a short name for `residuals.ols`
r <- resid(f)
rstud <- resid(f, type='studentized') # needs rms 6.4-1
ggplot(mapping=aes(r, rstud)) + geom_point()
```
For this one-predictor case the two types of residuals are telling us the same thing, just on different scales. So just plot the ordinary residual vs. the predictor.
```{r}
ggplot(d, aes(prop, r)) + geom_point()
```
* What aspect of model fit does this plot communicate?
* What would the plot look like if the model fits well?
* What would the plot look like if the model fits poorly?
# Using the model
* How might you use the model?
* What is the difference between "estimating the mean" and "prediction of age for a new lion"?
* Is there a difference between estimating a population parameter and predicting an individual outcome?
* Which of the two (population vs individual outcome) are we "more certain"?
# Plot the linear regression model fit with two types of bands
The `rms` package `Predict` function by default computes confidence limits for estimating the population mean. This is changed using the `conf.type` argument.
```{r}
a <- Predict(f, prop=pseq) # normally Predict(f, prop)
b <- Predict(f, prop=pseq, conf.type='individual')
ab <- rbind(Mean=a, Individual=b)
# Plot predicted means and two types of bands without raw data
ggplot(ab) +
guides(color=guide_legend(title='')) + # omit legend title
theme(legend.position='bottom') # put legend at bottom
# To add raw data to this plot the raw dataset must have a variable
# added to it (.set.) that rbind used to distinguish types of bands
# We set .set. to blank for points as this is not applicable
d$.set. <- ''
ggplot(ab, legend.position='none') +
geom_point(aes(prop, age), data=d)
# Another approach to including raw data, not using rbind
ggplot(b) + geom_point(aes(prop, age), d) +
geom_ribbon(aes(x=prop, ymin=lower, ymax=upper), a, alpha=0.2, fill='red') +
guides(color=guide_legend(title='')) + # omit legend title
theme(legend.position='bottom') # put legend at bottom
```
# A different summary measure of individual lion predictive accuracy
Compute the mean absolute prediction error (mean $|\hat{\epsilon}|$). This has the advantage of
* quantifying accuracy on the original age scale
* being robust to outliers
```{r}
mean(abs(r)) # = mean(abs(resid(f)))
```
$\rightarrow$ predicted ages based on proportion black nose are off by `r round(mean(abs(r)), 2)` years on the average. This can be compared against the typical difference in actual ages between lions in the sample (Gini's mean difference) of only `r round(GiniMd(d$age), 1)`. This is in line with the estimated $R^{2}_\text{adj}$ of only 0.61. The model is good at predicting long-term average but not so much for estimating ages of individual lions.
# Computing Environment
```{r echo=FALSE,results='asis'}
markupSpecs$html$session()
```