---
title: "Marginalizing logit models"
author: "Matthew Kay"
date: "2022-11-29"
output: github_document
---
```{r setup, warning=FALSE, message=FALSE}
library(tidyverse)
library(distributional)
library(ggdist)
library(tidybayes)
library(rstanarm)
theme_set(theme_ggdist())
```
N.B. this is an attempt to think through some ideas from [this Twitter thread](https://twitter.com/adamjnafa/status/1597504851420024832), **DO NOT take it as authoritative**.
## Data
Let's generate some data from a population with one predictor (`x`) having two
categories whose relative proportion is not 50-50, and one response variable
(`y`) whose outcome probability depends on `x`:
```{r}
p_x1 = 0.7
p_y1 = c(
  x0 = 0.3,
  x1 = 0.8
)

# thus the marginal P(y == 1) is...
marginal_p_y1 = p_y1[["x0"]] * (1 - p_x1) + p_y1[["x1"]] * p_x1

set.seed(1234)
n = 40
df = tibble(
  x = paste0("x", rbinom(n, 1, p_x1)),
  y = rbinom(n, 1, p_y1[x])
)
df
```
Visualizing the data (compared to the true value in red):
```{r data}
df |>
  group_by(x) |>
  summarize(p_y1_obs = mean(y)) |>
  ggplot(aes(x = x, y = p_y1_obs)) +
  geom_point(aes(y = p_y1[x]), pch = "-", color = "red", size = 6) +
  geom_point() +
  scale_y_continuous(limits = c(0, 1))
```
Let's fit a model:
```{r results='hide'}
m = stan_glm(y ~ x, data = df, family = binomial)
```
```{r}
m
```
## Marginal means
### Assuming equal probabilities of predictors
If we want the marginal mean for `Pr(y == 1)`, we can do a few things. One is to
calculate it assuming equal proportions of the two categories:
```{r mean_equal}
mean_equal_df = df |>
  tidyr::expand(x) |>
  add_epred_draws(m) |>
  # marginalize out x, where P(x0) == P(x1) == 0.5
  group_by(.draw) |>
  summarise(.epred = mean(.epred)) |>
  mutate(method = "equal prob")

mean_equal_df |>
  ggplot(aes(x = .epred)) +
  stat_halfeye() +
  geom_vline(xintercept = marginal_p_y1) +
  xlim(c(0, 1)) +
  labs(
    title = "Estimate of Pr(Y = 1)",
    subtitle = "Assuming Pr(X = 1) = 0.5"
  )
```
Note how this leads to a biased result because we assumed `Pr(x = x1) = 0.5` yet it
is actually 0.7. This makes sense if we think about what we're doing mathematically,
which is something like:
$$
\begin{eqnarray}
E_{X_\mathrm{equal}}(Y) &=& E(Y|X_\mathrm{equal} = 0)\cdot \Pr(X_\mathrm{equal} = 0) + E(Y|X_\mathrm{equal} = 1)\cdot \Pr(X_\mathrm{equal} = 1)\\
&=& E(Y|X_\mathrm{equal} = 0) \cdot 0.5 + E(Y|X_\mathrm{equal} = 1) \cdot 0.5
\end{eqnarray}
$$
Ideally, though, we want to marginalize over $X$, i.e. the population distribution of `x`, not
an assumed $X_\mathrm{equal}$.
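Plugging the true conditional probabilities from the data-generating process into the equation above makes the bias concrete. This quick check uses only base R and the constants defined at the top of the document, not the fitted model:

```{r}
# true conditional outcome probabilities from the data-generating process
p_y1 = c(x0 = 0.3, x1 = 0.8)
p_x1 = 0.7

# marginal mean under the (wrong) assumption of equal category proportions
equal_weight_mean = p_y1[["x0"]] * 0.5 + p_y1[["x1"]] * 0.5

# true marginal mean, weighting by the population Pr(X = x1)
true_marginal_mean = p_y1[["x0"]] * (1 - p_x1) + p_y1[["x1"]] * p_x1

equal_weight_mean  # 0.55
true_marginal_mean # 0.65
```

So even with a perfectly estimated model, the equal-weights estimand is 0.55, well below the true marginal of 0.65.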
### Using sample proportions of predictors
We could assume the sample proportions of the predictor reflect the population. The sample
proportions are:
```{r}
prop.table(table(df$x))
```
```{r mean_sample}
mean_sample_df = df |>
  add_epred_draws(m) |>
  # marginalize out x, using sample proportions of x
  group_by(.draw) |>
  summarise(.epred = mean(.epred)) |>
  mutate(method = "sample prob")

mean_sample_df |>
  ggplot(aes(x = .epred)) +
  stat_halfeye() +
  geom_vline(xintercept = marginal_p_y1) +
  xlim(c(0, 1)) +
  labs(
    title = "Estimate of Pr(Y = 1)",
    subtitle = "Assuming Pr(X = 1) = Pr(X_obs = 1) (observed proportion in sample)"
  )
```
Now we have a more sensible estimate, but I would bet that the coverage is not
nominal if we checked its calibration --- because we used the sample proportion for
`x == x1`, which is `r prop.table(table(df$x))["x1"]`, instead of the true proportion,
which is `r p_x1`.
In essence, if we observed $X_\mathrm{obs}$, we are now doing:
$$
\begin{eqnarray}
E_{X_\mathrm{obs}}(Y) &=& E(Y|X_\mathrm{obs} = 0)\cdot \Pr(X_\mathrm{obs} = 0) + E(Y|X_\mathrm{obs} = 1)\cdot \Pr(X_\mathrm{obs} = 1)\\
\end{eqnarray}
$$
Yet there is uncertainty in $X_\mathrm{obs}$ as an approximation of the population $X$, and
this uncertainty is not propagated in the above equation.
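To get a feel for how much uncertainty the plug-in sample proportion hides, here is a quick base-R simulation of the sampling distribution of the observed proportion of `x1` at our sample size. It is separate from the fitted model and reuses only the true `p_x1` and `n` from the data-generating process:

```{r}
# sampling distribution of the observed proportion of x1 when n = 40
set.seed(1234)
p_x1 = 0.7
n = 40
obs_prop = rbinom(10000, n, p_x1) / n

sd(obs_prop)
quantile(obs_prop, c(0.025, 0.975))
```

At n = 40 the observed proportion routinely lands well away from 0.7, but the equation above treats whatever proportion we happened to observe as if it were exact.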
### Using a model of the population x
What we really want is some approximation of the population $X$ that
incorporates uncertainty, say $\tilde{X}$. Then we can do something like:
$$
\begin{eqnarray}
E_{\tilde{X}}(Y) &=& E(Y|\tilde{X} = 0)\cdot \Pr(\tilde{X} = 0) + E(Y|\tilde{X} = 1)\cdot \Pr(\tilde{X} = 1)\\
\end{eqnarray}
$$
We could use bootstrapping to generate samples from $\tilde{X}$, but I'll be Lazy (TM) and
just use a Beta-binomial model of proportions with a Jeffreys $\textrm{Beta}(1/2, 1/2)$ prior,
since its posterior is available in closed form: with $n_1$ observations of `x1` and $n_0$ of
`x0`, the posterior for $\Pr(X = \texttt{x1})$ is $\textrm{Beta}(1/2 + n_1, 1/2 + n_0)$:
```{r X_sim}
x_alpha = 0.5 + table(df$x)["x1"]
x_beta = 0.5 + table(df$x)["x0"]

ggplot() +
  stat_halfeye(aes(xdist = dist_beta(x_alpha, x_beta))) +
  geom_vline(xintercept = p_x1) +
  labs(
    title = "Estimate of population Pr(X = x1)",
    subtitle = "Marginalizing over X_sim (estimate of population X distribution)"
  )
```
Now let's marginalize over $\tilde{X}$:
```{r mean_sim}
mean_sim_df = df |>
  tidyr::expand(x) |>
  add_epred_draws(m) |>
  group_by(.draw) |>
  # generate one draw of Pr(X_sim = x1) per posterior draw, shared across the
  # x0 and x1 rows so that their weights sum to 1 within each draw
  mutate(
    p_x1 = rbeta(1, x_alpha, x_beta),
    p_x = ifelse(x == "x1", p_x1, 1 - p_x1)
  ) |>
  # marginalize out X_sim
  summarise(.epred = weighted.mean(.epred, p_x)) |>
  mutate(method = "estimated pop prob")

mean_sim_df |>
  ggplot(aes(x = .epred)) +
  stat_halfeye() +
  geom_vline(xintercept = marginal_p_y1) +
  xlim(c(0, 1)) +
  labs(
    title = "Estimate of Pr(Y = 1)",
    subtitle = "Using estimate of population Pr(X = 1) with uncertainty"
  )
```
### Comparing the methods
```{r comparison}
bind_rows(
  mean_equal_df,
  mean_sample_df,
  mean_sim_df
) |>
  ggplot(aes(y = method, x = .epred)) +
  stat_halfeye() +
  geom_vline(xintercept = marginal_p_y1) +
  xlim(c(0, 1)) +
  labs(
    title = "Estimate of Pr(Y = 1) marginalizing out X in model y ~ x",
    subtitle = "True value = black line",
    y = "Method of estimating Pr(X = 0), Pr(X = 1)"
  )
```