---
title: "Variable Importance Analysis with Stochastic Interventions"
author: "[Nima Hejazi](https://nimahejazi.org), [Jeremy
Coyle](https://github.com/jeremyrcoyle), and [Mark van der
Laan](https://vanderlaan-lab.org)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: ../inst/REFERENCES.bib
vignette: >
  %\VignetteIndexEntry{Variable Importance Analysis with Stochastic Interventions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r options, echo=FALSE, eval=FALSE}
options(scipen=999)
```
## Introduction
Stochastic treatment regimes offer a relatively simple way to assess the
effects of continuous treatments, via parameters that quantify the effects
induced by counterfactually shifting the observed values of a treatment of
interest. Here, we present an implementation of a new algorithm for computing
targeted minimum loss-based estimates of treatment shift parameters defined by
a shifting function $d(A,W)$. For a technical presentation of the algorithm,
the interested reader is invited to consult @diaz2018stochastic. For additional
background on Targeted Learning and previous work on stochastic treatment
regimes, please consider consulting @vdl2011targeted, @vdl2018targeted, and
@diaz2012population.
To start, let's load the packages we'll use and set a seed for simulation:
```{r setup, message=FALSE, warning=FALSE}
library(data.table)
library(sl3)
library(tmle3)
library(tmle3shift)
set.seed(429153)
```
## Data and Notation
Consider $n$ observed units $O_1, \ldots, O_n$, where each random variable $O =
(W, A, Y)$ corresponds to a single observational unit. Let $W$ denote baseline
covariates (e.g., age, sex, education level), $A$ an intervention variable of
interest (e.g., nutritional supplements), and $Y$ an outcome of interest (e.g.,
disease status). Though it need not be the case, let $A$ be continuous-valued,
i.e., $A \in \mathbb{R}$. Let $O_i \sim \mathcal{P} \in \mathcal{M}$, where
$\mathcal{M}$ is the nonparametric statistical model defined as the set of
continuous densities on $O$ with respect to some dominating measure. To
formalize the definition of stochastic interventions and their corresponding
causal effects, we introduce a nonparametric structural equation model (NPSEM),
based on @pearl2000causality, to define how the system changes under posited
interventions:
\begin{align*}\label{eqn:npsem}
  W &= f_W(U_W) \\
  A &= f_A(W, U_A) \\
  Y &= f_Y(A, W, U_Y).
\end{align*}
We denote the observed data structure by $O = (W, A, Y)$.
Letting $A$ denote a continuous-valued treatment, we assume that the
distribution of $A$ conditional on $W = w$ has support in the interval
$(l(w), u(w))$ -- for convenience, assume this support condition holds _a.e._ That is, the
minimum natural value of treatment $A$ for an individual with covariates
$W = w$ is $l(w)$; similarly, the maximum is $u(w)$. Then, a simple stochastic
intervention, based on a shift $\delta$, may be defined
\begin{equation}\label{eqn:shift}
d(a, w) =
\begin{cases}
a - \delta & \text{if } a > l(w) + \delta \\
a & \text{if } a \leq l(w) + \delta,
\end{cases}
\end{equation}
where $0 \leq \delta \leq u(w)$ is an arbitrary pre-specified value that
defines the degree to which the observed value $A$ is to be shifted, where
possible.
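To make the piecewise definition above concrete, the following base R sketch
implements this simple shift; the function name `shift_simple` and its
vectorized form are illustrative only, not the package's implementation:

```r
# Shift a downward by delta only when the result stays above the lower
# support bound l(w); otherwise, leave the observed value unchanged.
shift_simple <- function(a, lower, delta) {
  ifelse(a > lower + delta, a - delta, a)
}

# values within delta of the lower bound are left unshifted
shift_simple(a = c(0.5, 2, 5), lower = 0, delta = 1)
#> [1] 0.5 1.0 4.0
```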
### Simulate Data
```{r sim_data}
# simulate simple data for tmle-shift sketch
n_obs <- 1000 # number of observations
n_w <- 1 # number of baseline covariates
tx_mult <- 2 # multiplier for the effect of W = 1 on the treatment
# baseline covariates -- simple, binary
W <- as.numeric(replicate(n_w, rbinom(n_obs, 1, 0.5)))
# create treatment based on baseline W
A <- as.numeric(rnorm(n_obs, mean = tx_mult * W, sd = 1))
# create outcome as a linear function of A, W + white noise
Y <- A + W + rnorm(n_obs, mean = 0, sd = 0.5)
```
The above composes our observed data structure $O = (W, A, Y)$. To formally
express this fact using the `tlverse` grammar introduced by the [`tmle3`
package](https://github.com/tlverse/tmle3), we create a single data object and
specify the functional relationships between the nodes in the _directed acyclic
graph_ (DAG) via _nonparametric structural equation models_ (NPSEMs), reflected
in the node list that we set up:
```{r data_nodes}
# organize data and nodes for tmle3
data <- data.table(W, A, Y)
node_list <- list(W = "W", A = "A", Y = "Y")
head(data)
```
We now have an observed data structure (`data`) and a specification of the role
that each variable in the data set plays as the nodes in a DAG.
## Methodology
### Defining a grid of counterfactual interventions
In order to specify a _grid_ of shifts $\delta$ to be used in defining a set of
stochastic intervention policies in an _a priori_ manner, let us consider an
arbitrary scalar $\delta$ that defines a counterfactual outcome $\psi_n =
Q_n(d(A, W), W)$, where, for simplicity, let $d(A, W) = A + \delta$. A
simplified expression of the auxiliary covariate for the TML estimator of $\psi$
is $H_n = \frac{g^{\star}(a \mid w)}{g(a \mid w)}$, where $g^{\star}(a \mid w)$
defines the treatment mechanism with the stochastic intervention implemented. To
ascertain whether a given choice of the shift $\delta$ is _admissible_ -- that
is, whether such an intervention may be implemented while avoiding violations of
the positivity assumption -- define a bound $C(\delta) := \frac{g^{\star}(a
\mid w)}{g(a \mid w)} \leq M$, where $g^{\star}(a \mid w)$ is a function of
$\delta$ in part, and $M$ is a potentially user-specified upper bound of
$C(\delta)$. Then, $C(\delta)$ may be interpreted as a measure of the influence
of a given observation, providing a way to limit the maximum influence of any
one observation through the choice of the shift $\delta$ and of the bound $M$.
We formalize and extend the procedure to determine an acceptable set of values
for the shift $\delta$ in the sequel. Specifically, let there be a shift $d(a,
w) = a + \delta$, where the shift $\delta$ is defined
\begin{equation}
\delta(a, w) =
\begin{cases}
\delta, & \delta_{\text{min}}(a,w) \leq \delta \leq
\delta_{\text{max}}(a,w) \\
\delta_{\text{max}}(a,w), & \delta \geq \delta_{\text{max}}(a,w) \\
\delta_{\text{min}}(a,w), & \delta \leq \delta_{\text{min}}(a,w) \\
\end{cases},
\end{equation}
where $$\delta_{\text{max}}(a, w) = \text{argmax}_{\left\{\delta \geq 0,
\frac{g(a - \delta \mid w)}{g(a \mid w)} \leq M \right\}} \frac{g(a - \delta
\mid w)}{g(a \mid w)}$$ and
$$\delta_{\text{min}}(a, w) = \text{argmin}_{\left\{\delta \leq 0,
\frac{g(a - \delta \mid w)}{g(a \mid w)} \leq M \right\}} \frac{g(a - \delta
\mid w)}{g(a \mid w)}.$$
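As a concrete illustration of $\delta_{\text{max}}$, the sketch below performs
a naive grid search for the largest contiguous shift whose density ratio
remains below $M$, assuming (purely for illustration) that the treatment
density $g(a \mid w)$ is the known $N(2w, 1)$ used in the simulation above; in
practice, $g$ must be estimated:

```r
# Grid search for delta_max at a single observation (a, w): the largest
# contiguous shift (starting from zero) keeping the density ratio
# g(a - delta | w) / g(a | w) at or below the bound M.
delta_max <- function(a, w, M, tx_mult = 2, grid = seq(0, 5, by = 0.01)) {
  ratio <- dnorm(a - grid, mean = tx_mult * w, sd = 1) /
    dnorm(a, mean = tx_mult * w, sd = 1)
  ok <- ratio <= M
  if (all(ok)) return(max(grid))
  grid[which.min(ok) - 1] # last grid point before the ratio first exceeds M
}

delta_max(a = 3, w = 1, M = 1.5)
#> [1] 0.56
```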
The above provides a strategy for implementing a shift at the level of a given
observation $(a, w)$, thereby allowing for all observations to be shifted to an
appropriate value -- whether $\delta_{\text{min}}$, $\delta$, or
$\delta_{\text{max}}$. For the purpose of using such a shift in practice, the
present software provides the functions `shift_additive_bounded` and
`shift_additive_bounded_inv`, which define a variation of this shift:
\begin{equation}
\delta(a, w) =
\begin{cases}
\delta, & C(\delta) \leq M \\
0, & \text{otherwise} \\
\end{cases},
\end{equation}
which corresponds to an intervention in which the natural value of treatment of
a given observational unit is shifted by a value $\delta$ in the case that the
ratio of the intervened density $g^{\star}(a \mid w)$ to the natural density
$g(a \mid w)$ (that is, $C(\delta)$) does not exceed a bound $M$. In the case
that the ratio $C(\delta)$ exceeds the bound $M$, the stochastic intervention
policy does not apply to the given unit and they remain at their natural value
of treatment $a$.
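The behavior just described can be sketched in a few lines of base R, again
assuming the treatment density is the known $N(2w, 1)$ from the simulation
(the function name `shift_bounded` is illustrative; the package's
`shift_additive_bounded` works with an estimated $g$):

```r
# Apply the additive shift delta only where the density ratio
# C(delta) = g(a - delta | w) / g(a | w) does not exceed M; units whose
# ratio exceeds M keep their natural treatment value.
shift_bounded <- function(a, w, delta, M, tx_mult = 2) {
  ratio <- dnorm(a - delta, mean = tx_mult * w, sd = 1) /
    dnorm(a, mean = tx_mult * w, sd = 1)
  ifelse(ratio <= M, a + delta, a)
}

# a unit near its conditional mean is shifted; a unit far in the upper
# tail is left at its natural value
shift_bounded(a = c(2, 4.5), w = c(1, 1), delta = 2, M = 2)
#> [1] 4.0 4.5
```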
### _Interlude:_ Constructing Optimal Stacked Regressions with `sl3`
To easily incorporate ensemble machine learning into the estimation procedure,
we rely on the facilities provided in the [`sl3` R
package](https://tlverse.org/sl3). For a complete guide on using the `sl3` R
package, consider consulting https://tlverse.org/sl3, or https://tlverse.org for
the [`tlverse` ecosystem](https://github.com/tlverse), of which `sl3` is a core
engine.
Using the framework provided by the [`sl3` package](https://tlverse.org/sl3),
the nuisance parameters of the TML estimator may be fit with ensemble learning,
using the cross-validation framework of the Super Learner algorithm of
@vdl2007super. To estimate the treatment mechanism (often denoted "g" in the
targeted learning literature), we must make use of learning algorithms
specifically suited to conditional density estimation; a list of such learners
may be extracted from `sl3` by using `sl3_list_learners()`:
```{r sl3_density_lrnrs_search}
sl3_list_learners("density")
```
To proceed, we'll select two of the above learners: `Lrnr_haldensify`, which
uses the highly adaptive lasso for conditional density estimation, based on an
algorithm given by @diaz2011super, and `Lrnr_density_semiparametric`, an
approach for semiparametric conditional density estimation:
```{r sl3_density_lrnrs}
# learners used for conditional density regression (i.e., propensity score)
haldensify_lrnr <- Lrnr_haldensify$new(
n_bins = 3, grid_type = "equal_mass",
lambda_seq = exp(seq(-1, -9, length = 100))
)
hse_lrnr <- Lrnr_density_semiparametric$new(mean_learner = Lrnr_glm$new())
mvd_lrnr <- Lrnr_density_semiparametric$new(mean_learner = Lrnr_glm$new(),
var_learner = Lrnr_mean$new())
sl_lrn_dens <- Lrnr_sl$new(
learners = list(haldensify_lrnr, hse_lrnr, mvd_lrnr),
metalearner = Lrnr_solnp_density$new()
)
```
We also require an approach for estimating the outcome regression (often
denoted "Q" in the targeted learning literature). For this, we build a Super
Learner composed of an intercept model, a GLM, and the [xgboost
algorithm](https://xgboost.ai/) for gradient boosting:
```{r sl3_regression_lrnrs}
# learners used for conditional expectation regression (e.g., outcome)
mean_lrnr <- Lrnr_mean$new()
glm_lrnr <- Lrnr_glm$new()
xgb_lrnr <- Lrnr_xgboost$new()
sl_lrn <- Lrnr_sl$new(
learners = list(mean_lrnr, glm_lrnr, xgb_lrnr),
metalearner = Lrnr_nnls$new()
)
```
We can make the above explicit with respect to standard notation by bundling
the ensemble learners into a `list` object below.
```{r make_lrnr_list}
# specify outcome and treatment regressions and create learner list
Q_learner <- sl_lrn
g_learner <- sl_lrn_dens
learner_list <- list(Y = Q_learner, A = g_learner)
```
The `learner_list` object above specifies the role that each of the ensemble
learners we've generated is to play in computing initial estimators to be used
in building a TMLE for the parameter of interest here. In particular, it makes
explicit the fact that our `Q_learner` is used in fitting the outcome regression
while our `g_learner` is used in fitting our treatment mechanism regression.
### Initializing `vimshift` through its `tmle3_Spec`
To start, we will initialize a specification for the TMLE of our parameter of
interest (called a `tmle3_Spec` in the `tlverse` nomenclature) simply by calling
`tmle_vimshift_delta`. We specify the argument `shift_grid = seq(-1, 1, by = 1)`
when initializing the `tmle3_Spec` object to communicate that we're interested
in assessing the mean counterfactual outcome over a grid of shifts
`r seq(-1, 1, by = 1)` on the scale of the treatment $A$ (note that this
numerical choice of shifts is arbitrary, made only for this example).
```{r vim_spec_init}
# what's the grid of shifts we wish to consider?
delta_grid <- seq(-1, 1, 1)
# initialize a tmle specification
tmle_spec <- tmle_vimshift_delta(shift_fxn = shift_additive_bounded,
shift_fxn_inv = shift_additive_bounded_inv,
shift_grid = delta_grid,
max_shifted_ratio = 2)
```
As seen above, the `tmle_vimshift_delta` specification object (like all `tmle3_Spec`
objects) does _not_ store the data for our specific analysis of interest. Later,
we'll see that passing a data object directly to the `tmle3` wrapper function,
alongside the instantiated `tmle_spec`, will serve to construct a `tmle3_Task`
object internally (see the `tmle3` documentation for details).
### Targeted Estimation of Stochastic Interventions Effects
One may walk through the step-by-step procedure for fitting the TML estimator
of the mean counterfactual outcome under each shift in the grid, using the
machinery exposed by the [`tmle3` R package](https://tlverse.org/tmle3) (see
below); however, this step-by-step procedure is rarely of interest in practice.
```{r fit_tmle_manual, eval=FALSE}
# NOT RUN -- SEE NEXT CODE CHUNK
# define data (from tmle3_Spec base class)
tmle_task <- tmle_spec$make_tmle_task(data, node_list)
# define likelihood (from tmle3_Spec base class)
likelihood_init <- tmle_spec$make_initial_likelihood(tmle_task, learner_list)
# define update method (fluctuation submodel and loss function)
updater <- tmle_spec$make_updater()
likelihood_targeted <- Targeted_Likelihood$new(likelihood_init, updater)
# invoke params specified in spec
tmle_params <- tmle_spec$make_params(tmle_task, likelihood_targeted)
updater$tmle_params <- tmle_params
# fit TML estimator update
tmle_fit <- fit_tmle3(tmle_task, likelihood_targeted, tmle_params, updater)
# extract results from tmle3_Fit object
tmle_fit
```
Instead, one may invoke the `tmle3` wrapper function (a user-facing convenience
utility) to fit the series of TML estimators (one for each parameter defined by
the $\delta$-grid) in a single function call:
```{r fit_tmle_auto}
# fit the TML estimator
tmle_fit <- tmle3(tmle_spec, data, node_list, learner_list)
tmle_fit
```
_Remark_: The `print` method of the resultant `tmle_fit` object conveniently
displays the results from computing our TML estimator.
### Inference with Marginal Structural Models
In the preceding section, we considered estimating the mean counterfactual
outcome $\psi_n$ under several values of the intervention $\delta$, taken from
the aforementioned $\delta$-grid. We now turn our attention
to an approach for obtaining inference on a single summary measure of these
estimated quantities. In particular, we propose summarizing the estimates
$\psi_n$ through a marginal structural model (MSM), obtaining inference by way
of a hypothesis test on a parameter of this working MSM. For a data structure
$O = (W, A, Y)$, let $\psi_{\delta}(P_0)$ be the mean outcome under a shift
$\delta$ of the treatment, so that we have $\vec{\psi}_{\delta} =
(\psi_{\delta}: \delta)$ with corresponding estimators $\vec{\psi}_{n, \delta}
= (\psi_{n, \delta}: \delta)$. Further, let $\beta(\vec{\psi}_{\delta}) =
\phi((\psi_{\delta}: \delta))$.
For a given MSM $m_{\beta}(\delta)$, we have that
$$\beta_0 = \text{argmin}_{\beta} \sum_{\delta}(\psi_{\delta}(P_0) -
m_{\beta}(\delta))^2 h(\delta),$$
which is the solution to
$$u(\beta, (\psi_{\delta}: \delta)) = \sum_{\delta}h(\delta)
\left(\psi_{\delta}(P_0) - m_{\beta}(\delta) \right) \frac{d}{d\beta}
m_{\beta}(\delta) = 0.$$
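For a linear MSM, this estimating equation is exactly the weighted
least-squares normal equations, so $\beta$ can be recovered with a weighted
`lm()` fit. The point estimates and weights below are hypothetical values,
used only to illustrate the computation:

```r
# Solve the MSM score equation for m_beta(delta) = beta0 + beta1 * delta
# via weighted least squares over the shift grid.
delta <- seq(-1, 1, by = 1)
psi <- c(2.1, 3.0, 4.2)    # hypothetical counterfactual mean estimates
h <- rep(1, length(delta)) # uniform weights h(delta)
msm_fit <- lm(psi ~ delta, weights = h)
coef(msm_fit)
```

For these hypothetical values, the fit gives an intercept of 3.1 and a slope
of 1.05.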
This then leads to the following expansion
$$\beta(\vec{\psi}_n) - \beta(\vec{\psi}_0) \approx -\frac{d}{d\beta} u(\beta_0,
\vec{\psi}_0)^{-1} \frac{d}{d\psi} u(\beta_0, \psi_0)(\vec{\psi}_n -
\vec{\psi}_0),$$
where we have
$$\frac{d}{d\beta} u(\beta, \psi) = -\sum_{\delta} h(\delta) \frac{d}{d\beta}
m_{\beta}(\delta)^t \frac{d}{d\beta} m_{\beta}(\delta)
-\sum_{\delta} h(\delta) m_{\beta}(\delta) \frac{d^2}{d\beta^2}
m_{\beta}(\delta),$$
which, in the case of an MSM that is a linear model (since
$\frac{d^2}{d\beta^2} m_{\beta}(\delta) = 0$), reduces simply to
$$\frac{d}{d\beta} u(\beta, \psi) = -\sum_{\delta} h(\delta) \frac{d}{d\beta}
m_{\beta}(\delta)^t \frac{d}{d\beta} m_{\beta}(\delta),$$
and
$$\frac{d}{d\psi}u(\beta, \psi)(\psi_n - \psi_0) = \sum_{\delta} h(\delta)
\frac{d}{d\beta} m_{\beta}(\delta) (\psi_n - \psi_0)(\delta),$$
which we may write in terms of the efficient influence function (EIF) of
$\psi$ by using the first-order approximation $(\psi_n - \psi_0)(\delta) =
\frac{1}{n}\sum_{i = 1}^n \text{EIF}_{\psi_{\delta}}(O_i)$, where
$\text{EIF}_{\psi_{\delta}}$ is the EIF of $\psi_{\delta}$.
Now, supposing $\vec{\psi} = (\psi(\delta): \delta)$ is $d$-dimensional, we may
write the efficient influence function of the MSM parameter $\beta$ (assuming a
linear MSM) as follows:
$$\text{EIF}_{\beta}(O) = \left(\sum_{\delta} h(\delta) \frac{d}{d\beta}
m_{\beta}(\delta) \frac{d}{d\beta} m_{\beta}(\delta)^t \right)^{-1} \cdot
\sum_{\delta} h(\delta) \frac{d}{d\beta} m_{\beta}(\delta)
\text{EIF}_{\psi_{\delta}}(O),$$ where the first (inverse) term is of dimension
$p \times p$ and the second is of dimension $p \times 1$, with $p$ the
dimension of $\beta$.
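In matrix form, with a design matrix $D$ holding the gradient $(1, \delta)$ for
each grid point and $H = \text{diag}(h(\delta))$, the display above reads
$\text{EIF}_{\beta}(O) = (D^{\top} H D)^{-1} D^{\top} H \, \text{EIF}_{\psi}(O)$.
A sketch with a simulated EIF matrix (illustration only; actual EIF values
come from the TMLE fit):

```r
set.seed(71)
delta <- seq(-1, 1, by = 1)
D <- cbind(1, delta)                   # d/dbeta of beta0 + beta1 * delta
H <- diag(rep(1, length(delta)))      # weights h(delta), uniform here
eif_psi <- matrix(rnorm(5 * 3), 5, 3) # rows: units; cols: shifts delta
# per-unit EIF for beta: project each row of eif_psi through the MSM design
eif_beta <- eif_psi %*% H %*% D %*% solve(t(D) %*% H %*% D)
dim(eif_beta) # one EIF value per unit for each of beta0 and beta1
#> [1] 5 2
```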
In an effort to generalize still further, consider the case where
$\psi_{\delta}(P_0) \in (0, 1)$ -- that is, $\psi_{\delta}(P_0)$ corresponds
to the probability of some event of interest. In such a case, it would be more
natural to consider a logistic MSM
$$m_{\beta}(\delta) = \frac{1}{1 + \exp(-f_{\beta}(\delta))},$$
where $f_{\beta}$ is taken to be linear in $\beta$ (e.g.,
$f_{\beta} = \beta_0 + \beta_1 \delta + \ldots$). In such a case, we have the
parameter of interest
$$\beta_0 = \text{argmax}_{\beta} \sum_{\delta} \left(\psi_{\delta}(P_0)
\log m_{\beta}(\delta) + (1 - \psi_{\delta}(P_0))\log(1 -
m_{\beta}(\delta))\right)h(\delta),$$
where $\beta_0$ solves the following
$$
\sum_{\delta} h(\delta) \frac{d}{d\beta} f_{\beta}(\delta) (\psi_{\delta}(P_0)
- m_{\beta}(\delta)) = 0.$$
Inference from a working MSM is rather straightforward. To wit, the limiting
distribution of the estimator $\beta_n$ may be expressed as
$$\sqrt{n}(\beta_n - \beta_0) \to N(0, \Sigma),$$
where $\Sigma$ is the covariance matrix of $\text{EIF}_{\beta}(O)$, estimated
in practice by its empirical counterpart.
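This limiting distribution yields Wald-type confidence intervals. A minimal
sketch, using a simulated EIF matrix and hypothetical $\beta_n$ values purely
for illustration:

```r
set.seed(3129)
n <- 200
eif_beta <- matrix(rnorm(n * 2), n, 2) # one row of EIF values per unit
beta_n <- c(3.1, 1.05)                 # hypothetical MSM parameter estimates
# standard errors from the empirical covariance of the EIF, scaled by n
se <- sqrt(diag(cov(eif_beta)) / n)
ci <- cbind(lower = beta_n - qnorm(0.975) * se,
            upper = beta_n + qnorm(0.975) * se)
ci
```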
#### Directly Targeting the MSM Parameter $\beta$
Note that in the above, a working MSM is fit to the individual TML estimates of
the mean counterfactual outcome under a given value of the shift $\delta$ in the
supplied grid. The parameter of interest $\beta$ of the MSM is asymptotically
linear (and, in fact, a TML estimator) as a consequence of its construction from
individual TML estimators. In smaller samples, it may be prudent to perform a
TML estimation procedure that targets the parameter $\beta$ directly, as opposed
to constructing it from several independently targeted TML estimates. An
approach for constructing such an estimator is proposed in the sequel.
Let $C = \left(\sum_{\delta} h(\delta) \frac{d}{d\beta} m_{\beta}(\delta)
\frac{d}{d\beta} m_{\beta}(\delta)^t \right)$; then
$$\text{EIF}_{\beta}(O) = C^{-1} \sum_{\delta} h(\delta)
\frac{d}{d\beta} m_{\beta}(\delta)\left(Y - \overline{Q}(A,W)\right) +
C^{-1} \sum_{\delta} h(\delta) \frac{d}{d\beta} m_{\beta}(\delta)
\left(\int \overline{Q}(a,w) g_{\delta}^0(a \mid w) \, da -
\Psi_{\delta}\right).$$
Suppose a simple working MSM $\mathbb{E}Y_{g^0_{\delta}} = \beta_0 + \beta_1
\delta$; then a TML estimator targeting $\beta_0$ and $\beta_1$ may be
constructed as
$$\overline{Q}_{n, \epsilon}(A,W) = \overline{Q}_n(A,W) + \epsilon \cdot
(H_1(g), H_2(g)),$$
for all $\delta$, where $H_1(g)$ is the auxiliary covariate for $\beta_0$ and
$H_2(g)$ is the auxiliary covariate for $\beta_1$.
To construct a targeted maximum likelihood estimator that directly targets the
parameters of the working marginal structural model, we may use the
`tmle_vimshift_msm` Spec (instead of the `tmle_vimshift_delta` Spec that
appears above):
```{r vim_targeted_msm_fit}
# what's the grid of shifts we wish to consider?
delta_grid <- seq(-1, 1, 1)
# initialize a tmle specification
tmle_msm_spec <- tmle_vimshift_msm(shift_fxn = shift_additive_bounded,
shift_fxn_inv = shift_additive_bounded_inv,
shift_grid = delta_grid,
max_shifted_ratio = 2)
# fit the TML estimator and examine the results
tmle_msm_fit <- tmle3(tmle_msm_spec, data, node_list, learner_list)
tmle_msm_fit
```
## References