-
Notifications
You must be signed in to change notification settings - Fork 0
/
alternative_models.Rmd
397 lines (283 loc) · 24 KB
/
alternative_models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
---
title: "Part 2: Details of Alternative Models and Model Selection"
abstract: 'This is the Part 2 of the Supplemental Statistical Analysis for "Meta-analysis of genotype-phenotype associations in Bardet-Biedl Syndrome uncovers differences among causative genes". This file describes all Bayesian models we tried throughout this project. Further, we discuss the reasoning behind our choice of model for the main analysis, in particular why between-study variability is crucial and taking into account complete loss of function (cLOF) useful while age, and sex can be omitted. The complete source code for the analysis can be found at https://github.com/martinmodrak/bbs-metaanalysis-bayes or Zenodo, DOI: 10.5281/zenodo.3243264'
output:
pdf_document:
toc: true
---
```{r setup_alternative, echo=FALSE, message = FALSE, warning=FALSE}
knitr::opts_chunk$set(echo=FALSE, cache = TRUE, cache.lazy = FALSE)
library(rstan)
library(brms)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(skimr)
library(readxl)
library(here)
library(mice)
library(tidyverse)
library(tidybayes)
library(bayesplot)
library(cowplot)
source(here("data_processing.R"))
source(here("models.R"))
source(here("models_funcs.R"))
source(here("plots.R"))
```
*Note: For historical reasons the feature of "certain loss of function" (cLOF) as discussed in the data is called just "lof" in most of the analysis code. This part will thus use "lof" and "cLOF" interchangeably.*
## Model descriptions
All models are Bayesian varying intercept logistic regressions using the `brms` package. Part 1 of this supplement includes both accessible and complete mathematical description of the model we chose for the main analysis, which will not be repeated here. Generally, all of the terms in the models are varying intercepts, i.e. the model partially pool the estimates for individual groups (genes, sources, ...) towards population mean to achieve more robust inference.
```{r load_data_alternative}
data <- read_main_data()
genes_to_show <- genes_to_show_from_data(data)
data_long <- data_long_from_data(data)
```
### Base models
Base models are those that work with (a subset of) the original dataset, without any imputation. The models are defined in file `models.R`. They differ in the model formula, subsets of the dataset they use and priors for model coefficients. The syntax for formulas in `brms` is described [in `brms` manual](https://rdrr.io/cran/brms/man/brmsformula.html) and will not be explained here. The model may work on the filtered dataset - `lof` means filtered for only the mutations with certain LOF, `family`, `age`, `sex` and `age_sex` corresponds to filtering for patients with reported family, age, sex or both age and sex. Likewise `ethnic_group` and `family` filter only the data that has those values reported. The list of base models follows:
```{r}
models_base %>% walk(function(def) {
cat(def$name,":\n\tformula:", strwrap(paste(as.character(def$formula)[c(2,1,3)], collapse = " "), width = 80, prefix = "\n\t", initial = "" ), "\n\tdata filter: ", def$filter)
if(def$note != "") {
cat("\tNote:", def$note)
}
cat("\n\n")
#tibble(name = def$name, formula = as.character(def$formula)[1], filter = def$filter, note = def$note)
})
```
The default priors are $N(0,2)$ for all model coefficients (half-normal for standard deviations). Very narrow, narrow and wide put $N(0,0.1)$, $N(0,1)$ and $N(0,5)$ respectively for the sd for `gene`.
Since `family_id` is a very fine-grained predictor, its sd is given a $N(0,1)$ prior.
```{r fit_base_models, results = "hide"}
fits_base <- models_base %>%
map(function(def) {
cat(paste0("Fitting ", def$name, "\n"))
fit_base_model(def, data_long)
}
)
```
### Imputation with the mice package
Including age or sex in the base models is problematic as it involves tossing out `r round(100 * mean(is.na(data$age)))`% or `r round(100 * mean(is.na(data$sex)))`% of data, respectively. This results in wide posterior intervals and weak inferences. To try to ameliorate this we also tested running models on datasets with age and sex imputed, using multiple imputation via the `mice` package. We assume that both age and sex can be related to the functional group of the mutation and to each other. Involving further relations (e.g., individual genes) led to warnings from the `mice` package and we thus didn't use those.
```{r mice_impute, message=FALSE, echo=FALSE, results="hide"}
set.seed(20181217)
predictor_matrix <- matrix(0, nrow = 2, ncol = ncol(data))
colnames(predictor_matrix) <- names(data)
predictor_matrix[, c("functional_group")] <- 1
predictor_matrix[1, c("sex")] <- 1
predictor_matrix[2, c("age_std_for_model")] <- 1
mice_m = 5
data_mice <- mice(data, m = mice_m, blocks = list("age_std_for_model", "sex"), predictorMatrix = predictor_matrix)
if(!is.null(data_mice$loggedEvents)) {
data_mice$loggedEvents %>% select(dep, meth, out) %>% distinct()
}
data_long_mice <- mice::complete(data_mice, "all") %>% map(data_long_from_data)
```
The imputed models differ in the formulas used, but all use the default priors and do not filter the dataset in any way.
```{r}
models_imputed %>% walk(function(def) {
cat(def$name,":\n\tformula:", strwrap(paste(as.character(def$formula)[c(2,1,3)], collapse = " "), width = 80, prefix = "\n\t", initial = "" ))
if(def$note != "") {
cat("\tNote:", def$note)
}
cat("\n\n")
#tibble(name = def$name, formula = as.character(def$formula)[1], filter = def$filter, note = def$note)
})
```
```{r fit_imputed_models}
fits_imputed <- models_imputed %>% map(function(def) { fit_imputed_model(def, data_long_mice)})
all_fits <- c(fits_base, fits_imputed)
```
## Choosing the model for main analysis
In general we use *posterior predictive checks* (PPCheck) to asses model fit. PPCheck is performed by predicting posterior distribution of possible outcomes implied by the fitted model. This distribution can then be compared to what is actually observed in the data and discrepancies can be noted and used to guide model expansion/selection. In our case, we focus on the prevalence of positive phenotypes across various subdivisions of the data. Of prime interests are subdivisions *not* taken into account by a model - if the model explains groupings that were not included, it is a sign that it works well. If the model consistently misestimates groups it is not aware of, it is an indication that such a group should be involved. We use the `bayesplot` package to perform PPChecks. See [Gabry et al. 2018, *Visualisation in Bayesian workflow*](https://arxiv.org/abs/1709.01449) for a more thorough discussion of PPChecks. The following discussion is mostly informal and qualitative as we try to balance model fit, model complexity and other considerations. While the choices we make are partly subjective, Part 3 shows that our conclusions are largely robust to defensible variations in model specification.
In most models, we assume phenotype correlations because the data were selected for containing at least two phenotypes and diagnosis criteria is based on having multiple phenotypes. Both of these processes could have introduced correlations. However, the fitted correlations are not conclusive and do not increase the explanatory power of the model (models with different correlation structures are also included).
### Between-study variability needs to be included
Modeling between-study variability is simply a good practice for any meta-analysis, but PPChecks can convince us that models ignoring it do not fit the data well. Let's start by looking at overall prevalence for studies with at least 10 patients and how the most basic model (`gene_only`, taking only the gene into account) fares:
```{r}
pp_check_helper <- function(model_names, pp_check_types) {
for(name in model_names) {
run_pp_checks(all_models[[name]], all_fits[[name]], data_long, types = pp_check_types
, out_func = function(x) {
base_theme_font_size = 7
base_theme <- theme(
axis.title = element_text(size = base_theme_font_size),
axis.text = element_text(size = base_theme_font_size),
legend.title = element_text(size = base_theme_font_size + 1),
legend.text = element_text(size = base_theme_font_size),
strip.text = element_text(size = base_theme_font_size),
plot.title = element_text(size = base_theme_font_size + 2))
print(x + base_theme)
}
)
}
}
```
```{r, fig.height=4}
pp_check_helper("gene_only", "source_10")
```
Here, the bars represent actual prevalence in the data, the dots the posterior mean for those subgroups and the lines posterior 95% credible interval. Note that we clump all phenotypes together for simplicity. We see that the model is overly certain in its predictions and the posterior credible intervals frequently miss the actual counts in a study. Almost half of sources have predictions not very consistent with data.
We try to ameliorate this with a complex model without source, but including family age and sex (filtered) and filtered only for loss of function mutations. This means we use very little data and a large number of predictors.
```{r, fig.height=2, fig.width = 6}
pp_check_helper(c("gene_family_filtered_age_sex_lof"), "source_10")
```
As seen above, even a complex model with little data struggles to fit the Deveault 2011 study. The problems are only larger for less complex models (e.g. using ethnic group instead of family or ignoring some of the predictors, not shown).
Unsurprisingly, including source explicitly alleviates all of the problems with source, even when other factors are not included:
```{r, fig.height=4}
pp_check_helper("gene_source", "source_10")
```
The conclusion is that even the most complex models not including source have problems. Further, we have a good reason to believe there is between-study variability even before looking at the data, as the methodologies for diagnosis are not consistent. Including source is therefore necessary. In light of this, we consider all models not including source as "Problematic fits" to the data.
### Loss-of-function differences are of a relatively minor importance
Using the simplest model, we see that cLOF differences are slightly problematic (at the edge of model predictions) for many phenotypes, though no gross error is apparent:
```{r, fig.height=4, fig.width = 6}
pp_check_helper("gene_only", "phenotype_lof")
```
The problems almost disappear when adding a per-phenotype cLOF term to the model (i.e. for a given phenotype the effect of cLOF is assumed equal across all mutations):
```{r, fig.height=4, fig.width=6}
pp_check_helper("gene_lof", c("phenotype_lof","gene_lof"))
```
Although the above plot shows that for some genes (most notably BBS8, BBS9 and BBS12) the model has problem explaining differences between cLOF and other mutations.
Adding source as a covariate (but ignoring cLOF) ameliorates most of the problems with cLOF looking at both phenotypes and genes:
```{r, fig.height=4, fig.width = 6}
pp_check_helper("gene_source", c("phenotype_lof","gene_lof"))
```
Finally, the problem is mitigated even further when both source and cLOF are included (even when cLOF effect is not allowed to vary with gene).
```{r, fig.height=4, fig.width = 6}
pp_check_helper("gene_source_lof", c("phenotype_lof","gene_lof"))
```
Allowing the cLOF coefficient to differ per gene does not bring noticeable improvements to fit:
```{r, fig.height=4, fig.width = 6}
pp_check_helper("gene_source_lof_per_gene", c("phenotype_lof","gene_lof"))
```
The conclusion is that: a) between source-variability is important as it is able to explain a large portion of cLOF differences even when cLOF is not accounted for b) there is an improvement in including the effect of cLOF per phenotype but further improvement is not observed when the cLOF effect is allowed to vary per gene. Since cLOF is easy to include and does not make the model much more complex, effect of cLOF per phenotype should be considered.
### Between-study variability mostly explains age and sex differences
Age and sex differences do not necessarily need to be included, as they are sufficiently well explained by the `gene_source_lof` model. First let us look at overall prevalence by sex and age group:
```{r, fig.height=3}
pp_check_helper("gene_source_lof", c("sex","age"))
```
The `gene_source_lof` model does well, although for some groups the data are on the borders of predicted 95% intervals (e.g. the 0-9 and 60+ groups). Note that missing data are treated as a separate age category.
We can also look at age and sex by individual phenotypes - we collapse some of the age groups to make the age + phenotype plot readable:
```{r, fig.height=4}
pp_check_helper("gene_source_lof", c("sex_phenotype","age_phenotype"))
```
There are problems (e.g., age "40+" for REN phenotype), but no category is completely outside the predicted 95% interval. We chose not to include age and sex in the main model, because they are difficult to handle well due to high missingness - requiring either filtering or imputation, while they are mostly explained by the `gene_source_lof` model.
### Family information is of relatively minor influence
We see the `gene_source_lof` model already accounts for a huge portion of the variability in prevalence between families, although including family in the model is definitely an improvement for some families.
```{r, fig.height = 2.5, fig.width = 5}
pp_check_helper(c("gene_source_lof"), "family_4")
```
```{r, fig.height = 2.5, fig.width = 5}
pp_check_helper(c("gene_source_lof_family"), "family_4")
```
```{r, fig.height = 3}
pp_check_helper(c("gene_source_lof", "gene_source_lof_family"), c("family_3"))
```
There are few families where the `gene_source_lof` model is fitting poorly (the LEPV and LNEN families) and a small number of those where the actual data is on the border of the predicted 95% interval.
We however consider the improvements due to including family to be small, while adding family greatly increases the uncertainty of the model, as it is a very fine grained predictor. We therefore chose not to include family in the main model. As discussed in Part 3, this does not have notable impact on our conclusions.
### Understanding within- and between-family variability
We can also look for some signature of family structure in the data directly:
```{r}
data_long %>%
#Create all non-identical pairs within each study
inner_join(data_long, by = c("source" = "source", "phenotype" = "phenotype", "gene" = "gene", "loss_of_function" = "loss_of_function")) %>%
rename(functional_group = functional_group.x) %>%
filter(functional_group == "BBSome", gene != "BBS18", ID.x != ID.y) %>%
mutate(same_family = family_id.x == family_id.y) %>%
#Calculate proportion same phenotype for those in same family and different families
group_by(source, phenotype, gene, loss_of_function, same_family) %>%
summarise(n_patients = length(unique(ID.x)), proportion_same = mean(phenotype_value.x == phenotype_value.y)) %>%
#Compute the difference between same family and different family
group_by(source, phenotype, gene, loss_of_function) %>%
filter(length(unique(same_family)) == 2) %>%
summarise(same_family_diff = proportion_same[same_family] - proportion_same[!same_family], min_n_patients = min(n_patients)) %>%
filter(min_n_patients > 1) %>%
ggplot(aes(x = gene, y = same_family_diff, size = min_n_patients, weight = min_n_patients, color = gene, shape = loss_of_function)) +
geom_hline(yintercept = 0) +
geom_jitter(alpha = 0.5, height = 0, width = 0.3) +
facet_wrap(~phenotype, ncol = 3) +
scale_x_discrete(labels = bbs_labeller) +
scale_y_continuous("Difference within to between family") +
scale_size_continuous(range = c(0.5,4)) +
guides(color = FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust =0.5))
```
Here, we take all pairs of patients that are from the same family and all pairs of patients that are from the same source but have no family relationship. The pairs need to have mutation in the same gene and the same cLOF. Over each group of pairs we compute the proportion of pairs that have the same phenotype and then subtract this average for between family pairs from the average of within family pairs. I.e. the higher the number, the more are families with the same mutation homogenous in their phenotypes compared to unrelated individuals. Each point represents one study, the point size represents the size of the smaller group (family or unrelated) in the study.
We see that for most phenotypes the results are mostly positive, so family structure (and hence specific type of mutation and/or genetic background) plays some role above just knowing cLOF and the gene where the mutation occurs.
To measure the amount of variability explained by family structure, we can compute the intraclass correlation (ICC) of family for each phenotype following [Nakagawa & Schielzeth 2010, 'Repeatability for Gaussian and non‐Gaussian data: a practical guide for biologists'](https://doi.org/10.1111/j.1469-185X.2010.00141.x). For simplicity, we will use the `gene_source_lof_family` model, i.e. adding family on top of the model chosen for the main analysis. In this setting the ICC can be computed as:
$$
ICC = \frac{\sigma_{family}^2}{\sigma_{family}^2 +\sigma_{source}^2 + \sigma_{gene}^2 + \sigma_{cLOF}^2 + \frac{1}{3}\pi^2}
$$
$$
\sigma_{cLOF}^2 = Var(\beta_{cLOF} X_{cLOF})
$$
Where $\sigma_{family}$, $\sigma_{source}$ and $\sigma_{gene}$ are the variances of the respective random effects. Corresponding formulas can be devised for other model terms. The ICC can be very roughly interpreted as the proportion of total variance attributable to the grouping factor (family in this case) on the latent sacale. The estimates of ICC for the main covariates (family, source and gene) for individual phenotypes are:
```{r, fig.height= 4}
icc_model <- "gene_source_lof_family"
icc_fit <- all_fits[[icc_model]]
fixed_samples <- posterior_samples(icc_fit, pars = "loss_of")
n_clof <- filter_data_by_model_def(all_models[[icc_model]]$def, data_long) %>%
group_by(phenotype) %>%
summarise(n_clof = sum(loss_of_function_certain), count = n(), proportion_clof = mean(loss_of_function_certain)) %>%
mutate(phenotype = as.character(phenotype))
fixed_variance <- fixed_samples %>%
mutate(sample = 1:n()) %>%
gather("phenotype", "b_clof", -sample) %>%
mutate(phenotype = gsub("b_phenotype|:loss_of_function_certain", "", phenotype)) %>%
inner_join(n_clof, by = c("phenotype" = "phenotype")) %>%
mutate(var_clof = b_clof^2 * (proportion_clof * (1 - proportion_clof)))
varying_samples <- posterior_samples(icc_fit, pars = "sd_")
varying_sd <- varying_samples %>%
mutate(sample = 1:n()) %>%
gather("phenotype_family", "sd_family", sd_family_id__phenotypeRD:sd_family_id__phenotypeDD) %>%
mutate(phenotype_family = gsub("sd_family_id__phenotype", "", phenotype_family)) %>%
gather("phenotype_gene", "sd_gene", sd_gene__phenotypeRD:sd_gene__phenotypeDD) %>%
mutate(phenotype_gene = gsub("sd_gene__phenotype", "", phenotype_gene)) %>%
filter(phenotype_gene == phenotype_family) %>%
gather("phenotype_source", "sd_source", sd_source__phenotypeRD:sd_source__phenotypeDD) %>%
mutate(phenotype_source = gsub("sd_source__phenotype", "", phenotype_source)) %>%
filter(phenotype_gene == phenotype_source) %>%
rename(phenotype = phenotype_gene) %>%
select(-phenotype_source)
icc <- fixed_variance %>%
inner_join(varying_sd, by = c("sample" = "sample", "phenotype" = "phenotype")) %>%
mutate(denominator = sd_family ^ 2 + sd_source ^ 2 + sd_gene ^ 2 + var_clof + pi^2 / 3,
family = sd_family ^ 2 / denominator,
gene = sd_gene ^ 2 / denominator,
source = sd_source ^ 2 / denominator
) %>%
gather("covariate", "icc", family, gene, source)
icc %>%
mutate(phenotype = factor(phenotype, levels = phenotypes_to_use)) %>%
group_by(phenotype, covariate) %>%
summarise(`Mean ICC` = mean(icc),
`low 95%` = quantile(icc, probs = c(0.025)),
`low 50%` = quantile(icc, probs = c(0.25)),
`high 50%` = quantile(icc, probs = c(0.75)),
`high 95%` = quantile(icc, probs = c(0.975))
) %>%
ggplot(aes(x = phenotype,ymin = `low 95%`, ymax = `high 95%`)) +
geom_linerange(aes(ymin = `low 95%`, ymax = `high 95%`)) +
geom_linerange(aes(ymin = `low 50%`, ymax = `high 50%`), size = 2) +
scale_y_continuous("ICC") +
facet_wrap(~ covariate) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust =0.5))
```
Thin lines are 95% credible intervals, thick are 50% credible intervals. Note that for family, the ICC roughly corresponds to the plot above, where CI, REP and REN have the most pronounced skew towards family structure playing a notable role. However, for the other phenotypes, very low values of the ICC are consistent with the data. We also see that the ICC for family is likely smaller than for source and likely not much larger than for the gene carrying the mutation.
Due to notable between-source variability, it is hard to make strong conclusions about family structure - it is still quite possible that there is important influence of family for all phenotypes, but it is masked by the between-study variability. We also know that the `gene_source_lof_family` model does not fit much better than `gene_source_lof` and so it is hard to put strong emphasis on those results.
### Ethnicity and ethnic groups
```{r, fig.height = 3}
pp_check_helper(c("gene_source_lof", "gene_source_lof_ethnic_group"), c("ethnic_group"))
```
We see that `gene_source_lof` does a good job of fitting most ethnic groups, except for F and H where the observed data are close to the borders of the predicted 95% interval. Fit in those groups is improved upon by including ethnic group in the model. However, the misestimated ethnic groups are exactly those with the fewest patients and so are unlikely to bias the estimates in an important way:
```{r}
data %>% group_by(ethnic_group) %>% summarise(count = n()) %>% kable()
```
Finally, we see that `gene_source_lof` fits well even when looking at specific ethnicities and adding ethnic group does not really improve the picture:
```{r, fig.height = 4}
pp_check_helper(c("gene_source_lof", "gene_source_lof_ethnic_group"), c("ethnicity_10"))
```
We chose not to include ethnicity or ethnic group in the model, as it is largely explained with just between-study variability.
### Prior width
One example that the "very narrow" prior prevents the model from fitting well is that the fit cannot capture the BBS3 and Chaperonins functional groups, while "narrow" (and wider) priors don't have a problem with that:
```{r, fig.width = 6, fig.height = 1.5}
pp_check_helper(c("gene_source_very_narrow","gene_source_narrow", "gene_source"), c("functional_group"))
```
We therefore consider the `gene_source_very_narrow` model as a problematic fit. Otherwise we didn't find a good reason to prefer either of the "narrow", normal and "wide" priors and this choice seems to be of little consequence for model inferences (as discussed in Part 3).
### Model selection verdict
We have shown that source (between-study variability) has to be included and there is an advantage in including cLOF per phenotype, but not much improvement when including cLOF per phenotype and gene. Since the `gene_source_lof_per_gene` model is too flexible for the limited amount of data (results in very wide posterior intervals, spanning odds ratio up to 10000), we think `gene_source_lof` is a better choice. We have further shown that adding family or ethnicity information improves the fit only a little while making the model more complex and will therefore not be included for the main analysis.