-
Notifications
You must be signed in to change notification settings - Fork 0
/
internal-functions-1.qmd
222 lines (161 loc) · 8.3 KB
/
internal-functions-1.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
---
title: "Internal functions"
format:
html:
code-fold: false
jupyter: julia-1.10
execute:
cache: true
keep-ipynb: true
---
## Introduction
The only functions [exported](https://docs.julialang.org/en/v1/manual/modules/#Export-lists) by [BioFindr][1] are the [`findr`](https://tmichoel.github.io/BioFindr.jl/dev/testLLR/) and [`dagfindr!`](https://lab.michoel.info/BioFindr.jl/dev/bayesiannets/) functions. Nevertheless, many of the internal functions may be useful when digging deeper in the results for specific genes. The [package documentation][6] contains detailed descriptions of all package functions, intertwined with the methods section of the [original paper][5], and should give a good overview of what is available. To illustrate how these functions can be used, we will reproduce the following figure ([Supplementary Fig. S1](https://doi.org/10.1371/journal.pcbi.1005703.s002) from the [original paper][5]):
![LLR distribution of the relevance test for hsa-miR-200b-3p on 23722 potential targets of Geuvadis dataset.](figures/eg4.png)
## Set up the environment
```{julia}
using DrWatson
quickactivate(@__DIR__)
using DataFrames
using Arrow
using StatsPlots
using LaTeXStrings
using Distributions
using BioFindr
```
## Load data
You should by now be familiar with the GEUVADIS data used in the *First steps* tutorials. Here we need the following files:
```{julia}
dt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dt.arrow")));
dm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dm.arrow")));
dgm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dgm.arrow")));
```
We also need the microRNA eQTL mapping (see the [causal inference tutorial](causal-inference.qmd)):
```{julia}
dpm = DataFrame(SNP_ID = names(dgm), GENE_ID=names(dm)[1:ncol(dgm)]);
```
Set the microRNA of interest:
```{julia}
mirA = "hsa-miR-200b-3p";
```
## Run the analysis
### Convert the data
Internally, all [BioFindr][1] functions use matrix-based inputs and [supernormalized data](https://tmichoel.github.io/BioFindr.jl/dev/inference/). The easiest way to convert our data is to run `supernormalize` on the initial data:
```{julia}
Yt = BioFindr.supernormalize(dt);
Ym = vec(BioFindr.supernormalize(select(dm,mirA)));
```
For the genotype data, no conversion is needed:
```{julia}
E = dgm[:, dpm.SNP_ID[dpm.GENE_ID.==mirA][1]];
```
We will also need the number of samples and number of genotype groups:
```{julia}
ns = length(E);
ng = length(unique(E));
```
### Compute log-likelihood ratios
Throughout the package, the [likelihood ratio tests](https://tmichoel.github.io/BioFindr.jl/dev/realLLR/) are labelled by the following [symbols](https://docs.julialang.org/en/v1/manual/metaprogramming/#Symbols)
- Test 2 (**Linkage test**): `:link`
- Test 3 (**Mediation test**): `:med`
- Test 4 (**Relevance test**): `:relev`
- Test 5 (**Pleiotropy test**): `:pleio`
Since [all log-likelihood ratios are computed from the same summary statistics](https://tmichoel.github.io/BioFindr.jl/dev/realLLR/#Implementation), a single function computes them all. To compute the log-likelihood ratios for a specific A-gene (here: hsa-miR-200b-3p with column vector of expression data `Ym`) with a causal instrument (best eQTL) with genotype vector `E`, run:
```{julia}
llr2, llr3, llr4, llr5 = BioFindr.realLLR_col(Yt, Ym, E);
```
If you know you're only going to use one of them, you can also run:
```{julia}
_ , _ , llr4 , _ = BioFindr.realLLR_col(Yt, Ym, E);
```
### Compute posterior probabilities
Posterior probabilities are computed by [fitting a mixture model](https://tmichoel.github.io/BioFindr.jl/dev/posteriorprobs/) to the observed vector of log-likelihood ratios. Two fitting methods are implmented: a [method of moments](https://tmichoel.github.io/BioFindr.jl/dev/posteriorprobs/#mom_postprobs) or using [kernel density estimation](https://tmichoel.github.io/BioFindr.jl/dev/posteriorprobs/#kde_postprobs). The method of moments is the default:
```{julia}
pp_mom, dmix = BioFindr.fit_mixdist_mom(llr4, ns, ng, :relev);
```
The KDE estimate is obtained similarly:
```{julia}
pp_kde = BioFindr.fit_mixdist_KDE(llr4, ns, ng, :relev);
```
The method of moments has a second output argument, `dmix`, a [mixture model distribution object](https://juliastats.org/Distributions.jl/stable/mixture/) where each mixture component is an [`LBeta` dsitribution](https://tmichoel.github.io/BioFindr.jl/dev/randomLLR/#Implementation):
```{julia}
dmix
```
The first component in the mixture model is the null distribution, which can also be created as follows:
```{julia}
dnull = BioFindr.nulldist(ns,ng,:relev)
```
The `prior` of the null component is the [estimated proportion of truly null features](https://tmichoel.github.io/BioFindr.jl/dev/posteriorprobs/#Estimating-P({\\mathcal-H}_{\\mathrm{null}})) in the observed log-likelihood ratio vector `llr4`:
```{julia}
π₀ = dmix.prior.p[1]
```
We can verify that both methods (moments and KDE) give similar posterior probabilities
```julia
scatter(pp_mom,pp_kde, markersize=4)
```
### Compute p-values under the null hypothesis
We don't need null p-values to reproduce the figure above, but they can be used to assess the quality of the $\pi_0$ estimate.
```{julia}
pnull = BioFindr.nullpval(llr4,ns,ng,:relev);
```
We can verify that the histogram shows the [characteristic shape of a set of anti-conservative p-values](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/) and that $\pi_0$ correctly estimates the height of the "flat" portion of the histogram near $p\approx 1$:
```{julia}
histogram(pnull, normalize=:pdf, bins=100, label="")
hline!([π₀],linewidth=2, label="", linecolor=:red)
xlims!(0,1)
xlabel!("Null p-value")
ylabel!("Observed distribution")
annotate!(0.95,0.95, (L"\pi_0", 18, :red))
```
## Reproduce the figure
### Method of moments estimates
For the method of moments, the null and real log-likelihood ratio distribution are available in the form of [distribution objects](https://juliastats.org/Distributions.jl/), and we can simply evaluate their pdfs on a range of values:
```{julia}
lval = range(0,maximum(llr4),500);
pnull_val = π₀ * pdf.(dnull,lval);
preal_val = pdf.(dmix,lval);
pp_val = 1 .- pnull_val ./ preal_val;
```
Plot the final figure:
```{julia}
histogram(llr4, normalize=:pdf, bins=100, color=:navajowhite1, label="Real data", size=(600,450))
plot!(lval,preal_val, linewidth=2, color=:black, label=L"p(LLR^{(4)})")
plot!(lval,pnull_val, linewidth=2, color=:red, label=L"\pi_0 p(LLR^{(4)} \mid \mathcal{H}_0)", legend=(0.25,0.95))
ylims!(0, 160)
xlabel!(L"LLR^{(4)}")
ylabel!(L"p(LLR^{(4)})")
plot!(twinx(),lval,pp_val, linewidth=2, color=:blue, label="", yguidefontcolor=:blue, ylims=(0,1.), ylabel=L"P(H^{(4)}_{alt} \mid LLR^{(4)})")
#ylabel(L"P(H^{(4)}_{alt} \mid LLR^{(4)})")
xlims!(0,0.03)
```
Compared to the figure at the top of the page, we see that the method of moments provides a smooth fit to the histogram and consequently also posterior probabilities that increase more smoothly with increasing LLR values.
### KDE estimates
For the KDE method, we don't have a distribution object fitting the histogram. Instead with use kernel density estimation and return estimated pdf values at every value of the LLR input vector:
```{julia}
preal_kde = BioFindr.fit_kde(llr4);
```
For plotting, we filter a relevant range of values from all vectors:
```{julia}
rg = 1:50:length(llr4);
t = sortperm(llr4);
lval_kde = llr4[t][rg];
preal_val_kde = preal_kde[t][rg];
pp_val_kde = pp_kde[t][rg];
```
And plot the figure again:
```{julia}
histogram(llr4, normalize=:pdf, bins=100, color=:navajowhite1, label="Real data", size=(600,450))
plot!(lval_kde,preal_val_kde, linewidth=2, color=:black, label=L"p(LLR^{(4)})")
plot!(lval,pnull_val, linewidth=2, color=:red, label=L"\pi_0 p(LLR^{(4)} \mid \mathcal{H}_0)", legend=(0.25,0.95))
ylims!(0, 160)
xlabel!(L"LLR^{(4)}")
ylabel!(L"p(LLR^{(4)})")
plot!(twinx(),lval_kde,pp_val_kde, linewidth=2, color=:blue, label="", yguidefontcolor=:blue, ylims=(0,1.), ylabel=L"P(H^{(4)}_{alt} \mid LLR^{(4)})")
#ylabel(L"P(H^{(4)}_{alt} \mid LLR^{(4)})")
xlims!(0,0.03)
```
[1]: https://github.com/tmichoel/BioFindr.jl
[2]: https://github.com/lingfeiwang/findr-data-geuvadis
[3]: https://doi.org/10.1038/nature12531
[4]: https://dataframes.juliadata.org/stable/
[5]: https://doi.org/10.1371/journal.pcbi.1005703
[6]: https://tmichoel.github.io/BioFindr.jl/dev/