-
Notifications
You must be signed in to change notification settings - Fork 0
/
coexpression.qmd
115 lines (80 loc) · 6.73 KB
/
coexpression.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
---
title: "Coexpression analysis"
format:
html: default
ipynb: default
jupyter: julia-1.10
execute:
cache: true
keep-ipynb: true
---
## Introduction
While [BioFindr] is developed primarily for causal inference from genomics and transcriptomics data, coexpression analysis of transcriptomics data alone is also possible. In coexpression analysis, pairwise correlation is used as a simple measure for the probability of two genes being functionally related either through direct or indirect regulation, or through coregulation by a third factor. In [BioFindr][1], significance of pairwise correlation is computed using a gene-specific background distribution, allowing for genes having different biological roles. For instance, it is known that many biological networks are [scale-free](https://en.wikipedia.org/wiki/Scale-free_network), where a small number of so-called "hub genes" have a high number of interaction partners while most other genes only have few. In [BioFindr][1], this is accomodated by modelling the distribution of correlation values between a given gene $A$ and all other genes $B$ as a [mixture distribution](https://en.wikipedia.org/wiki/Mixture_distribution) of real and null (random) correlations. The relative weight of each component reflects the prior probability of finding a non-null $B$ gene for a given $A$ gene, and is fitted for every $A$ gene separately.
We will illustrate how to run coexpression analysis with [BioFindr][1] using [preprocessed data][2] from the [GEUVADIS study][3]. See the [installation instructions](installation.qmd) for the steps you need to take to reproduce this tutorial.
## Set up the environment
We begin by setting up the environment and loading some necessary packages.
```{julia}
using DrWatson
quickactivate(@__DIR__)
using DataFrames
using Arrow
using Markdown
using BioFindr
```
## Load expression data
[BioFindr][1] expects that expression data are stored in a [DataFrame][4] where columns correspond to variables (genes) and rows to samples. An expression [DataFrame][4] should not contain any other columns (e.g. gene annotation) than gene expression columns, and gene expression data should be stored as [floating-point numbers](https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/). Internally, [BioFindr][1] operates on matrices, and if you have an expression [DataFrame][4] `df`, then `Matrix(df)` should return a matrix of floats.
At the moment, [BioFindr][1] does not support count-based expression data being provided as a [DataFrame][4] of integers. This is not an intrinsic limitation of the method, but simply to distinguish expression data from integer-valued genotype data. Future versions will remove this limitation by supporting [scientific types](https://juliaai.github.io/ScientificTypes.jl/dev/).
This tutorial uses two tables of expression data from the same set of samples, one for mRNA expression data called `dt`, and one for microRNA (miRNA) expression data called `dm`:
```{julia}
dt = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dt.arrow")));
dm = DataFrame(Arrow.Table(datadir("exp_pro","findr-data-geuvadis", "dm.arrow")));
```
```{julia}
#| echo: false
nt1, nt2 = size(dt)
nm1, nm2 = size(dm)
Markdown.parse("""
The mRNA data has expression data for $nt2 genes in $nt1 samples, while miRNA data is available for $nm2 miRNAs in the same $nm1 samples.
""")
```
We can confirm that the data frames are of the right type:
```{julia}
[typeof(Matrix(dt)) typeof(Matrix(dm))]
```
## Run BioFindr
### All-vs-all coexpression analysis
The simplest type of coexpression analysis tests for non-zero correlation among all possible pairs in a gene expression dataset. Let's do this for the miRNA data:
```{julia}
dP_mir_all = findr(dm, FDR=0.05)
```
BioFindr computes a [posterior probability](https://tmichoel.github.io/BioFindr.jl/dev/posteriorprobs/) of non-zero correlation for every **Source** and **Target** gene pair. By default the output is sorted by decreasing **Probability** and self-interactions are excluded. The optional parameter **FDR** can be used to limit the output to the set of pairs that has a [global false discovery rate (FDR)](https://en.wikipedia.org/wiki/False_discovery_rate#Storey-Tibshirani_procedure) less than a desired value. The **qvalue** column in the output can be used for further filtering of the output. Say you ran findr with an FRD threshold of 5% as above. If you now want to restrict the output to an FDR threshold of 1%, you can do:
```{julia}
filter!(row -> row.qvalue <= 0.01, dP_mir_all)
```
Note that the [`filter!`](https://dataframes.juliadata.org/stable/lib/functions/#Base.filter!) command modifies the input DataFrame in-place, that is, the rows not matching the selection criteria are deleted. Use [`filter`](https://dataframes.juliadata.org/stable/lib/functions/#Base.filter) to return a new DataFrame with the selected rows.
Finally, remember that the output of coexpression analysis in BioFindr is *not* symmetric, that is
$$
P(Source, Target) \neq P(Target, Source)
$$
This is because the posterior probabilities are estimated using a Source-specific background distribution, accounting for the fact that different genes may have a different number of non-null interaction partners *a priori*. See the [Findr paper][5] for details.
### Bipartite coexpression analysis
Since BioFindr's posterior probabilities are Source gene-specific, we may be interested in computing probabilities only for a subset of Source genes, or using different Source and Target gene sets.
As an example of the first scenario, assume we are interested in finding microRNAs that are significantly correlated with microRNAs from the [Mir-200 family](https://en.wikipedia.org/wiki/Mir-200). First find the Mir-200 microRNAs:
```{julia}
mir200 = names(dm)[startswith.(names(dm),"hsa-miR-200")]
```
Then run
```{julia}
dP_mir200_mir = findr(dm, colnames=mir200, FDR=0.01)
```
The parameter **colnames** must be a vector of strings containing a subset of variable names of the input DataFrame **dm** to be used as **Source** genes.
As an example of the second scenario, we may be interested in finding genes that are significantly correlated with all or a subset of microRNAs:
```{julia}
dP_mir_mrna = findr(dt, dm, FDR=0.01)
```
Note the order of the arguments: here we tested all microRNAs as $A$ or Source genes (`dm` argument) against all mRNA transcripts as $B$ or Target genes (`dt` argument), that is, background distributions are fitted for each microRNA (column of `dm`) from the log-likelihood ratios for all 23,722 mRNAs (columns of `dt`).
[1]: https://github.com/tmichoel/BioFindr.jl
[2]: https://github.com/lingfeiwang/findr-data-geuvadis
[3]: https://doi.org/10.1038/nature12531
[4]: https://dataframes.juliadata.org/stable/
[5]: https://doi.org/10.1371/journal.pcbi.1005703