forked from raphg/Biostat-578
-
Notifications
You must be signed in to change notification settings - Fork 0
/
GSEA.Rmd
337 lines (234 loc) · 12.2 KB
/
GSEA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
---
title: 'Bioinformatics for Big Omics Data: RNA-seq data analysis'
author: "Raphael Gottardo"
date: "February 12, 2015"
output:
ioslides_presentation:
fig_caption: yes
fig_retina: 1
keep_md: yes
smaller: yes
---
## Setting up some options
Let's first turn on the cache for increased performance and improved styling
```{r, cache=FALSE}
# Set some global knitr options
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=60), cache=TRUE, messages=FALSE)
```
We will be using the folliwng packages
```{r, cache=FALSE, echo=TRUE}
library(limma)
library(GEOquery)
```
## Going from genes to gene sets
So far we have seen how to use microarrays or RNA-seq to derive a list of significantly differentially expressed genes, while controlling for false discovery.
Sometimes it can be comvenient to look at biological pathways, or more generally genesets to gain biological insights.
## Goals of GSEA
Detecting changes in gene expression datasets can be hard due to
- the large number of genes/probes,
- the high variability between samples, and
- the limited number of samples.
The goal of GSEA is to enable the detection of modest but coordinated
changes in prespecified sets of related genes. Such a set might include all the genes in a specific pathway,
for instance, or genes that have been shown to be coregulated based on previously published studies.
Most of what I'll be discussing here will be based on this paper:
- Wu, D. & Smyth, G. Camera: a competitive gene set test accounting for inter-gene correlation. 40, e133-e133 (2012).
## Competitive vs self-contained analyses
As explained in Wu & Smyth (2012)
Two approaches can be used to test the significance of a gene set:
1) ‘self-contained’ gene set tests examine a set of genes in their own right without reference to other genes in the genome (or array)
2) ‘competitive’ gene set tests compare genes in the test set relative to all other genes.
Competitive tests focus more on distinguishing the most important biological processes from those that are less important. Competitive tests are overwhelmingly more commonly used in the genomic literature. In particular, this is the approach used in the GSEA paper:
- Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545-15550 (2005).
## Accounting for within set correlation
Most competitive gene set tests assume independence of genes, because they evaluate P-values by permutation of gene labels. However, these tests can be sensitive to inter-gene correlations.
In this lecture we will talk about geneset analysis using the approach of Wu and Smyth that accounts for inter-gene correlations.
## CAMERA
Camera, for competitive gene set test accounting for inter-gene correlation, makes heavy used of the limma framework. The same linear model is assumed for the mean gene expression level, namely,
$$ \mathbb{E}(\mathbf{y}_g)=\mathbf{X}\boldsymbol{\alpha}_g$$
and we will also write $\mathrm{cov}(y_{gi},y_{g'i})=\rho_{gg'}$
Note that this correlation is the residual treatment effect, once any treatment effect has been removed.
As with `limma`, we assumed that a specific contrast is of interest:
$$\beta_g=\sum_{j=1}^p c_j \alpha_{gj}$$
and we wish to test $H_0: \beta_g=0$, which can be done using the moderated $t$ statistics, $\tilde{t}$ that follows a t-statistics with $d+d_0$ degrees of freedom.
## CAMERA
Then W&S define a normalized version $z_g =F^{-1}F_t(\tilde{t}_g)$
**What is the distribution of $z_g$?**
Consider a set of m genewise statistics $z_1,\dots , z_m$. The variance of the mean of the statistics is
$$ \mathrm{var}(\overline{z})=1/m^2(\sum_{i} \tau_i^2 + 2\sum_{i\lt j}\tau_i\tau_j)$$
where $\tau_i$ is the standard deviation of $z_i$ and the $\rho_{ij}$ are the pairwise correlations. We can rewrite it as follows when all $\tau_i$'s are equal
$$\mathrm{var}(\overline{z})=\tau^2/m \mathrm{VIF}$$
where VIF is the variance inflation factor $1+(m-1)\overline{\rho}$.
## CAMERA - Testing
Now the idea of competitive gene-set analysis can be done by comparing two mean set statistics $\overline{z}_1$ and $\overline{z}_2$. Where $z_1$ is our set of interest and $z_2$ is the set composed of all other genes. The main idea behind CAMERA is to form a test-statistics that incorporates dependence between the genes in the first set.
This can simply be done by forming the following T-statistics
$$
T=(\overline{z}_1-\overline{z}_2)/(s_p\sqrt{VIF/m_1+1/m_2})
$$
Now we just need a way to estimate the inter-gene correlation. W&S also propose a
modified version of the wilcoxon-rank-sum test, which we won't discuss here.
## CAMERA - Estimating covariances
Let's consider the QR decomposition of the design matrix $X=QR$, where $R$ is upper triangular ($n\times p$) and $Q$ ($n\times n$) is such that $Q^TQ=I$. An $m\times n$ matrix of independent residuals is obtained by $U = YQ_2$, where $Q_2$ represents the trailing $d$ columns of $Q$. The matrix $U$ is already available as a by-product of fitting genewise linear models to the expression values using standard numerical algorithms, so extracting it requires no extra computation in `limma`.
The residual standard error $s_g$ for gene $g$ is equal to the root mean square of the corresponding row of $U$. We standardize each row of $U$ by dividing by $s_g$.
At this point, we could obtain the correlation matrix for the $m$ genes from $C = UU^T$; however, this is a numerically inefficient procedure if $m$ is large.
A numerically superior algorithm is to compute the column means $\overline{u}_{\cdot k}$ of $U$. Then we can form the following estimate of the $VIF$
$$
\widehat{\mathrm{VIF}}=\frac{m}{d}\sum_d \overline{u}^2_{\cdot k}
$$
If $m$ and $d$ are both reasonably large, and $\overline{\rho}$ is relatively small, then VIF is approximately distributed as VIF $\chi^2_d/d$. This can be used to find an asymptotic distribution for our test statistics.
## Using CAMERA
Camera is readily available in the `limma` package. Let us go back to our RNA-seq example:
```{r get-data, message=FALSE}
datadir <- "Data/GEO/"
dir.create(file.path(datadir), showWarnings = FALSE, recursive = TRUE)
gd <- getGEO("GSE45735", destdir = datadir)
pd <- pData(gd[[1]])
getGEOSuppFiles("GSE45735", makeDirectory=FALSE, baseDir = datadir)
# Note the regular expression to grep file names
files <- list.files(path = datadir, pattern = "GSE45735_T.*.gz", full.names = TRUE)
# Read in gzip-compressed, tab-delimited files
file_list <- lapply(files, read.table, sep='\t', header=TRUE)
```
## Using CAMERA
```{r}
# Subset to only those rows where Gene contains only non-space characters
# This addresses problems with T14 file containing 28 invalid rows at end of file
file_list <- lapply(file_list, function(file_list)subset(file_list, grepl('^[^[:space:]]+$', Gene)))
# Remove duplicated rows
file_list_unique <- lapply(file_list, function(x){x<-x[!duplicated(x$Gene),];
x <- x[order(x$Gene),];
rownames(x) <- x$Gene;
x[,-1]})
# Take the intersection of all genes
gene_list <- Reduce(intersect, lapply(file_list_unique, rownames))
file_list_unique <- lapply(file_list_unique, "[", gene_list,)
matrix <- as.matrix(do.call(cbind, file_list_unique))
# Clean up the pData
pd_small <- pd[!grepl("T13_Day8",pd$title),]
pd_small$Day <- sapply(strsplit(gsub(" \\[PBMC\\]", "", pd_small$title),"_"),"[",2)
pd_small$subject <- sapply(strsplit(gsub(" \\[PBMC\\]", "", pd_small$title),"_"),"[",1)
colnames(matrix) <- rownames(pd_small)
```
## CAMERA - Hands on
```{r expression-set, dependson="get-data"}
# Note that I add one to the count
new_set <- ExpressionSet(assayData = matrix+1)
pData(new_set) <- pd_small
```
we now need to set-up our design matrix to estimate our weights:
```{r, dependson="get-data;expression-set"}
design <- model.matrix(~subject+Day, new_set)
new_set_voom <- voom(new_set,design = design)
```
## CAMERA - Hands on
```{r}
lm <- lmFit(new_set_voom, design)
eb <- eBayes(lm)
# Look at the other time-points
topTable(eb, coef = "DayDay1", number = 5)
```
## MSigDB
The [Molecular Signatures Database](http://www.broadinstitute.org/gsea/msigdb/index.jsp) (MSigDB) is a collection of annotated gene sets for use with GSEA analysis.
These gene sets are available from download as gmt files, and can be read into R using `GSEAbase`.
Let's first download and install the package that we need
```{r, eval=FALSE}
library(BiocInstaller)
biocLite("GSEABase")
```
note that `camera` is available as part of `limma`, so there is nothing to install.
## Getting started with GSEA analyses
We load the `GSEAbase` package for loading gene sets.
```{r}
library(GSEABase)
```
and convert the gene sets to gene indices
```{r, cache=TRUE}
c2_set <- getGmt("GSEA-sets/c2.all.v4.0.symbols.gmt")
gene_ids <- geneIds(c2_set)
# Camera requires gene-indices.
# Which function to use will depend on which version of limma you have.
# http://bioconductor.org/packages/release/bioc/news/limma/NEWS
# "symbols2indices() renamed to ids2indices()."
library(limma)
if (exists("ids2indices")) {
sets_indices <- ids2indices(gene_ids, rownames(new_set))
}
if (exists("symbols2indices")) {
sets_indices <- symbols2indices(gene_ids, rownames(new_set))
}
```
## Finding enriched gene sets
As with `limma`, we need to specify the contrast we wish to test at the set level:
```{r}
# Note that camera works on voom objects
cont_matrix <- makeContrasts("DayDay1", levels=design)
res <- camera(new_set_voom, sets_indices, design=design, cont_matrix)
res[1:10, ]
```
## Finding enriched gene sets over time
```{r}
res <- vector("list",length = 10)
for(i in 1:10)
{
contrast <- paste0("DayDay",i)
cont_matrix <- makeContrasts(contrast, levels=design)
res[[i]] <- camera(new_set_voom, sets_indices, design=design, contrast=cont_matrix, sort=FALSE)
}
```
## Visualizing the results
```{r}
library(pheatmap)
PValue <- sapply(res, function(x){ifelse(x$Direction=="Up", -10*log10(x$PValue), 10*log10(x$PValue))})
rownames(PValue) <- rownames(res[[1]])
PValue_max <- rowMax(abs(PValue))
PValue_small <- PValue[PValue_max>30, ]
anno <- data.frame(Time=paste0("Day",1:10))
rownames(anno) <- colnames(PValue_small) <- paste0("Day",1:10)
```
## Visualizing the results
```{r, echo=FALSE, fig.width=8, fig.height=5.5}
pheatmap(PValue_small, cluster_cols=FALSE,fontsize_row = 5)
```
## Using non MSigDB gene_sets
Any genesets can be used for a GSEA analysis. For example, we can use the sets published in:
Li, S. et al. Molecular signatures of antibody responses derived from a systems biology study of five human vaccines. Nat. Immunol. 15, 195–204 (2013).
```{r, cache=TRUE}
BTM_set <- getGmt("GSEA-sets/BTM_for_GSEA_20131008.gmt")
gene_ids <- geneIds(BTM_set)
# Camera requires gene-indices
if (exists("ids2indices")) {
sets_indices <- ids2indices(gene_ids, rownames(new_set))
}
if (exists("symbols2indices")) {
sets_indices <- symbols2indices(gene_ids, rownames(new_set))
}
```
```{r}
res <- vector("list",length = 10)
for(i in 1:10)
{
contrast <- paste0("DayDay",i)
cont_matrix <- makeContrasts(contrast, levels=design)
res[[i]] <- camera(new_set_voom, sets_indices, design=design, contrast=cont_matrix, sort=FALSE)
}
```
## Visualizing the results
```{r}
PValue <- sapply(res, function(x){ifelse(x$Direction=="Up", -10*log10(x$PValue), 10*log10(x$PValue))})
rownames(PValue) <- rownames(res[[1]])
PValue_max <- rowMax(abs(PValue))
PValue_small <- PValue[PValue_max>30, ]
anno <- data.frame(Time=paste0("Day",1:10))
rownames(anno) <- colnames(PValue_small) <- paste0("Day",1:10)
```
## Visualizing the results
```{r}
pheatmap(PValue_small, cluster_cols=FALSE)
```
## Conclusion
- Gene sets provide a convenient way to summarize gene activities over known pathways, which facilitate biological interpretation
- Gene set analysis are complementary to gene based analyses, as gene sets might masked gene level signal
- You are not limited to using the predefined gene sets.
- Specific applications might require specific gene sets (e.g. Immune gene sets).
- The GSEA gmt format provides a convenient way to do this.