forked from raphg/Biostat-578
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathRNA-seq.Rmd
277 lines (174 loc) · 11.8 KB
/
RNA-seq.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
---
title: 'Bioinformatics for Big Omics Data: RNA-seq data analysis'
author: "Raphael Gottardo"
date: "February 11, 2015"
output:
ioslides_presentation:
fig_caption: yes
fig_retina: 1
keep_md: yes
smaller: yes
---
## Setting up some options
Let's first turn on the cache for increased performance and improved styling
```{r, cache=FALSE}
# Set some global knitr options
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=60), cache=TRUE, messages=FALSE)
```
We will be using the following packages
```{r, cache=FALSE, echo=TRUE}
library(data.table)
library(ggplot2)
library(limma)
library(edgeR)
library(GEOquery)
```
## Outline
Here we will discuss RNA-seq data analysis including normalization and differential expression. You should read the following papers:
1. Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509-1517 (2008).
2. Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
3. Anders, S. et al. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat Protoc 8, 1765-1786 (2013).
4. Lund, S. P., Nettleton, D., McCarthy, D. J. & Smyth, G. K. Detecting differential expression in RNA-sequence data using quasilikelihood with shrunken dispersion estimates. Stat Appl Genet Mol Biol 11, (2012).
5. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
7. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).
## RNA-seq workflow
<img src="Images/RNA-seq-workflow.png" width=500>
## Modeling RNA-seq data
As opposed to microarrays, sequencing leads to count data. In our case, we will try to model counts over genes (or possibly genomic intervals). Let's denote by $Y_{ij}$ the number of reads that fall withing gene $i$ in sample $j$. Then a possible discrete model is the Poisson distribution
$$
Y_{ijk}=\mathrm{Poisson}(w_{ij}N_{jk})
$$
where $w_{ij}$ is the normalized rate of emission of gene $i$ in sample $j$, and $N_{jk}$ is the total read count in lane $k$ for sample $j$. $N_{jk}$ basically accounts for sequencing depth variability from one lane to the next. This is basically the model proposed by Marioni et al. (2010). Note that we have the following constraint $\sum_i w_{ij}=1$.
## Modeling RNA-seq data
The disavantage of the Poisson distribution is that it depends on one parameter only, which can lead to a lack of fit (over-dispersion). Many groups have proposed the negative binomial as an alternative model:
In particular, Robinson and Smyth (2007,2008) proposed the following model:
$$
Y_{ijk}=\mathrm{NB}(w_{ij}N_{jk}, \phi)
$$
where $w_{ij}$, $N_{jk}$ are as defined previously and $\phi$ is the dispersion parameter. Robinson and Smyth use the following parametrization:
$$\mathbb{E}(Y_{ijk})=\mu_{ijk}\quad \mathrm{and}\quad \mathrm{Var}(Y_{ijk})=\mu_{ijk}(1+\mu_{ijk}\phi)$$
where $\mu_{ijk}=w_{ij}N_{jk}$. In this parametrization the poisson distribution can be seen as a special case when $\phi=0$.
The genewise dispersion parameters are estimated by conditional maximum likelihood, conditioning on the total count for that gene (Smyth and Verbyla, 1996). An empirical Bayes procedure is used to shrink the dispersions towards a consensus value, effectively borrowing information between genes (Robinson and Smyth, 2007).
## Estimating the dispersion parameter
As in `limma`, we would like to use an EB approach to shrink the gene-wise dispersion parameters towards a common value. Unfortunately, the NB distribution does not have a conjugate prior. Robinson and Smyth (2007) proposed to use weighted likelihood. Instead of maximizing the likelihood, they propose to maximize the following weighted likelihood
$$
WL(\phi_g)=l_g(\phi_g)+\alpha l_C(\phi_g)
$$
where $\alpha$ is the weight given to the common likelihood (i.e. $\phi$ constant across genes). Then the problem becomes the choice of the appropriate $\alpha$ value. R&S propose some strategies using some EB arguments.
## edgeR
The model described above is implemented in `edgeR`, which also provides a generalized linear model interface for modeling the mean expression value as a function of explanatory variables. This is very similar to the `limma` framework. Then inference is done using a likelihood ratio-test comparing the alternative model to the null model.
## DESeq
`DEseq` is another popular package for RNA-seq analysis, which also utilizes an NB model.
In `edgeR`, the mean variance relationship is defined as $\sigma^2=\mu+\alpha \mu^2$. DESeq differs from `edgeR` in its mean-variance relationship, and the way the dispersion parameters are estimated. As explained in Anders et al. (2013):
> edgeR moderates feature-level dispersion estimates toward a trended mean according to the dispersion-mean relationship. In contrast, DESeq takes the maximum of the individual dispersion estimates
and the dispersion-mean trend. In practice, this means DESeq is less powerful, whereas edgeR is more sensitive to outliers. Recent comparison studies have highlighted that no single method dominates another across all settings.
## Using a normal approximation?
An alternative to the Poisson and NB models would be to find a data transformation that would make the count data approximately normal. Law et al. (2014) propose to use the $\log_2(\mathrm{cpm}(\cdot))$ transformation. The cpm transformation accounts for sequencing depth variability while the $\log_2$ transformation makes the data more normal. However, as we've seen earlier, the mean-variance relationship is quadratic, and a log transformation is not going to remove this dependence. As a consequence, it would be innapropriate to use a normal linear model with constant variance (even gene-wise).
## Mean-variance trend estimation via voom
Law et al. (2014) propose to estimate the mean-variance trend, and then to use the inverse of the estimated standard deviation for each observation as weight in LIMMA.
This is done by the `voom` function in LIMMA.
Basically, `voom` fits the linear model without weights and uses the residuals of the model to estimate the weights, which are then pass onto a weighted LIMMA call for linear modeling.
Law et al. (2014) show that this approach:
1. Control the type I error rate
2. Is powerful among the methods that control the type I error rate
3. Has good FDR control
4. Is faster!
The `voom`+`limma` approach can also be used for gene set analysis, which is difficult to do with count based methods.
## RNA-seq example
Get the data for GSE45735 from GEO.
```{r, message=FALSE}
# You should make sure the directory Data/GEO exists
datadir <- "Data/GEO/"
dir.create(file.path(datadir), showWarnings = FALSE, recursive = TRUE)
# Get the data from GEO
gd <- getGEO("GSE45735", destdir = datadir)
pd <- pData(gd[[1]])
getGEOSuppFiles("GSE45735", makeDirectory=FALSE, baseDir = datadir)
```
## RNA-seq example
Clean up the data.
```{r, message=FALSE}
# Note the regular expression to grep file names
files <- list.files(path = datadir, pattern = "GSE45735_T.*.gz", full.names = TRUE)
# Read in gzip-compressed, tab-delimited files
file_list <- lapply(files, read.table, sep='\t', header=TRUE)
# Subset to only those rows where Gene contains only non-space characters
# This addresses problems with T14 file containing 28 invalid rows at end of file
file_list <- lapply(file_list, function(file_list)subset(file_list, grepl('^[^[:space:]]+$', Gene)))
# Remove duplicated rows
file_list_unique <- lapply(file_list, function(x){x<-x[!duplicated(x$Gene),];
x <- x[order(x$Gene),];
rownames(x) <- x$Gene;
x[,-1]})
```
## RNA-seq example
Create a matrix from the intersection of all genes and clean up the pData.
```{r, message=FALSE}
# Take the intersection of all genes
gene_list <- Reduce(intersect, lapply(file_list_unique, rownames))
file_list_unique <- lapply(file_list_unique, "[", gene_list,)
matrix <- as.matrix(do.call(cbind, file_list_unique))
# Clean up the pData
pd_small <- pd[!grepl("T13_Day8",pd$title),]
pd_small$Day <- sapply(strsplit(gsub(" \\[PBMC\\]", "", pd_small$title),"_"),"[",2)
pd_small$subject <- sapply(strsplit(gsub(" \\[PBMC\\]", "", pd_small$title),"_"),"[",1)
colnames(matrix) <- rownames(pd_small)
```
## RNA-seq example
Note that raw data files for sequencing experiments are available from the SRA database, which can be queried using the SRAdb package:
```{r, eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite("SRAdb")
```
The resulting files are usually very large!
## Using Voom
Let's first create an eSet we can use:
```{r}
# Note that I add one to the count
new_set <- ExpressionSet(assayData = matrix+1)
pData(new_set) <- pd_small
```
we now need to set-up our design matrix to estimate our weights:
```{r}
design <- model.matrix(~subject+Day, new_set)
new_set_voom <- voom(new_set,design = design)
```
```{r}
lm <- lmFit(new_set_voom, design)
eb <- eBayes(lm)
# Look at the other time-points
topTable(eb, coef = "DayDay1", number = 5)
```
## Using edgeR
Let's see how we would get setup for edgeR
```{r}
# We don't have the count matrix, so let's try to "estimate" the counts
counts <- ceiling(matrix*10)
dge_list <- DGEList(counts = counts, group = pd_small$Day)
dge_list <- calcNormFactors(dge_list)
design <- model.matrix(~Day+subject, pd_small)
dge_list <- estimateGLMCommonDisp(dge_list,design)
dge_list <- estimateGLMTrendedDisp(dge_list,design)
dge_list <- estimateGLMTagwiseDisp(dge_list,design)
fit <- glmFit(dge_list,design)
lrt <- glmLRT(fit,coef="DayDay1")
detags <- rownames(topTags(lrt, n=20))
```
## Using edgeR
```{r}
plotSmear(lrt, de.tags=detags)
```
## Normalization
As we have seen above, `edgeR` provides a function for estimating the normalizing factor.
The purpose of this normalization strategy is to correct for sequencing depth variability across libraries/lanes. A natural way to achieve this would be simply scale the counts by the inverse of the sum of the counts across genes (i.e. the total number of reads/lane). However, genes with extreme expression values might bias this total read number estimate. Robinson & Oshlack propose a different approach for estimating the normalizing factor, Trimmed Mean of M-values. By default, edgeR uses the TMM normalization strategy.
## Normalization - TMM
<img src="Images/TMM-motivation.png" width=800>
## Normalization - TMM
<img src="Images/TMM-formula.png" width=800>
A trimmed mean is the average after removing the upper and lower x% of the data. The TMM procedure is doubly trimmed, by trimming both the M and A values. By default, we trim the Mg values by 30% and the Ag values by 5%, but these settings can be tailored to a given experiment.
## DEseq and DEXseq
DEseq is also another popular approach for differential expression, and I will leave it up to you to try this package. One major technical advantage of RNA-seq vs microarrays is that you actually get expression level (count) data at the exon level. As such, using RNA-seq it is possible to detect differential usage of exons. This is what the DEXseq package tries to do. DEXseq models binned (exon) counts instead of gene counts.
<img src="http://genome.cshlp.org/content/22/10/2008/F1.large.jpg" width=300>
## Summary
We've explored some of the computational challenges involved in the analysis of RNA-seq data. There exists many different methods, tools and software packages for RNA-seq. I encourage you to explore some of these tools/methods. Some of these have been proposed for your final project.