forked from hbctraining/scRNA-seq
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathold-single_cell_rnaseq_MS.Rmd
345 lines (241 loc) · 14 KB
/
old-single_cell_rnaseq_MS.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
---
title: "Single-cell RNA-seq clustering tutorial with Seurat"
author: "Michael J. Steinbaugh"
date: "`r Sys.Date()`"
bibliography: bcbioSinglecell.bib
---
```{r setup, message=FALSE}
# Seurat tutorial with example 10X PBMC data
# http://satijalab.org/seurat/pbmc-tutorial.html
# https://github.com/satijalab/satijalab.github.io/blob/master/seurat/pbmc-tutorial.Rmd
# source("https://bioconductor.org/biocLite.R")
# biocLite(c("import", "tidyverse", "satijalab/seurat", "hbc/bcbioSinglecell"))
# Download references
bcbioSinglecell::downloads("bcbioSinglecell.bib")
library(tidyverse)
library(Seurat)
import::from(Matrix, colSums)
import::from(plyr, mapvalues)
# Knitr options
knitr::opts_chunk$set(
cache = TRUE,
cache.lazy = FALSE,
message = FALSE)
```
All features in [Seurat][] have been configured to work with both regular and sparse matrices. We prefer to use sparse matrices from the [Matrix][] package, as they result in significant memory and speed savings.
# Download example data
```{r download_file}
download.file(
file.path("https://s3-us-west-2.amazonaws.com",
"10x.files",
"samples",
"cell",
"pbmc3k",
"pbmc3k_filtered_gene_bc_matrices.tar.gz"),
destfile = "pbmc.tar.gz")
untar("pbmc.tar.gz")
```
# Cell and gene filtering
While `Setup` imposes a basic minimum gene-cutoff, you may want to filter out cells at this stage based on technical or biological parameters. [Seurat][] allows you to easily explore QC metrics and filter cells based on any user-defined criteria. `nGene` and `nUMI` are automatically calculated for every object by [Seurat][]. For non-UMI data, `nUMI` represents the sum of the non-normalized values within a cell.
Let's filter out cells that have unique gene counts below or above average. Note that `accept.high` and `accept.low` can be used to define a "gate", and can filter cells not only based on `nGene` but on anything in the object.
Initialize a new [Seurat][] object with non-normalized count data.
We recommend the following general filter criteria:
- `log normalize`, first scaling each cell to a total of `1e4` molecules (*default*) [@Drop-seq].
- Keep `genes` expressed in `>= 3 cells` (*default*).
- Keep `cells` with `>= 200 genes`.
- Keep `cells` with `<= 2500 genes`.
- Keep `cells` with `<= 5% mitochondrial percentage`.
```{r pbmc.data}
pbmc.data <- Read10X("filtered_gene_bc_matrices/hg19/")
# A regular matrix is memory inefficient.
# Save counts as a sparse matrix!
?"dgCMatrix-class"
dense.size <- object.size(as.matrix(pbmc.data))
print(paste("Regular matrix:",
format(dense.size, units = "auto")))
sparse.size <- object.size(pbmc.data)
print(paste("Sparse matrix:",
format(sparse.size, units = "auto")))
```
```{r new}
?Setup
pbmc <- new("seurat", raw.data = pbmc.data)
pbmc <- Setup(pbmc,
min.cells = 3,
min.genes = 200,
do.logNormalize = TRUE,
total.expr = 1e4,
project = "10X_PBMC")
```
Let's calculate the relative mitochondrial abundance, which is useful for filtering out low-quality cells:
```{r AddMetaData}
?AddMetaData
mito.genes <- grep("^MT-", rownames(pbmc@data), value = TRUE)
mito.ratio <-
colSums(expm1(pbmc@data[mito.genes, ])) /
colSums(expm1(pbmc@data))
pbmc <- AddMetaData(pbmc, mito.ratio, "mito.ratio")
```
Now we can subset the data using our filtering parameters:
```{r SubsetData}
?SubsetData
pbmc <- SubsetData(pbmc, subset.name = "nGene", accept.high = 2500)
pbmc <- SubsetData(pbmc, subset.name = "mito.ratio", accept.high = 0.05)
```
Violin plots are useful for visualizing single cell data:
```{r VlnPlot}
?VlnPlot
VlnPlot(pbmc, c("nGene", "nUMI", "mito.ratio"), nCol = 3)
```
`GenePlot()` is typically used to visualize gene-gene relationships, but can be used for anything calculated by the object (e.g. `seurat@data.info` columns).
```{r GenePlot}
?GenePlot
par(mfrow = c(1, 2))
GenePlot(pbmc, "nUMI", "mito.ratio")
GenePlot(pbmc, "nUMI", "nGene")
```
# Regress out unwanted sources of variation
Your single cell experiment likely contains "uninteresting" sources of variation. This could include not only technical noise, but batch effects, or even biological sources of variation (e.g. cell cycle stage). Regressing these signals out of the analysis can improve downstream dimensionality reduction and clustering [@Buettner:2015hp]. [Seurat][] implements a basic version of this by constructing linear models to predict gene expression based on user-defined variables. [Seurat][] stores the z-scored residuals of these models in the `scale.data` slot, and they are used for dimensionality reduction and clustering.
It is typically recommended to regress out cell-cell variation in gene expression driven by batch, the number of detected molecules, and mitochondrial gene expression. For cycling cells, we can also learn a "cell-cycle" score [@Drop-seq] and regress this out as well.
Here, let's regress effects on gene expression that are due to the number of genes detected (`nGene`), the total counts in the cell (`nUMI`), and the percent mitochondrial content (`mito.ratio`).
```{r RegressOut}
# CPU intensive (skip during demo)
?RegressOut
pbmc <- RegressOut(pbmc, latent.vars = c("nGene", "nUMI", "mito.ratio"))
```
# Detection of variable genes across the single cells
[Seurat][] calculates highly variable genes and focuses on these for downstream analysis. `MeanVarPlot()`, which works by calculating the average expression and dispersion for each gene, placing these genes into bins, and then calculating a z-score for dispersion within each bin. This helps control for the relationship between variability and average expression [@Drop-seq].
```{r MeanVarPlot}
?MeanVarPlot
pbmc <- MeanVarPlot(
pbmc,
fxn.x = expMean, fxn.y = logVarDivMean,
x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5,
do.contour = FALSE)
```
# Linear dimensional reduction
Run PCA on the scaled data. By default, the genes in `seurat@var.genes` are used as input, but can be defined using `pc.genes`. We have typically found that running dimensionality reduction on genes with high-dispersion can improve performance. However, with UMI data - particularly after using `RegressOut()`, we often see that PCA returns similar (albeit slower) results when run on much larger subsets of genes, including the whole transcriptome.
```{r PCA}
# CPU intensive (skip during demo)
?PCA
pbmc <- PCA(pbmc)
```
`ProjectPCA()` scores each gene in the dataset (including genes not included in the PCA) based on their correlation with the calculated components. Though we don't use this further here, it can be used to identify markers that are strongly correlated with cellular heterogeneity, but may not have passed through variable gene selection.
```{r ProjectPCA}
?ProjectPCA
pbmc <- ProjectPCA(pbmc)
```
We can visualize these genes graphically using `VizPCA()`.
```{r VizPCA}
?VizPCA
VizPCA(pbmc, 1:2)
```
Let's plot the PCA using `PCAPlot()`.
```{r PCAPlot}
?PCAPlot
PCAPlot(pbmc, 1, 2)
```
In particular, `PCHeatmap()` allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and genes are ordered according to their PCA scores. Setting `cells.use` to a number plots the "extreme" cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated gene sets.
```{r PCHeatmap, fig.height=8, fig.width=5, message=FALSE, warning=FALSE}
?PCHeatmap
PCHeatmap(pbmc, pc.use = 1:12, cells.use = 500, do.balanced = TRUE,
label.columns = FALSE, use.full = FALSE)
```
# Determine statistically significant principal components
To overcome the extensive technical noise in any single gene for scRNA-seq data, [Seurat][] clusters cells based on their PCA scores, with each PC essentially representing a 'metagene' that combines information across a correlated gene set. Determining how many PCs to include downstream is therefore an important step.
[Seurat][] implements a resampling test inspired by the `jackStraw` procedure [@Drop-seq]. We randomly permute a subset of the data (1% by default) and rerun PCA, constructing a "null distribution" of gene scores, and repeat this procedure. We identify "significant" PCs as those who have a strong enrichment of low p-value genes.
The `JackStrawPlot()` function provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). "Significant" PCs will show a strong enrichment of genes with low p-values (solid curve above the dashed line). Running this process takes a long time for big datasets. More approximate techniques such as those implemented in `PCElbowPlot()` are much faster.
```{r JackStraw, eval=FALSE}
# CPU intensive, run on Orchestra only
pbmc <- JackStraw(pbmc, num.replicate = 100, do.print = FALSE)
JackStrawPlot(pbmc, PCs = 1:12)
```
```{r PCElbowPlot}
?PCElbowPlot
PCElbowPlot(pbmc)
```
# Find clusters and visualize
Here we will use the selected PCs to find clusters of cells via heirarchical clustering, then run t-Distributed Stochastic Neighbor Embedding (t-SNE) analysis [@vanderMaaten:2008tm].
```{r FindClusters}
# CPU intensive (skip during demo)
?FindClusters
pbmc <- FindClusters(pbmc, pc.use = 1:10, resolution = 0.6, save.SNN = TRUE)
```
```{r TSNE}
# CPU intensive (skip during demo)
?RunTSNE
pbmc <- RunTSNE(pbmc, dims.use = 1:10, do.fast = TRUE)
TSNEPlot(pbmc)
```
Note that t-SNE is not PCA! The measurement of distance in a t-SNE plot is difficult to interpret, and is most helpful for the relationships of close neighbors. To better infer separation distance between the putative clusters, let's re-apply PCA.
```{r PCAPlot}
?PCAPlot
PCAPlot(pbmc)
```
# Cluster quality control
Let's look at the variance in the number of genes detected (`nGene`) and the percentage of mitochondrial gene expression (`mito.ratio`), to see if any abberant cells are clustering.
```{r qc}
?FeaturePlot
FeaturePlot(pbmc, "nGene", cols.use = c("grey","green"))
FeaturePlot(pbmc, "mito.ratio", cols.use = c("grey","red"))
```
# Find differentially expressed genes (cluster biomarkers)
Seurat can help you find markers that define clusters via differential expression. By default, it identifes positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. `FindAllMarkers()` automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.
The min.pct argument requires a gene to be detected at a minimum percentage in either of the two groups of cells, and the thresh.test argument requires a gene to be differentially expressed (on average) by some amount between the two groups. You can set both of these to 0, but with a dramatic increase in time - since this will test a large number of genes that are unlikely to be highly discriminatory. As another option to speed up these computations, max.cells.per.ident can be set. This will downsample each identity class to have no more cells than whatever this is set to. While there is generally going to be a loss in power, the speed increases can be signficiant and the most highly differentially expressed genes will likely still rise to the top.
Seurat has four tests for differential expression which can be set with the test.use parameter: ROC test ("roc"), t-test ("t"), LRT test based on zero-inflated data ("bimod", default), LRT test based on tobit-censoring models ("tobit") The ROC test returns the 'classification power' for any individual marker (ranging from 0 - random, to 1 - perfect).
```{r FindMarkers}
# CPU intensive (skip during demo)
?FindAllMarkers
pbmc.markers <- FindAllMarkers(
pbmc, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(2, avg_diff)
```
```{r markers_VlnPlot}
VlnPlot(pbmc, c("MS4A1","CD79A"))
VlnPlot(pbmc, c("NKG7","PF4"), use.raw = TRUE, y.log = TRUE)
```
```{r markers_FeaturePlot}
FeaturePlot(
pbmc, c("MS4A1", "GNLY","CD3E","CD14","FCER1A","FCGR3A", "LYZ",
"PPBP", "CD8A"),
cols.use = c("grey","blue"))
```
Heatmaps can also be a good way to examine heterogeneity within/between clusters. The `DoHeatmap()` function will generate a heatmap for given cells and genes. In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.
```{r marker_DoHeatmap, fig.height=8, fig.width=5, warning=FALSE}
?DoHeatmap
pbmc.markers %>% group_by(cluster) %>% top_n(10, avg_diff) -> top10
DoHeatmap(pbmc, genes.use = top10$gene, order.by.ident = TRUE,
slim.col.label = TRUE, remove.key = TRUE)
```
# Assign cell type identity to clusters
```{r rename_clusters}
current.cluster.ids <- c(0, 1, 2, 3, 4, 5, 6, 7)
new.cluster.ids <- c("CD4 T cells", "CD14+ Monocytes", "B cells", "CD8 T cells",
"FCGR3A+ Monocytes", "NK cells", "Dendritic cells",
"Megakaryocytes")
pbmc@ident <- mapvalues(pbmc@ident,
from = current.cluster.ids,
to = new.cluster.ids)
```
Let's plot the tSNE again with our new labels:
```{r TSNEPlot}
TSNEPlot(pbmc, do.label = TRUE, pt.size = 0.5)
```
* * *
# Methods
```{r sessionInfo}
sessionInfo()
```
# References
[bcbio-nextgen]: https://bcbio-nextgen.readthedocs.io
[bcl2fastq]: https://support.illumina.com/downloads/bcl2fastq-conversion-software-v217.html
[biomaRt]: https://bioconductor.org/packages/release/bioc/html/biomaRt.html
[Ensembl]: http://useast.ensembl.org/Drosophila_melanogaster/Info/Index
[inDrop]: http://1cell-bio.com
[Matrix]: https://cran.r-project.org/web/packages/Matrix/index.html
[Orchestra]: https://wiki.med.harvard.edu/Orchestra
[R]: https://www.r-project.org
[rapmap]: https://github.com/COMBINE-lab/RapMap
[scRNA-Seq]: http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#single-cell-rna-sequencing
[Seurat]: http://satijalab.org/seurat
[umis]: https://github.com/vals/umis