Update readme and vignettes
ChristophH committed Mar 14, 2019
1 parent 0127a35 commit 3798470
Showing 6 changed files with 175 additions and 38 deletions.
13 changes: 6 additions & 7 deletions README.md
@@ -1,9 +1,7 @@
# sctransform
## R package for modeling single cell UMI expression data using regularized negative binomial regression
## R package for normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression

This package was developed by Christoph Hafemeister in [Rahul Satija's lab](https://satijalab.org/) at the New York Genome Center. A previous version of this work was used in the paper [Developmental diversification of cortical inhibitory interneurons, Nature 555, 2018](https://github.com/ChristophH/in-lineage). We are currently working on integrating the functionality of this package into [Seurat](https://satijalab.org/seurat/), an R package designed for QC, analysis, and exploration of single cell RNA-seq data.

This package is in beta status; please sanity check any results, and notify me of any issues you find.
This package was developed by Christoph Hafemeister in [Rahul Satija's lab](https://satijalab.org/) at the New York Genome Center. Core functionality of this package has been integrated into [Seurat](https://satijalab.org/seurat/), an R package designed for QC, analysis, and exploration of single-cell RNA-seq data.

## Quick start
`devtools::install_github(repo = 'ChristophH/sctransform')`
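A minimal usage sketch (illustrative only; `umi_counts` stands for a sparse genes-by-cells UMI count matrix, and argument names may differ between package versions):

```r
library('sctransform')
# fit the regularized NB regression model; Pearson residuals are returned in $y
vst_out <- sctransform::vst(umi_counts, show_progress = FALSE)
dim(vst_out$y)
```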
@@ -15,6 +13,7 @@ For usage examples see vignettes in inst/doc or use the built-in help after inst

Available vignettes:
[Variance stabilizing transformation](https://rawgit.com/ChristophH/sctransform/master/inst/doc/variance_stabilizing_transformation.html)
[Differential expression](https://rawgit.com/ChristophH/sctransform/master/inst/doc/differential_expression.html)
[Batch correction](https://rawgit.com/ChristophH/sctransform/master/inst/doc/batch_correction.html)
[Denoising](https://rawgit.com/ChristophH/sctransform/master/inst/doc/denoising.html)
[Using sctransform in Seurat](https://rawgit.com/ChristophH/sctransform/master/inst/doc/seurat.html)

## Reference
An early version of this work was used in the paper [Developmental diversification of cortical inhibitory interneurons, Nature 555, 2018](https://github.com/ChristophH/in-lineage).
8 changes: 4 additions & 4 deletions inst/doc/denoising.R → inst/doc/correcting.R
@@ -34,7 +34,7 @@ maturation_score <- pricu$lambda/max(pricu$lambda)
y_smooth <- sctransform::smooth_via_pca(vst_out$y, do_plot = TRUE)

## ------------------------------------------------------------------------
cm_denoised <- sctransform::denoise(vst_out, data = y_smooth, show_progress = FALSE)
cm_corrected <- sctransform::correct(vst_out, data = y_smooth, show_progress = FALSE)

## ---- fig.width=7, fig.height=7, out.width='100%'------------------------
goi <- c('Nes', 'Ccnd2', 'Tuba1a')
@@ -46,10 +46,10 @@ df[[2]] <- melt(t(as.matrix(vst_out$y[goi, ])), varnames = c('cell', 'gene'), va
df[[2]]$type <- 'Pearson residual'
df[[2]]$maturation_rank <- rank(maturation_score)
df[[3]] <- melt(t(as.matrix(y_smooth[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[3]]$type <- 'de-noised Pearson residual'
df[[3]]$type <- 'corrected Pearson residual'
df[[3]]$maturation_rank <- rank(maturation_score)
df[[4]] <- melt(t(as.matrix(cm_denoised[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[4]]$type <- 'de-noised UMI'
df[[4]] <- melt(t(as.matrix(cm_corrected[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[4]]$type <- 'corrected UMI'
df[[4]]$maturation_rank <- rank(maturation_score)
df <- do.call(rbind, df)
df$gene <- factor(df$gene, ordered=TRUE, levels=unique(df$gene))
20 changes: 10 additions & 10 deletions inst/doc/denoising.Rmd → inst/doc/correcting.Rmd
@@ -1,10 +1,10 @@
---
title: "Denoising"
title: "Correcting UMI counts"
author: "Christoph Hafemeister"
date: "`r Sys.Date()`"
output: html_document
vignette: >
%\VignetteIndexEntry{Denoising}
%\VignetteIndexEntry{Correcting UMI counts}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
@@ -26,10 +26,10 @@ knitr::opts_chunk$set(
old_theme <- theme_set(theme_classic(base_size=8))
```

In this vignette we show how the regression model in the variance stabilizing transformation can be used to output de-noised data. Implicitly there are two levels of smoothing/de-noising when we apply the standard workflow using `vst`. First we specify latent variables that are used in the regression model - their contribution to the overall variance in the data will be removed. Second, we usually perform dimensionality reduction, which acts like a smoothing operation. Here we show how to reverse these two operations to obtain de-noised UMI counts.
In this vignette we show how the regression model in the variance stabilizing transformation can be used to output corrected data. Implicitly, there are two levels of smoothing/de-noising when we apply the standard workflow using `vst`. First, we specify latent variables that are used in the regression model; their contribution to the overall variance in the data will be removed. Second, we usually perform dimensionality reduction, which acts like a smoothing operation. Here we show how to reverse these two operations to obtain corrected UMI counts.
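For orientation, the whole pipeline used in this vignette condenses to three calls (a sketch only; `cm` stands for the UMI count matrix loaded in the next section, and the `vst` arguments shown here are illustrative):

```r
# 1) fit the regularized NB regression model; Pearson residuals end up in vst_out$y
vst_out <- sctransform::vst(cm, latent_var = 'log_umi_per_gene', show_progress = FALSE)
# 2) smooth the residuals by projecting onto (and back from) the top principal components
y_smooth <- sctransform::smooth_via_pca(vst_out$y, do_plot = TRUE)
# 3) reverse the regression model to obtain corrected UMI counts
cm_corrected <- sctransform::correct(vst_out, data = y_smooth, show_progress = FALSE)
```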

## Load data and transform
We use data from a recent publication: [Mayer, Hafemeister, Bandler et al., Nature 2018](https://dx.doi.org/10.1038/nature25999) [(free read-only version)](http://rdcu.be/JA5l). We load a subset of the cells, namely one of the CGE E12.5 dropseq samples with contaminating cell populations removed. These cells come from a developing continuum and provide a nice example for de-noising.
We use data from a recent publication: [Mayer, Hafemeister, Bandler et al., Nature 2018](https://dx.doi.org/10.1038/nature25999) [(free read-only version)](http://rdcu.be/JA5l). We load a subset of the cells, namely one of the CGE E12.5 dropseq samples with contaminating cell populations removed. These cells come from a developing continuum and provide a nice example for de-noising and count correction.

First load the data and run variance stabilizing transformation.
```{r}
@@ -58,13 +58,13 @@ We will now smooth the Pearson residual by PCA. Internally `smooth_via_pca` perf
y_smooth <- sctransform::smooth_via_pca(vst_out$y, do_plot = TRUE)
```

The data matrix `y_smooth` is in Pearson residual space. Based on these values we can reverse the negative binomial regression model to derive UMI counts per gene. To remove the variability from the latent factor (here `log_umi_per_gene` as a proxy of sequencing depth), we can use a fixed value for all cells. The next step uses the smoothed Pearson residual and the median of all latent factors to obtain de-noised UMI counts.
The data matrix `y_smooth` is in Pearson residual space. Based on these values we can reverse the negative binomial regression model to derive UMI counts per gene. To remove the variability from the latent factor (here `log_umi_per_gene`, a proxy for sequencing depth), we can use a fixed value for all cells. The next step uses the smoothed Pearson residuals and the median of all latent factor values to obtain corrected UMI counts.

```{r}
cm_denoised <- sctransform::denoise(vst_out, data = y_smooth, show_progress = FALSE)
cm_corrected <- sctransform::correct(vst_out, data = y_smooth, show_progress = FALSE)
```
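Conceptually, the correction inverts the Pearson residual definition of the negative binomial model, with the gene means re-evaluated at the fixed (median) latent factor value. A minimal sketch of that inversion (not the package's exact implementation; `r`, `mu`, and `theta` denote the smoothed residual, model mean, and NB dispersion parameter for one gene in one cell):

```r
# Pearson residual: r = (y - mu) / sqrt(mu + mu^2 / theta)
# solving for y recovers a corrected count from a (smoothed) residual
corrected_count <- function(r, mu, theta) {
  y <- mu + r * sqrt(mu + mu^2 / theta)
  round(pmax(y, 0))  # UMI counts are non-negative integers
}
```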

To give a better idea of what the data really looks like we will plot UMI, Pearson residual, de-noised Pearson residual, and de-noised UMI counts for some key genes related to neuronal development.
To give a better idea of what the data really looks like, we will plot UMI, Pearson residual, corrected Pearson residual, and corrected UMI counts for some key genes related to neuronal development.

```{r, fig.width=7, fig.height=7, out.width='100%'}
goi <- c('Nes', 'Ccnd2', 'Tuba1a')
@@ -76,10 +76,10 @@ df[[2]] <- melt(t(as.matrix(vst_out$y[goi, ])), varnames = c('cell', 'gene'), va
df[[2]]$type <- 'Pearson residual'
df[[2]]$maturation_rank <- rank(maturation_score)
df[[3]] <- melt(t(as.matrix(y_smooth[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[3]]$type <- 'de-noised Pearson residual'
df[[3]]$type <- 'corrected Pearson residual'
df[[3]]$maturation_rank <- rank(maturation_score)
df[[4]] <- melt(t(as.matrix(cm_denoised[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[4]]$type <- 'de-noised UMI'
df[[4]] <- melt(t(as.matrix(cm_corrected[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[4]]$type <- 'corrected UMI'
df[[4]]$maturation_rank <- rank(maturation_score)
df <- do.call(rbind, df)
df$gene <- factor(df$gene, ordered=TRUE, levels=unique(df$gene))
44 changes: 44 additions & 0 deletions inst/doc/seurat.R
@@ -0,0 +1,44 @@
## ----setup, include = FALSE----------------------------------------------
library('Matrix')
library('ggplot2')
library('reshape2')
library('sctransform')
library('knitr')
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
digits = 2,
fig.width=8, fig.height=5, dpi=100, out.width = '70%'
)
library('Seurat')
#old_theme <- theme_set(theme_classic(base_size=8))

## ----eval=FALSE----------------------------------------------------------
# devtools::install_github(repo = 'ChristophH/sctransform', ref = 'develop')
# devtools::install_github(repo = 'satijalab/seurat', ref = 'release/3.0')
# library(Seurat)
# library(sctransform)

## ----load_data, warning=FALSE, message=FALSE, cache = T------------------
pbmc_data <- Read10X(data.dir = "~/Downloads/pbmc3k_filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc_data)

## ----apply_sct, warning=FALSE, message=FALSE, cache = T------------------
# Note that this single command replaces NormalizeData, ScaleData, and FindVariableFeatures.
# Transformed data will be available in the SCT assay, which is set as the default after running sctransform
pbmc <- SCTransform(object = pbmc, verbose = FALSE)

## ----pca, fig.width=5, fig.height=5, cache = T---------------------------
# These are now standard steps in the Seurat workflow for visualization and clustering
pbmc <- RunPCA(object = pbmc, verbose = FALSE)
pbmc <- RunUMAP(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindNeighbors(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindClusters(object = pbmc, verbose = FALSE)
DimPlot(object = pbmc, label = TRUE) + NoLegend()

## ----fplot, fig.width = 10, fig.height=10, cache = F---------------------
# Visualize canonical marker gene expression to help annotate the clusters
FeaturePlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("S100A4","CCR7","CD4","ISG15"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("TCL1A","FCER2","XCL1","FCGR3A"), pt.size = 0.3)

68 changes: 68 additions & 0 deletions inst/doc/seurat.Rmd
@@ -0,0 +1,68 @@
---
title: "Using sctransform in Seurat"
author: "Christoph Hafemeister & Rahul Satija"
date: '`r Sys.Date()`'
output:
html_document: default
pdf_document: default
vignette: >
%\VignetteIndexEntry{Using sctransform in Seurat}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library('Matrix')
library('ggplot2')
library('reshape2')
library('sctransform')
library('knitr')
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
digits = 2,
fig.width=8, fig.height=5, dpi=100, out.width = '70%'
)
library('Seurat')
#old_theme <- theme_set(theme_classic(base_size=8))
```

This vignette shows how to use the sctransform wrapper in Seurat.
Install sctransform and Seurat v3
```{r eval=FALSE}
devtools::install_github(repo = 'ChristophH/sctransform', ref = 'develop')
devtools::install_github(repo = 'satijalab/seurat', ref = 'release/3.0')
library(Seurat)
library(sctransform)
```
Load data and create Seurat object
```{r load_data, warning=FALSE, message=FALSE, cache = T}
pbmc_data <- Read10X(data.dir = "~/Downloads/pbmc3k_filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc_data)
```
Apply sctransform normalization
```{r apply_sct, warning=FALSE, message=FALSE, cache = T}
# Note that this single command replaces NormalizeData, ScaleData, and FindVariableFeatures.
# Transformed data will be available in the SCT assay, which is set as the default after running sctransform
pbmc <- SCTransform(object = pbmc, verbose = FALSE)
```
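For reference, the single `SCTransform` call takes the place of the standard log-normalization steps, and the results can be inspected with the usual Seurat v3 accessors (a sketch for comparison only; the commented-out calls and accessors below are the standard Seurat v3 API as we understand it):

```r
# the standard steps that the single SCTransform call replaces (shown for comparison only)
# pbmc <- NormalizeData(object = pbmc)
# pbmc <- FindVariableFeatures(object = pbmc)
# pbmc <- ScaleData(object = pbmc)

# transformed data live in the SCT assay, which is now the default
DefaultAssay(object = pbmc)
dim(GetAssayData(object = pbmc, slot = 'scale.data'))
```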
Perform dimensionality reduction by PCA and UMAP embedding
```{r pca, fig.width=5, fig.height=5, cache = T}
# These are now standard steps in the Seurat workflow for visualization and clustering
pbmc <- RunPCA(object = pbmc, verbose = FALSE)
pbmc <- RunUMAP(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindNeighbors(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindClusters(object = pbmc, verbose = FALSE)
DimPlot(object = pbmc, label = TRUE) + NoLegend()
```
Users can individually annotate clusters based on canonical markers. However, the sctransform normalization reveals sharper biological distinctions compared to the [standard Seurat workflow](https://satijalab.org/seurat/pbmc3k_tutorial.html), in a few ways:
* Clear separation of three CD8 T cell populations (naive, memory, effector), based on CD8A, GZMK, and CCL5 expression
* Clear separation of three CD4 T cell populations (naive, memory, IFN-activated), based on S100A4, CCR7, IL32, and ISG15
* Additional developmental sub-structure in the B cell cluster, based on TCL1A and FCER2
* Additional separation of NK cells into CD56dim vs. CD56bright clusters, based on XCL1 and FCGR3A
```{r fplot, fig.width = 10, fig.height=10, cache = F}
# Visualize canonical marker gene expression to help annotate the clusters
FeaturePlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("S100A4","CCR7","CD4","ISG15"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("TCL1A","FCER2","XCL1","FCGR3A"), pt.size = 0.3)
```
60 changes: 43 additions & 17 deletions vignettes/seurat.Rmd
@@ -1,12 +1,14 @@
---
title: "Using sctransform in Seurat"
author: "Christoph Hafemeister"
author: "Christoph Hafemeister & Rahul Satija"
date: '`r Sys.Date()`'
output:
html_document: default
pdf_document: default
vignette: |
%\VignetteIndexEntry{Using sctransform in Seurat} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}
vignette: >
%\VignetteIndexEntry{Using sctransform in Seurat}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
@@ -26,27 +28,51 @@ library('Seurat')
```

This vignette shows how to use the sctransform wrapper in Seurat.
Install sctransform and Seurat v3

Load data
```{r load_data}
pbmc_data <- readRDS(file = "~/Projects/data/pbmc3k_umi_counts.Rds")
class(x = pbmc_data)
dim(x = pbmc_data)
```{r eval=FALSE}
devtools::install_github(repo = 'ChristophH/sctransform', ref = 'develop')
devtools::install_github(repo = 'satijalab/seurat', ref = 'release/3.0')
library(Seurat)
library(sctransform)
```

Create Seurat object
```{r create_s, warning=FALSE}
s <- CreateSeuratObject(counts = pbmc_data)
Load data and create Seurat object

```{r load_data, warning=FALSE, message=FALSE, cache = T}
pbmc_data <- Read10X(data.dir = "~/Downloads/pbmc3k_filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc_data)
```

Apply sctransform normalization
```{r apply_sct}
s <- SCTransform(object = s, verbose = FALSE)

```{r apply_sct, warning=FALSE, message=FALSE, cache = T}
# Note that this single command replaces NormalizeData, ScaleData, and FindVariableFeatures.
# Transformed data will be available in the SCT assay, which is set as the default after running sctransform
pbmc <- SCTransform(object = pbmc, verbose = FALSE)
```

Perform dimensionality reduction by PCA and UMAP embedding
```{r pca, fig.width=5, fig.height=5}
s <- RunPCA(object = s, npcs = 20, verbose = FALSE)
s <- RunUMAP(object = s, dims = 1:20, verbose = FALSE)
DimPlot(object = s) + NoLegend()

```{r pca, fig.width=5, fig.height=5, cache = T}
# These are now standard steps in the Seurat workflow for visualization and clustering
pbmc <- RunPCA(object = pbmc, verbose = FALSE)
pbmc <- RunUMAP(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindNeighbors(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindClusters(object = pbmc, verbose = FALSE)
DimPlot(object = pbmc, label = TRUE) + NoLegend()
```

Users can individually annotate clusters based on canonical markers. However, the sctransform normalization reveals sharper biological distinctions compared to the [standard Seurat workflow](https://satijalab.org/seurat/pbmc3k_tutorial.html), in a few ways:

* Clear separation of three CD8 T cell populations (naive, memory, effector), based on CD8A, GZMK, and CCL5 expression
* Clear separation of three CD4 T cell populations (naive, memory, IFN-activated), based on S100A4, CCR7, IL32, and ISG15
* Additional developmental sub-structure in the B cell cluster, based on TCL1A and FCER2
* Additional separation of NK cells into CD56dim vs. CD56bright clusters, based on XCL1 and FCGR3A

```{r fplot, fig.width = 10, fig.height=10, cache = F}
# Visualize canonical marker gene expression to help annotate the clusters
FeaturePlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("S100A4","CCR7","CD4","ISG15"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("TCL1A","FCER2","XCL1","FCGR3A"), pt.size = 0.3)
```
