Update readme and vignettes
ChristophH committed Mar 14, 2019
1 parent 0127a35 commit 3798470
Showing 6 changed files with 175 additions and 38 deletions.
13 changes: 6 additions & 7 deletions README.md
@@ -1,9 +1,7 @@
# sctransform
## R package for modeling single cell UMI expression data using regularized negative binomial regression
## R package for normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression

This package was developed by Christoph Hafemeister in [Rahul Satija's lab](https://satijalab.org/) at the New York Genome Center. A previous version of this work was used in the paper [Developmental diversification of cortical inhibitory interneurons, Nature 555, 2018](https://github.com/ChristophH/in-lineage). We are currently working on integrating the functionality of this package into [Seurat](https://satijalab.org/seurat/), an R package designed for QC, analysis, and exploration of single cell RNA-seq data.

This package is in beta status; please sanity check any results, and notify me of any issues you find.
This package was developed by Christoph Hafemeister in [Rahul Satija's lab](https://satijalab.org/) at the New York Genome Center. Core functionality of this package has been integrated into [Seurat](https://satijalab.org/seurat/), an R package designed for QC, analysis, and exploration of single-cell RNA-seq data.

## Quick start
`devtools::install_github(repo = 'ChristophH/sctransform')`
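A minimal usage sketch (illustrative only; `umi_counts` stands for a sparse genes-by-cells UMI count matrix, and argument names may differ between package versions):

```r
library('sctransform')
# fit the regularized NB regression model; Pearson residuals are returned in $y
vst_out <- sctransform::vst(umi_counts, show_progress = FALSE)
dim(vst_out$y)
```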
@@ -15,6 +13,7 @@ For usage examples see vignettes in inst/doc or use the built-in help after inst

Available vignettes:
[Variance stabilizing transformation](https://rawgit.com/ChristophH/sctransform/master/inst/doc/variance_stabilizing_transformation.html)
[Differential expression](https://rawgit.com/ChristophH/sctransform/master/inst/doc/differential_expression.html)
[Batch correction](https://rawgit.com/ChristophH/sctransform/master/inst/doc/batch_correction.html)
[Denoising](https://rawgit.com/ChristophH/sctransform/master/inst/doc/denoising.html)
[Using sctransform in Seurat](https://rawgit.com/ChristophH/sctransform/master/inst/doc/seurat.html)

## Reference
An early version of this work was used in the paper [Developmental diversification of cortical inhibitory interneurons, Nature 555, 2018](https://github.com/ChristophH/in-lineage).
8 changes: 4 additions & 4 deletions inst/doc/denoising.R → inst/doc/correcting.R
@@ -34,7 +34,7 @@ maturation_score <- pricu$lambda/max(pricu$lambda)
y_smooth <- sctransform::smooth_via_pca(vst_out$y, do_plot = TRUE)

## ------------------------------------------------------------------------
cm_denoised <- sctransform::denoise(vst_out, data = y_smooth, show_progress = FALSE)
cm_corrected <- sctransform::correct(vst_out, data = y_smooth, show_progress = FALSE)

## ---- fig.width=7, fig.height=7, out.width='100%'------------------------
goi <- c('Nes', 'Ccnd2', 'Tuba1a')
@@ -46,10 +46,10 @@ df[[2]] <- melt(t(as.matrix(vst_out$y[goi, ])), varnames = c('cell', 'gene'), va
df[[2]]$type <- 'Pearson residual'
df[[2]]$maturation_rank <- rank(maturation_score)
df[[3]] <- melt(t(as.matrix(y_smooth[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[3]]$type <- 'de-noised Pearson residual'
df[[3]]$type <- 'corrected Pearson residual'
df[[3]]$maturation_rank <- rank(maturation_score)
df[[4]] <- melt(t(as.matrix(cm_denoised[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[4]]$type <- 'de-noised UMI'
df[[4]] <- melt(t(as.matrix(cm_corrected[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[4]]$type <- 'corrected UMI'
df[[4]]$maturation_rank <- rank(maturation_score)
df <- do.call(rbind, df)
df$gene <- factor(df$gene, ordered=TRUE, levels=unique(df$gene))
20 changes: 10 additions & 10 deletions inst/doc/denoising.Rmd → inst/doc/correcting.Rmd
@@ -1,10 +1,10 @@
---
title: "Denoising"
title: "Correcting UMI counts"
author: "Christoph Hafemeister"
date: "`r Sys.Date()`"
output: html_document
vignette: >
%\VignetteIndexEntry{Denoising}
%\VignetteIndexEntry{Correcting UMI counts}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
@@ -26,10 +26,10 @@ knitr::opts_chunk$set(
old_theme <- theme_set(theme_classic(base_size=8))
```

In this vignette we show how the regression model in the variance stabilizing transformation can be used to output de-noised data. Implicitly there are two levels of smoothing/de-noising when we apply the standard workflow using `vst`. First we specify latent variables that are used in the regression model - their contribution to the overall variance in the data will be removed. Second, we usually perform dimensionality reduction, which acts like a smoothing operation. Here we show how to reverse these two operations to obtain de-noised UMI counts.
In this vignette we show how the regression model in the variance stabilizing transformation can be used to output corrected data. Implicitly, there are two levels of smoothing/de-noising when we apply the standard workflow using `vst`. First, we specify latent variables that are used in the regression model; their contribution to the overall variance in the data will be removed. Second, we usually perform dimensionality reduction, which acts like a smoothing operation. Here we show how to reverse these two operations to obtain corrected UMI counts.
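For orientation, the whole pipeline used in this vignette condenses to three calls (a sketch only; `cm` stands for the UMI count matrix loaded in the next section, and the `vst` arguments shown here are illustrative):

```r
# 1) fit the regularized NB regression model; Pearson residuals end up in vst_out$y
vst_out <- sctransform::vst(cm, latent_var = 'log_umi_per_gene', show_progress = FALSE)
# 2) smooth the residuals by projecting onto (and back from) the top principal components
y_smooth <- sctransform::smooth_via_pca(vst_out$y, do_plot = TRUE)
# 3) reverse the regression model to obtain corrected UMI counts
cm_corrected <- sctransform::correct(vst_out, data = y_smooth, show_progress = FALSE)
```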

## Load data and transform
We use data from a recent publication: [Mayer, Hafemeister, Bandler et al., Nature 2018](https://dx.doi.org/10.1038/nature25999) [(free read-only version)](http://rdcu.be/JA5l). We load a subset of the cells, namely one of the CGE E12.5 dropseq samples with contaminating cell populations removed. These cells come from a developing continuum and provide a nice example for de-noising.
We use data from a recent publication: [Mayer, Hafemeister, Bandler et al., Nature 2018](https://dx.doi.org/10.1038/nature25999) [(free read-only version)](http://rdcu.be/JA5l). We load a subset of the cells, namely one of the CGE E12.5 dropseq samples with contaminating cell populations removed. These cells come from a developing continuum and provide a nice example for de-noising and count correction.

First load the data and run variance stabilizing transformation.
```{r}
@@ -58,13 +58,13 @@ We will now smooth the Pearson residual by PCA. Internally `smooth_via_pca` perf
y_smooth <- sctransform::smooth_via_pca(vst_out$y, do_plot = TRUE)
```

The data matrix `y_smooth` is in Pearson residual space. Based on these values we can reverse the negative binomial regression model to derive UMI counts per gene. To remove the variability from the latent factor (here `log_umi_per_gene` as a proxy of sequencing depth), we can use a fixed value for all cells. The next step uses the smoothed Pearson residual and the median of all latent factors to obtain de-noised UMI counts.
The data matrix `y_smooth` is in Pearson residual space. Based on these values we can reverse the negative binomial regression model to derive UMI counts per gene. To remove the variability from the latent factor (here `log_umi_per_gene`, a proxy for sequencing depth), we can use a fixed value for all cells. The next step uses the smoothed Pearson residuals and the median of all latent factor values to obtain corrected UMI counts.

```{r}
cm_denoised <- sctransform::denoise(vst_out, data = y_smooth, show_progress = FALSE)
cm_corrected <- sctransform::correct(vst_out, data = y_smooth, show_progress = FALSE)
```
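Conceptually, the correction inverts the Pearson residual definition of the negative binomial model, with the gene means re-evaluated at the fixed (median) latent factor value. A minimal sketch of that inversion (not the package's exact implementation; `r`, `mu`, and `theta` denote the smoothed residual, model mean, and NB dispersion parameter for one gene in one cell):

```r
# Pearson residual: r = (y - mu) / sqrt(mu + mu^2 / theta)
# solving for y recovers a corrected count from a (smoothed) residual
corrected_count <- function(r, mu, theta) {
  y <- mu + r * sqrt(mu + mu^2 / theta)
  round(pmax(y, 0))  # UMI counts are non-negative integers
}
```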

To give a better idea of what the data really looks like we will plot UMI, Pearson residual, de-noised Pearson residual, and de-noised UMI counts for some key genes related to neuronal development.
To give a better idea of what the data really looks like, we will plot UMI, Pearson residual, corrected Pearson residual, and corrected UMI counts for some key genes related to neuronal development.

```{r, fig.width=7, fig.height=7, out.width='100%'}
goi <- c('Nes', 'Ccnd2', 'Tuba1a')
@@ -76,10 +76,10 @@ df[[2]] <- melt(t(as.matrix(vst_out$y[goi, ])), varnames = c('cell', 'gene'), va
df[[2]]$type <- 'Pearson residual'
df[[2]]$maturation_rank <- rank(maturation_score)
df[[3]] <- melt(t(as.matrix(y_smooth[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[3]]$type <- 'de-noised Pearson residual'
df[[3]]$type <- 'corrected Pearson residual'
df[[3]]$maturation_rank <- rank(maturation_score)
df[[4]] <- melt(t(as.matrix(cm_denoised[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[4]]$type <- 'de-noised UMI'
df[[4]] <- melt(t(as.matrix(cm_corrected[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[4]]$type <- 'corrected UMI'
df[[4]]$maturation_rank <- rank(maturation_score)
df <- do.call(rbind, df)
df$gene <- factor(df$gene, ordered=TRUE, levels=unique(df$gene))
44 changes: 44 additions & 0 deletions inst/doc/seurat.R
@@ -0,0 +1,44 @@
## ----setup, include = FALSE----------------------------------------------
library('Matrix')
library('ggplot2')
library('reshape2')
library('sctransform')
library('knitr')
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
digits = 2,
fig.width=8, fig.height=5, dpi=100, out.width = '70%'
)
library('Seurat')
#old_theme <- theme_set(theme_classic(base_size=8))

## ----eval=FALSE----------------------------------------------------------
# devtools::install_github(repo = 'ChristophH/sctransform', ref = 'develop')
# devtools::install_github(repo = 'satijalab/seurat', ref = 'release/3.0')
# library(Seurat)
# library(sctransform)

## ----load_data, warning=FALSE, message=FALSE, cache = T------------------
pbmc_data <- Read10X(data.dir = "~/Downloads/pbmc3k_filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc_data)

## ----apply_sct, warning=FALSE, message=FALSE, cache = T------------------
# Note that this single command replaces NormalizeData, ScaleData, and FindVariableFeatures.
# Transformed data will be available in the SCT assay, which is set as the default after running sctransform
pbmc <- SCTransform(object = pbmc, verbose = FALSE)

## ----pca, fig.width=5, fig.height=5, cache = T---------------------------
# These are now standard steps in the Seurat workflow for visualization and clustering
pbmc <- RunPCA(object = pbmc, verbose = FALSE)
pbmc <- RunUMAP(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindNeighbors(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindClusters(object = pbmc, verbose = FALSE)
DimPlot(object = pbmc, label = TRUE) + NoLegend()

## ----fplot, fig.width = 10, fig.height=10, cache = F---------------------
# Visualize canonical marker gene expression to help annotate the clusters
FeaturePlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("S100A4","CCR7","CD4","ISG15"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("TCL1A","FCER2","XCL1","FCGR3A"), pt.size = 0.3)

68 changes: 68 additions & 0 deletions inst/doc/seurat.Rmd
@@ -0,0 +1,68 @@
---
title: "Using sctransform in Seurat"
author: "Christoph Hafemeister & Rahul Satija"
date: '`r Sys.Date()`'
output:
html_document: default
pdf_document: default
vignette: >
%\VignetteIndexEntry{Using sctransform in Seurat}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library('Matrix')
library('ggplot2')
library('reshape2')
library('sctransform')
library('knitr')
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
digits = 2,
fig.width=8, fig.height=5, dpi=100, out.width = '70%'
)
library('Seurat')
#old_theme <- theme_set(theme_classic(base_size=8))
```

This vignette shows how to use the sctransform wrapper in Seurat.
Install sctransform and Seurat v3
```{r eval=FALSE}
devtools::install_github(repo = 'ChristophH/sctransform', ref = 'develop')
devtools::install_github(repo = 'satijalab/seurat', ref = 'release/3.0')
library(Seurat)
library(sctransform)
```
Load data and create Seurat object
```{r load_data, warning=FALSE, message=FALSE, cache = T}
pbmc_data <- Read10X(data.dir = "~/Downloads/pbmc3k_filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc_data)
```
Apply sctransform normalization
```{r apply_sct, warning=FALSE, message=FALSE, cache = T}
# Note that this single command replaces NormalizeData, ScaleData, and FindVariableFeatures.
# Transformed data will be available in the SCT assay, which is set as the default after running sctransform
pbmc <- SCTransform(object = pbmc, verbose = FALSE)
```
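For reference, the single `SCTransform` call takes the place of the standard log-normalization steps, and the results can be inspected with the usual Seurat v3 accessors (a sketch for comparison only; the commented-out calls and accessors below are the standard Seurat v3 API as we understand it):

```r
# the standard steps that the single SCTransform call replaces (shown for comparison only)
# pbmc <- NormalizeData(object = pbmc)
# pbmc <- FindVariableFeatures(object = pbmc)
# pbmc <- ScaleData(object = pbmc)

# transformed data live in the SCT assay, which is now the default
DefaultAssay(object = pbmc)
dim(GetAssayData(object = pbmc, slot = 'scale.data'))
```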
Perform dimensionality reduction by PCA and UMAP embedding
```{r pca, fig.width=5, fig.height=5, cache = T}
# These are now standard steps in the Seurat workflow for visualization and clustering
pbmc <- RunPCA(object = pbmc, verbose = FALSE)
pbmc <- RunUMAP(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindNeighbors(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindClusters(object = pbmc, verbose = FALSE)
DimPlot(object = pbmc, label = TRUE) + NoLegend()
```
Users can individually annotate clusters based on canonical markers. However, the sctransform normalization reveals sharper biological distinctions compared to the [standard Seurat workflow](https://satijalab.org/seurat/pbmc3k_tutorial.html), in a few ways:
* Clear separation of three CD8 T cell populations (naive, memory, effector), based on CD8A, GZMK, and CCL5 expression
* Clear separation of three CD4 T cell populations (naive, memory, IFN-activated), based on S100A4, CCR7, IL32, and ISG15
* Additional developmental sub-structure in the B cell cluster, based on TCL1A and FCER2
* Additional separation of NK cells into CD56dim vs. CD56bright clusters, based on XCL1 and FCGR3A
```{r fplot, fig.width = 10, fig.height=10, cache = F}
# Visualize canonical marker gene expression to help annotate the clusters
FeaturePlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("S100A4","CCR7","CD4","ISG15"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("TCL1A","FCER2","XCL1","FCGR3A"), pt.size = 0.3)
```
60 changes: 43 additions & 17 deletions vignettes/seurat.Rmd
@@ -1,12 +1,14 @@
---
title: "Using sctransform in Seurat"
author: "Christoph Hafemeister"
author: "Christoph Hafemeister & Rahul Satija"
date: '`r Sys.Date()`'
output:
html_document: default
pdf_document: default
vignette: |
%\VignetteIndexEntry{Using sctransform in Seurat} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}
vignette: >
%\VignetteIndexEntry{Using sctransform in Seurat}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
@@ -26,27 +28,51 @@ library('Seurat')
```

This vignette shows how to use the sctransform wrapper in Seurat.
Install sctransform and Seurat v3

Load data
```{r load_data}
pbmc_data <- readRDS(file = "~/Projects/data/pbmc3k_umi_counts.Rds")
class(x = pbmc_data)
dim(x = pbmc_data)
```{r eval=FALSE}
devtools::install_github(repo = 'ChristophH/sctransform', ref = 'develop')
devtools::install_github(repo = 'satijalab/seurat', ref = 'release/3.0')
library(Seurat)
library(sctransform)
```

Create Seurat object
```{r create_s, warning=FALSE}
s <- CreateSeuratObject(counts = pbmc_data)
Load data and create Seurat object

```{r load_data, warning=FALSE, message=FALSE, cache = T}
pbmc_data <- Read10X(data.dir = "~/Downloads/pbmc3k_filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc_data)
```

Apply sctransform normalization
```{r apply_sct}
s <- SCTransform(object = s, verbose = FALSE)

```{r apply_sct, warning=FALSE, message=FALSE, cache = T}
# Note that this single command replaces NormalizeData, ScaleData, and FindVariableFeatures.
# Transformed data will be available in the SCT assay, which is set as the default after running sctransform
pbmc <- SCTransform(object = pbmc, verbose = FALSE)
```

Perform dimensionality reduction by PCA and UMAP embedding
```{r pca, fig.width=5, fig.height=5}
s <- RunPCA(object = s, npcs = 20, verbose = FALSE)
s <- RunUMAP(object = s, dims = 1:20, verbose = FALSE)
DimPlot(object = s) + NoLegend()

```{r pca, fig.width=5, fig.height=5, cache = T}
# These are now standard steps in the Seurat workflow for visualization and clustering
pbmc <- RunPCA(object = pbmc, verbose = FALSE)
pbmc <- RunUMAP(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindNeighbors(object = pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindClusters(object = pbmc, verbose = FALSE)
DimPlot(object = pbmc, label = TRUE) + NoLegend()
```

Users can individually annotate clusters based on canonical markers. However, the sctransform normalization reveals sharper biological distinctions compared to the [standard Seurat workflow](https://satijalab.org/seurat/pbmc3k_tutorial.html), in a few ways:

* Clear separation of three CD8 T cell populations (naive, memory, effector), based on CD8A, GZMK, and CCL5 expression
* Clear separation of three CD4 T cell populations (naive, memory, IFN-activated), based on S100A4, CCR7, IL32, and ISG15
* Additional developmental sub-structure in the B cell cluster, based on TCL1A and FCER2
* Additional separation of NK cells into CD56dim vs. CD56bright clusters, based on XCL1 and FCGR3A

```{r fplot, fig.width = 10, fig.height=10, cache = F}
# Visualize canonical marker gene expression to help annotate the clusters
FeaturePlot(object = pbmc, features = c("CD8A","GZMK","CCL5","S100A4"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("S100A4","CCR7","CD4","ISG15"), pt.size = 0.3)
FeaturePlot(object = pbmc, features = c("TCL1A","FCER2","XCL1","FCGR3A"), pt.size = 0.3)
```
