Update documentation

satijalab · Jan 8, 2022 · e4deca0
1 parent c3111d3

Showing 13 changed files with 133 additions and 45 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,16 +1,22 @@
 Package: sctransform
 Type: Package
 Title: Variance Stabilizing Transformations for Single Cell UMI Data
-Version: 0.3.2.9009
-Authors@R: person(given = 'Christoph', family = 'Hafemeister', email = 'christoph.hafemeister@nyu.edu', role = c('aut', 'cre'), comment = c(ORCID = '0000-0001-6365-8254'))
+Version: 0.3.2.9010
+Date: 2022-01-08
+Authors@R: c(
+  person(given = "Christoph", family = "Hafemeister", email = "christoph.hafemeister@nyu.edu", role = c("aut", "cre"), comment = c(ORCID = "0000-0001-6365-8254")),
+  person(given = "Saket", family = "Choudhary", email = "schoudhary@nygenome.org", role = "ctb", comment = c(ORCID = "0000-0001-5202-7633")),
+  person(given = "Rahul", family = "Satija", email = "rsatija@nygenome.org", role = "ctb", comment = c(ORCID = "0000-0001-9448-8833"))
+  )
 Description: A normalization method for single-cell UMI count data using a 
   variance stabilizing transformation. The transformation is based on a 
   negative binomial regression model with regularized parameters. As part of the
   same regression framework, this package also provides functions for
-  batch correction, and data correction. See Hafemeister and Satija 2019 
-  <doi:10.1186/s13059-019-1874-1> for more details.
-URL: https://github.com/ChristophH/sctransform
-BugReports: https://github.com/ChristophH/sctransform/issues
+  batch correction, and data correction. See Hafemeister and Satija (2019)
+  <doi:10.1186/s13059-019-1874-1>, and Choudhary and Satija (2021) <doi:10.1101/2021.07.07.451498>
+  for more details.
+URL: https://github.com/satijalab/sctransform
+BugReports: https://github.com/satijalab/sctransform/issues
 License: GPL-3 | file LICENSE
 Encoding: UTF-8
 LazyData: true

diff --git a/NAMESPACE b/NAMESPACE
@@ -62,6 +62,7 @@ importFrom(stats,predict)
 importFrom(stats,t.test)
 importFrom(stats,var)
 importFrom(utils,capture.output)
+importFrom(utils,packageVersion)
 importFrom(utils,setTxtProgressBar)
 importFrom(utils,txtProgressBar)
 useDynLib(sctransform)
diff --git a/NEWS.md b/NEWS.md
@@ -1,6 +1,23 @@
 # News
 All notable changes will be documented in this file.
 
+## [0.3.3] - UNRELEASED
+
+### Added
+- `vst.flavor` argument to `vst()` to allow for invoking the updated regularization (sctransform v2, proposed in [Choudhary and Satija, 2021](https://doi.org/10.1101/2021.07.07.451498)). See paper for details.
+- `scale_factor` to `correct()` to allow for a custom library size when correcting counts
+
+
+## [0.3.2.9008] - 2021-07-28
+### Added
+- Add future.seed = TRUE to all `future_lapply()` calls
+
+### Changed
+- Wrap MASS::theta.ml() in suppressWarnings()
+
+### Fixed
+- Fix logical comparison of vectors of length one in `diff_mean_test()`
+
 ## [0.3.2.9003] - 2020-02-11
 ### Added
 - `compare` argument to the nonparametric differential expression test `diff_mean_test()` to allow for multiple comparisons and various ways to specify which groups to compare
@@ -39,7 +56,7 @@ All notable changes will be documented in this file.
 - Remove `poisson_fast` method (replaced by `qpoisson`)
 - Use `matrixStats` package and remove `RcppEigen` dependency
 - Use quasi poisson regression where possible
-- Define cell detection event as counts >= 0.01 (instead of > 0) - this only matters to people playing around with fractional counts (see [issue #65](https://github.com/ChristophH/sctransform/issues/65))
+- Define cell detection event as counts >= 0.01 (instead of > 0) - this only matters to people playing around with fractional counts (see [issue #65](https://github.com/satijalab/sctransform/issues/65))
 - Internal code restructuring and improvements
 
 ### Fixed

diff --git a/R/denoise.R b/R/denoise.R
@@ -160,6 +160,8 @@ correct <- function(x, data = 'y', cell_attr = x$cell_attr, as_is = FALSE,
 #' @param x A list that provides model parameters and optionally meta data; use output of vst function
 #' @param umi The count matrix
 #' @param cell_attr Provide cell meta data holding latent data info
+#' @param scale_factor Replace all values of UMI in the regression model by this value. Default is NA
+#' which uses median of total UMI as the latent factor.
 #' @param verbosity An integer specifying whether to show only messages (1), messages and progress bars (2) or nothing (0) while the function is running; default is 2
 #' @param verbose Deprecated; use verbosity instead
 #' @param show_progress Deprecated; use verbosity instead
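
Note on the new `scale_factor` parameter: the documentation above says the default `NA` falls back to the median of total UMI, while any other value replaces the per-cell UMI totals in the regression model. The fallback can be sketched as follows. `resolve_library_size` is a hypothetical helper written for illustration only; it is not a function in sctransform and is not claimed to match the package's internal code.

```r
# Hypothetical helper (NOT part of sctransform): pick the library size that
# stands in for every cell's total UMI when correcting counts.
resolve_library_size <- function(umi, scale_factor = NA) {
  total_umi <- colSums(umi)  # total UMI per cell; cells are columns
  if (is.na(scale_factor)) {
    median(total_umi)        # documented default: median of total UMI
  } else {
    scale_factor             # user-supplied custom library size
  }
}

# Toy genes-by-cells count matrix (3 genes, 3 cells).
umi <- matrix(c(1, 0, 3,
                2, 2, 0,
                0, 5, 1), nrow = 3, byrow = TRUE)

default_size <- resolve_library_size(umi)
custom_size <- resolve_library_size(umi, scale_factor = 1e4)
```

With the toy matrix the per-cell totals are 3, 7 and 4, so the default resolves to their median, while the second call simply returns the supplied value.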

diff --git a/R/utils.R b/R/utils.R
@@ -495,13 +495,13 @@ get_model_var <- function(vst_out, cell_attr = vst_out$cell_attr, use_nonreg = F
 
 #' Get median of non zero UMIs from a count matrix using a subset of genes (slow)
 #'
-#' @param cm Count matrix
+#' @param umi Count matrix
 #' @param genes List of genes to calculate statistics. Default is NULL which returns the non-zero median using all genes
 #'
 #' @return A numeric value representing the median of non-zero entries from the UMI matrix
 get_nz_median <- function(umi, genes = NULL){
   cm.T <- Matrix::t(umi)
-  n_g <- dim(cm)[1]
+  n_g <- dim(umi)[1]
   allnonzero <- c()
   if (is.null(genes)) {
     gene_index <- seq(1, nrow(umi))
@@ -517,10 +517,10 @@ get_nz_median <- function(umi, genes = NULL){
 
 #' Get median of non zero UMIs from a count matrix
 #'
-#' @param cm Count matrix
+#' @param umi Count matrix
 #'
 #' @return A numeric value representing the median of non-zero entries from the UMI matrix
-get_nz_median2 <- function(umi, genes = NULL){
+get_nz_median2 <- function(umi){
   return (median(umi@x))
 }
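
Note on `get_nz_median2()`: it relies on the `@x` slot of a sparse matrix, which holds only the explicitly stored entries. A minimal sketch on a toy matrix, assuming the input is a `dgCMatrix` with no explicitly stored zeros (the toy data and variable names are made up for illustration):

```r
library(Matrix)

# Toy genes-by-cells count matrix in compressed sparse column format.
# Non-zero entries: (2,1)=5, (1,2)=2, (3,2)=1, (3,3)=8.
umi <- sparseMatrix(i = c(2, 1, 3, 3),
                    j = c(1, 2, 2, 3),
                    x = c(5, 2, 1, 8),
                    dims = c(3, 3))

# Fast path: median over the stored (non-zero) values only.
nz_median_fast <- median(umi@x)

# Reference: densify the matrix and filter the zeros explicitly.
dense <- as.matrix(umi)
nz_median_ref <- median(dense[dense > 0])
```

Both approaches agree here; the `@x` version avoids densifying the matrix, which matters for large count matrices.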
 
diff --git a/R/vst.R b/R/vst.R
@@ -34,10 +34,11 @@ NULL
 #' @param gmean_eps Small value added when calculating geometric mean of a gene to avoid log(0); default is 1
 #' @param theta_estimation_fun Character string indicating which method to use to estimate theta (when method = poisson); default is 'theta.ml', but 'theta.mm' seems to be a good and fast alternative
 #' @param theta_given If method is set to nb_theta_given, this should be a named numeric vector of fixed theta values for the genes; if method is offset, this should be a single value; default is NULL
+#' @param exclude_poisson Exclude poisson genes (i.e. mu < 0.001 or mu > variance) from regularization; default is FALSE
 #' @param use_geometric_mean Use geometric mean instead of arithmetic mean for all calculations ; default is TRUE
-#' @param use_geometric_mean_offset Use geoemtric mean insteaf of arithmetic mean in the offset model; default is FALSE
+#' @param use_geometric_mean_offset Use geometric mean instead of arithmetic mean in the offset model; default is FALSE
 #' @param fix_intercept Fix intercept as defined in the offset model; default is FALSE
-#' @param fix_slope Fix slope to log(10) (eqivalent to using library size as an offset); default is FALSE
+#' @param fix_slope Fix slope to log(10) (equivalent to using library size as an offset); default is FALSE
 #' @param scale_factor Replace all values of UMI in the regression model by this value instead of the median UMI; default is NA
 #' @param vst.flavor When set to `v2` sets method = glmGamPoi_offset, n_cells=2000, and exclude_poisson = TRUE which causes the model to learn theta and intercept only besides excluding poisson genes from learning and regularization; default is NULL which uses the original sctransform model
 #' @param verbosity An integer specifying whether to show only messages (1), messages and progress bars (2) or nothing (0) while the function is running; default is 2
@@ -97,6 +98,7 @@ NULL
 #' @importFrom stats glm glm.fit df.residual ksmooth model.matrix as.formula approx density poisson var bw.SJ
 #' @importFrom utils txtProgressBar setTxtProgressBar capture.output
 #' @importFrom methods as
+#' @importFrom utils packageVersion
 #'
 #' @export
 #'
@@ -206,7 +208,6 @@ vst <- function(umi,
   umi <- umi[genes, ]
   if (use_geometric_mean){
     genes_log_gmean <- log10(row_gmean(umi, eps = gmean_eps))
-
   } else {
     genes_log_gmean <- log10(rowMeans(umi))
   }
@@ -312,7 +313,8 @@ vst <- function(umi,
     model_pars_fit <- reg_model_pars(model_pars, genes_log_gmean_step1, genes_log_gmean, cell_attr,
                                      batch_var, cells_step1, genes_step1, umi, bw_adjust, gmean_eps,
                                      theta_regularization, genes_amean, genes_var,
-                                     exclude_poisson, fix_intercept, fix_slope, use_geometric_mean_offset, verbosity)
+                                     exclude_poisson, fix_intercept, fix_slope,
+                                     use_geometric_mean, use_geometric_mean_offset, verbosity)
     model_pars_outliers <- attr(model_pars_fit, 'outliers')
   } else {
     model_pars_fit <- model_pars
@@ -710,7 +712,8 @@ reg_model_pars <- function(model_pars, genes_log_gmean_step1, genes_log_gmean, c
                            batch_var, cells_step1, genes_step1, umi, bw_adjust, gmean_eps,
                            theta_regularization,
                            genes_amean = NULL, genes_var = NULL, exclude_poisson = FALSE,
-                           fix_intercept = FALSE, fix_slope = FALSE, use_geometric_mean_offset = FALSE, verbosity = 0) {
+                           fix_intercept = FALSE, fix_slope = FALSE, use_geometric_mean = TRUE,
+                           use_geometric_mean_offset = FALSE, verbosity = 0) {
   genes <- names(genes_log_gmean)
   if (exclude_poisson | fix_slope | fix_intercept){
     # exclude this from the fitting procedure entirely
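
Note on `exclude_poisson`: the parameter documentation in this diff describes Poisson genes as those with `mu < 0.001` or `mu > variance`. The rule can be sketched on a dense toy matrix as below. `flag_poisson_genes` is a hypothetical helper for illustration, not the package's implementation, and it uses the arithmetic mean and sample variance per gene as an assumption.

```r
# Hypothetical helper (NOT part of sctransform): flag genes that show no
# overdispersion under the rule "mu < 0.001 or mu > variance".
flag_poisson_genes <- function(umi) {
  genes_amean <- rowMeans(umi)        # arithmetic mean per gene
  genes_var <- apply(umi, 1, var)     # sample variance per gene
  genes_amean < 0.001 | genes_amean > genes_var
}

# Toy genes-by-cells counts: one bursty gene, one with little spread.
umi <- rbind(
  overdispersed  = c(0, 0, 10, 0, 30, 0),
  underdispersed = c(1, 1, 2, 1, 1, 2)
)

flags <- flag_poisson_genes(umi)
```

In this toy case only the second gene is flagged, since its mean exceeds its variance.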

diff --git a/README.md b/README.md
@@ -4,48 +4,65 @@
 This package was developed by Christoph Hafemeister in [Rahul Satija's lab](https://satijalab.org/) at the New York Genome Center. Core functionality of this package has been integrated into [Seurat](https://satijalab.org/seurat/), an R package designed for QC, analysis, and exploration of single cell RNA-seq data.
 
 ## Quick start
-`devtools::install_github(repo = 'ChristophH/sctransform')`  
-`normalized_data <- sctransform::vst(umi_count_matrix)$y`
 
-(you can also install from CRAN: `install.packages('sctransform'))`)
+```r
+# Install sctransform from CRAN
+# install.packages("sctransform")
+
+# Or the development version from GitHub:
+# install.packages("remotes")
+remotes::install_github("satijalab/sctransform", ref="develop")
+
+normalized_data <- sctransform::vst(umi_count_matrix)$y
+```
+
+To invoke the `v2` flavor:
+
+```r
+normalized_data <- sctransform::vst(umi_count_matrix, vst.flavor="v2")$y
+
+# Using Seurat
+seurat_object <- Seurat::SCTransform(seurat_object, vst.flavor="v2")
+```
 
 ## Help
+
 For usage examples see vignettes in inst/doc or use the built-in help after installation  
 `?sctransform::vst`  
 
 Available vignettes:  
-[Variance stabilizing transformation](https://rawgit.com/ChristophH/sctransform/supp_html/supplement/variance_stabilizing_transformation.html)  
-[Using sctransform in Seurat](https://rawgit.com/ChristophH/sctransform/supp_html/supplement/seurat.html)  
 
-## Known Issues
+- [Variance stabilizing transformation](https://rawgit.com/satijalab/sctransform/supp_html/supplement/variance_stabilizing_transformation.html)  
+- [Using sctransform in Seurat](https://rawgit.com/satijalab/sctransform/supp_html/supplement/seurat.html)  
 
-* `Error in is.nan` when a batch variable is used. Fixed in the develop branch. ([issue #88](https://github.com/ChristophH/sctransform/issues/88))
-* `node stack overflow` error when Rfast package is loaded. The Rfast package does not play nicely with the future.apply package. Try to avoid loading the Rfast package. See discussions: https://github.com/RfastOfficial/Rfast/issues/5 https://github.com/ChristophH/sctransform/issues/108
+## Known Issues
 
-To install from the develop branch run `remotes::install_github("ChristophH/sctransform@develop")`
+* `node stack overflow` error when Rfast package is loaded. The Rfast package does not play nicely with the future.apply package. Try to avoid loading the Rfast package. See discussions: https://github.com/RfastOfficial/Rfast/issues/5 https://github.com/satijalab/sctransform/issues/108
 
-Please use [the issue tracker](https://github.com/ChristophH/sctransform/issues) if you encounter a problem
+Please use [the issue tracker](https://github.com/satijalab/sctransform/issues) if you encounter a problem
 
 ## News
-For a detailed change log have a look at the file [NEWS.md](https://github.com/ChristophH/sctransform/blob/master/NEWS.md)
+For a detailed change log have a look at the file [NEWS.md](https://github.com/satijalab/sctransform/blob/master/NEWS.md)
 
 ### v0.3.2
-This release improves the coefficient initialization in quasi poisson regression that sometimes led to errors. There are also some minor bug fixes and a new non-parametric differential expression test for sparse non-negative data (`diff_mean_test`, [this vignette](https://rawgit.com/ChristophH/sctransform/supp_html/supplement/np_diff_mean_test.html) gives some details).
+This release improves the coefficient initialization in quasi poisson regression that sometimes led to errors. There are also some minor bug fixes and a new non-parametric differential expression test for sparse non-negative data (`diff_mean_test`, [this vignette](https://rawgit.com/satijalab/sctransform/supp_html/supplement/np_diff_mean_test.html) gives some details).
 
 ### v0.3.1
 This release fixes a performance regression when `sctransform::vst` was called via `do.call`, as is the case in the Seurat wrapper. 
 
 Additionally, model fitting is significantly faster now, because we use a fast Rcpp quasi poisson regression implementation (based on `Rfast` package). This applies to methods `poisson`, `qpoisson` and `nb_fast`.
 
-The `qpoisson` method is new and uses the dispersion parameter from the quasi poisson regression directly to estimate `theta` for the NB model. This can speed up the model fitting step considerably, while giving similar results to the other methods. [This vignette](https://rawgit.com/ChristophH/sctransform/supp_html/supplement/method_comparison.html) compares the methods.
+The `qpoisson` method is new and uses the dispersion parameter from the quasi poisson regression directly to estimate `theta` for the NB model. This can speed up the model fitting step considerably, while giving similar results to the other methods. [This vignette](https://rawgit.com/satijalab/sctransform/supp_html/supplement/method_comparison.html) compares the methods.
 
 ### v0.3
-The latest version of `sctransform` now supports the [glmGamPoi](https://github.com/const-ae/glmGamPoi) package to speed up the model fitting step. You can see more about the different methods supported and how they compare in terms of results and speed [in this new vignette](https://rawgit.com/ChristophH/sctransform/supp_html/supplement/method_comparison.html).
+The latest version of `sctransform` now supports the [glmGamPoi](https://github.com/const-ae/glmGamPoi) package to speed up the model fitting step. You can see more about the different methods supported and how they compare in terms of results and speed [in this new vignette](https://rawgit.com/satijalab/sctransform/supp_html/supplement/method_comparison.html).
+
+Also note that default theta regularization is now based on overdispersion factor (`1 + m / theta` where m is the geometric mean of the observed counts) not `log10(theta)`. The old behavior is still available via `theta_regularization` parameter. You can see how this changes (or doesn't change) the results [in this new vignette](https://rawgit.com/satijalab/sctransform/supp_html/supplement/theta_regularization.html).
+
 
-Also note that default theta regularization is now based on overdispersion factor (`1 + m / theta` where m is the geometric mean of the observed counts) not `log10(theta)`. The old behavior is still available via `theta_regularization` parameter. You can see how this changes (or doesn't change) the results [in this new vignette](https://rawgit.com/ChristophH/sctransform/supp_html/supplement/theta_regularization.html).
+## References
 
+- Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol 20, 296 (December 23, 2019).  [https://doi.org/10.1186/s13059-019-1874-1](https://doi.org/10.1186/s13059-019-1874-1). An early version of this work was used in the paper [Developmental diversification of cortical inhibitory interneurons, Nature 555, 2018](https://github.com/ChristophH/in-lineage).
 
-## Reference
-Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol 20, 296 (December 23, 2019). [https://doi.org/10.1186/s13059-019-1874-1](https://doi.org/10.1186/s13059-019-1874-1)
+- Choudhary, S. & Satija, R. Comparison and evaluation of statistical error models for scRNA-seq. bioRxiv (2021). [https://doi.org/10.1101/2021.07.07.451498](https://doi.org/10.1101/2021.07.07.451498)
 
-An early version of this work was used in the paper [Developmental diversification of cortical inhibitory interneurons, Nature 555, 2018](https://github.com/ChristophH/in-lineage).
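
Note on the overdispersion factor mentioned in the README's v0.3 notes: `1 + m / theta` is the ratio of the negative binomial variance `m + m^2 / theta` to the mean `m`, so it equals 1 in the Poisson limit of very large `theta`. A small worked example, with a helper name chosen here for illustration only:

```r
# Overdispersion factor as written in the README: 1 + m / theta.
overdispersion_factor <- function(m, theta) 1 + m / theta

m <- 2       # stand-in for a gene's (geometric) mean count
theta <- 4   # negative binomial dispersion parameter

od <- overdispersion_factor(m, theta)

# Negative binomial variance for mean m and dispersion theta.
nb_var <- m + m^2 / theta

# Poisson limit: as theta grows without bound the factor approaches 1.
od_poisson <- overdispersion_factor(m, Inf)
```

Here the factor times the mean reproduces the negative binomial variance, which is one way to see why the factor is a natural scale for regularizing `theta`.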
diff --git a/man/correct.Rd b/man/correct.Rd
diff --git a/man/correct_counts.Rd b/man/correct_counts.Rd
diff --git a/man/get_nz_median.Rd b/man/get_nz_median.Rd
diff --git a/man/get_nz_median2.Rd b/man/get_nz_median2.Rd