V2.0 (#1)

* Test with matching gene sets
* Parallelize calls to lldist by profile
* Determine number of cores for mclapply
* saving flightpath plot to /tmp/NBClust-Plots/
* better file naming with time formatting
* file name without spaces
* fixing 7X7 in for save size
* typo fix for units
* delete detritus in rescaleProfiles
* iss161 update refineClusters
* tidy vignettes
* pass check()
* make default settings more robust and use lower max_iters in phase 1
* update docs
* partial progress
* remove geosketch redundancy and speed up subsampling
* remove dependency on rcolorbrewer
* delete comments
* add colorCellTypes function
  Function copied from ptolemy, for choosing a good initial set of colors for cell types
* document colorcelltypes
* add color info to vignettes
* add colorCellTypes function to vignettes
* fix iss165
* vignette bugfix
* copy test script for colorCellTypes from ptolemy
* Updated the temp folder creation logic
* updated package version
* Hardcode isRelease to true and comment out to deploy to all envs
* New include
* add specs and remove @exports
* add missing @params
* R CMD check fixes
* new documentation
* Remove exports
* Fix imports
* Update to MIT license
* Add to .Rbuildignore
* complete docs
* Fix documentation
* Reference global variable for R CMD check
* doc update
* Change package name to InSituType
* Change package name in Rcpp files
* Linting style changes
* Update documentation
* Unify names for vignettes
* delete settings file
* Remove writing plot to file
* Keep as one row matrix
* Render latest vignettes
* BiocCheck items
* Revert "Remove writing plot to file"
* Specify package for dataset
* Specify package for data
* Update vignettes with reference to CosMx
* Bump version
* SingleCellExperiment option for core functions
* Add dependency
* Use one core for building vignettes
* Add documentation
* keep supercluster consister when subclustering
  subclustered cells are now required to fall within their original supercluster, even if another supercluster has a greater loglik
* Bump to version 0.99.3
* add link to manuscript and citation
* update README
* correct dependencies flowchart in readme
* remove dependency on lsa package in DESCRIPTION; faster way to calculate cell-level cosine similarity with celltype profiles in get_anchor_stats function
* Don't track RStudio project file
* Remove set.seed usage
* Bump version
* Change maintainer
* Add examples
* Merged PR 15109: Update azure-pipelines.yml for Azure Pipelines
  Related work items: #175413
* Update azure-pipelines.yml for Azure Pipelines
* Updated trigger
* Removed 1.0 environments from main branch
* Added QA 1.1 environment
* Updated local path for copying
* Fixed variable
* Update azure-pipelines.yml for Azure Pipelines
* Fixed globExpression
* message help wip
* sync with main
* explicit convert to dgCMatrix for ensuring access to counts2@x in next line
* Re-add writing plot to disk.
* Implement OpenMP parallelization for lldist
* Don't save plot
* Added license
* Fix merge error
* Restore saving plot
* Use 64 bit integers
* filter anchor, platform adjustment
* fix umap in flightpath
* add utilities func
* Bump version to 1.1.0
* I add the updateReferenceProfiles function to estimate gene-level platform effect and rescale raw reference profile
* update function name
* replace the estimatePlatformEffects function with updated version
* auto anchor selection & reference update
* update estimatePlatformEffects function
* update the name of output list
* rename the output list name
* anchor threshold, lost genes
* anchor threshold
* fix testthat
* update mans
* outlier beta range
* lostgenes vs. blacklist
* sparse array compatibility
* net expr
* default mc.cores option
* handle glm error
* fastglm method
* remove warning_genes
* fix 0 bg in Poisson start
* fix zero bg in estimateBackground()
* Remove unneeded dependency
* Updated 13 files under /R
* Updated 7 files under /src
* Updated 32 files under /man
* added protein insituType clustering
* update InsituType.R
* updated all files
* testing folder updated
* testing folder updated
* InsituType function updated
* Deleted rescale_reference.R
* Updated gen_profiles_protein.R
* Added gen_profiles_protein_old.R
* Added 2 files to /data
* Renamed default_signature_matrix.csv to human_signature_matrix.csv
* Fix S4 method
* Remove build outputs
* Update docs
* Deleted gen_profiles_protein_old.R
* Updated nbclust.R
* insitutype function updated for protein
* gen_profiles_protein function updated
* Add lsa as dependency
* Add dependencies
* update insitutypeML
* update Rcpp functions
* update lldist function
* rcpp change
* updated all files
* Update pipeline for devnext.
* updated chooseclusternumber for protein
* Deleted NBClust-Plots
* Deleted NBClust-Plots
* added examples and commented lines
* Add back ARMA fix
* testing examples and tests files
* Fix example and vignettes
* Fix roxygen for Rcpp
* Add dependencies
* Deleted NBClust-Plots
* Re-add dependencies
* Fix errors in unit tests and examples
* Fix example
* Remove duplicates
* calling signature matrix from the package
* calling signature matrix from the package
* updated vignette
* updated the namespace
* Deleted NBClust-Plots
* Update version to 1.2.0
* Update ado pipeline.
* Merged PR 20275: Create "undefined" profile for cells with zero counts
  Change behavior of InSituType in all modes--supervised, unsupervised, semi-supervised--to no longer error when there are cells with zero counts but instead set their cell type to "undefined". For the other error in linked work item generate a more informative message indicating that the profiles do not have sufficient shared genes with panel.
  Related work items: #201286
* Bump version for dev branch
* fix compatability with platform correction
* add default and limit choices for assay_type
* Bump dev version to v1.2.2
* PD updates
* update docs
* update tests
* update version and NEWS and README
* code review updates
* update news
* delete duplicated functions
* new docs
* 2nd code review updates
* update FAQ - new qc recommendation
* Merged PR 21765: handle collinearity issues with fastCohorting
  handle collinearity issues with fastCohorting: 1. Reduce to 2 PC's. 2. If this fails, then try successively smaller # of cohorts with the 2 pc's.
  Related work items: #211163
* Bump version for dev to 1.2.3
* bugfix rownames(sds) was creating an error in rna mode.
* Merged PR 21963: Update pipeline for dev branch.
  Update pipeline for dev branch.
  Related work items: #211130
* faqs update
* updates per LW code review
* vignette finessing
* r cmd check updates
* faq updates
* remove vestigial plots

---------

Co-authored-by: David Ross <dross@nanostring.com>
Co-authored-by: r2ds <r2ds@Siddharth-Bhatis-Macbook-Pro.local>
Co-authored-by: patrickjdanaher <patrickjdanaher@gmail.com>
Co-authored-by: Maddy Griswold <mgriswold@nanostring.com>
Co-authored-by: Augustine <augustine@procogia.com>
Co-authored-by: siddharth.bhatia <siddharth.bhatia@procogia.com>
Co-authored-by: Nikola <ndjokic@icefyresolutions.com>
Co-authored-by: Dave Ross <48140684+davidpross@users.noreply.github.com>
Co-authored-by: Patrick Danaher <pdanaher@nanostring.com>
Co-authored-by: dan11mcguire <dan11mcguire@gmail.com>
Co-authored-by: Maksim Gorelov <maksims.gorelovs@swdfactory.com>
Co-authored-by: David Ross <dave@davidpross.com>
Co-authored-by: Lidan Wu <lwu@nanostring.com>
Co-authored-by: Yongfang <ylu@nanostring.com>
Co-authored-by: lidanwu <lidanwu2016@gmail.com>
Co-authored-by: Sangsoon Woo <sawoo@nanostring.com>
Co-authored-by: Artyom Labin <artyom.labin@seattlebiosoftware.com>
Co-authored-by: Dan McGuire <dmcguire@nanostring.com>
Nanostring-Biostats · Aug 7, 2024 · dc7819e
1 parent e06422f
commit dc7819e
Showing 89 changed files with 6,724 additions and 531 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,32 +1,45 @@
 Package: InSituType
 Type: Package
 Title: An R package for performing cell typing in SMI and other single cell data
-Version: 1.0.0
+Version: 2.0
 Authors@R: c(person("Patrick", "Danaher", email = "pdanaher@nanostring.com", role = c("aut")),
+             person("Sangsoon", "Woo", email = "sawoo@nanostring.com", role = c("aut")),
              person("Zhi", "Yang", email = "zyang@nanostring.com", role = c("aut")),
-             person("David", "Ross", email = "dross@nanostring.com", role = c("aut", "cre")))
+             person("David", "Ross", email = "dross@nanostring.com", role = c("aut", "cre")),
+             person("Lidan", "Wu", email = "lwu@nanostring.com", role = c("aut")),
+             person("Yongfang", "Lu", email = "ylu@nanostring.com", role = c("aut")))
+
 Description: Insitutype is an algorithm for performing cell typing in single cell 
              spatial transcriptomics data, such as is generated by the CosMx platform. 
              It can perform supervised cell typing from reference profiles, unsupervised clustering,
              or semi-supervised cell typing, in which both reference cell types and de novo
              clusters are fit. 
 Imports:
-  Matrix,
-  scales,
-  umap,
+  data.table,
+  dplyr,
+  fastglm,
   ggplot2,
+  graphics,
+  grDevices,
   irlba,
+  lsa,
+  magrittr,
+  Matrix,
   mclust,
-  sparseMatrixStats,
-  SummarizedExperiment,
-  SingleCellExperiment,
   methods,
+  Rcpp (>= 1.0.9),
   rlang,
-  grDevices,
-  graphics,
+  scales,
+  SingleCellExperiment,
+  sparseMatrixStats,
+  spatstat.geom,
   stats,
+  SummarizedExperiment,
+  tibble,
+  umap,
   utils,
-  Rcpp (>= 1.0.9)
+  uwot
+
 License: NanoString Technologies, Inc. Software License Agreement for Non-Commercial Use
 Encoding: UTF-8
 LazyData: true
@@ -37,5 +50,5 @@ Suggests:
 VignetteBuilder: knitr
 Depends:
   R (>= 3.5.0)
-RoxygenNote: 7.2.3
+RoxygenNote: 7.3.1
 LinkingTo: Rcpp, RcppArmadillo
diff --git a/FAQs.md b/FAQs.md
@@ -15,8 +15,33 @@ The broad Insitutype workflow is as follows:
 ![image](https://github.com/Nanostring-Biostats/InSituType/assets/4357938/45d89004-dc46-40a1-bde8-33d204e0f0b8)
 
 
+## Unsupervised vs. Supervised vs. Semi-supervised cell typing
+InSituType runs in 3 modes:
+- Supervised: call only cell types defined in reference profiles. Set `nclust = 0` to run in fully supervised mode. 
+- Unsupervised: de novo clustering, with no reference cell types. Set `reference_profiles = NULL` to run in unsupervised mode. 
+- Semi-supervised: find new clusters while also calling reference cell types. 
+
+Considerations for choosing a workflow:
+- Supervised is most convenient if you are confident that your reference profiles contain all the cell types in your dataset. 
+However, many reference profiles from scRNA-seq don't fit spatial data well, so using reference profiles can be challenging. 
+- Semi-supervised mode is the most powerful but most challenging workflow. We use this in >80% of analyses. 
+ Success hinges on how well the reference profiles are calibrated to spatial data. InSituType tries to
+ perform this calibration using anchor cells, but this does not always succeed. 
+- We recommend trying semi-supervised cell typing first, assuming there are new clusters you expect to discover. 
+- Unsupervised has no difficulty with poorly-calibrated reference profiles, but it requires you to name each cluster, 
+ which can be onerous. It may also fail to define distinctions that are important to you.
+
+## Choosing reference profiles
+Keep in mind the following when selecting reference profiles:
+- Quality of scRNA-seq references varies greatly. Finding mis-annotated cell types is not uncommon,
+and for smaller datasets, profiles of rare cell types will be noisy. Exercise some skepticism. 
+- Large platform effects separate scRNA-seq and spatial platforms. When possible, use a reference from the same platform as your data.
+- A large collection of single cell references can be found here: https://github.com/Nanostring-Biostats/cellprofilelibrary
+- A growing collection of CosMx references is here: https://github.com/Nanostring-Biostats/CosMx-Cell-Profiles
+
+
 ## Choosing nclust
-We recommend choosing a slightly generous value of nclust, then using refineClusters to condense the resulting clusters. For example, if you're running semi-supervised cell typing and you expect to find 5 new clusters, set nclust = 8. Or for unsupervised clustering with an expectation of 12 cell types, set nclust = 16. 
+We recommend choosing a slightly generous value of `nclust`, then using `refineClusters` to condense the resulting clusters. For example, if you're running semi-supervised cell typing and you expect to find 5 new clusters, set `nclust = 8`. Or for unsupervised clustering with an expectation of 12 cell types, set `nclust = 16`. 
 It's generally easy to tell when two clusters come from the same cell type: they'll be adjacent in UMAP space, and the flightpath plot will show them frequently confused with each other. 
 
 Final note: Insitutype splits big clusters with higher counts more aggressively than other clusters. For example, in a tumor study, it will subcluster tumor cells many times before it subclusters e.g. fibroblasts. The simplest solution is to increase nclust as needed, then condense the over-clustered cell type as desired. 
@@ -34,9 +59,11 @@ We suggest using the below flowchart to choose from among these options:
 
 ![image](https://github.com/Nanostring-Biostats/InSituType/assets/4357938/824dec47-2221-4fe8-92a0-15693c749d55)
 
+For more on starting with a coarse reference then subclustering, see the "Targeted subclustering" discussion further on. 
+
 ## Confidence Scores
 Insitutype returns a posterior probability for each cell type call. In practice, we have found these probabilities to be overconfident. 
-Here's an image from the preprint demonstrating this phenomenon:
+Below is an image from the preprint demonstrating this phenomenon. For various posterior probability bins, it shows the accuracy rate actually achieved (with a confidence interval). 
 
 ![image](https://github.com/Nanostring-Biostats/InSituType/assets/4357938/f02df11d-405b-411d-8049-4ab3d021d0a4)
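
The calibration check described in the Confidence Scores text (accuracy actually achieved within posterior probability bins, with a confidence interval) can be sketched in base R. This is a minimal illustration, not the code behind the preprint figure: the probabilities and correctness flags below are simulated stand-ins, and the overconfidence is built into the simulation by assumption.

```r
# Simulated stand-ins (assumption): posterior probabilities of the called cell
# type, and whether each call was correct.
set.seed(1)
post_prob <- runif(2000, min = 0.5, max = 1)
# Overconfidence is simulated on purpose: true accuracy is post_prob^2,
# which is always below the reported probability.
is_correct <- rbinom(length(post_prob), size = 1, prob = post_prob^2) == 1

# Assign each cell to a posterior probability bin.
breaks <- seq(0.5, 1, by = 0.1)
bin <- cut(post_prob, breaks = breaks, include.lowest = TRUE)

# Accuracy per bin, with an exact binomial 95% confidence interval.
calibration <- do.call(rbind, lapply(split(is_correct, bin), function(hits) {
  ci <- binom.test(sum(hits), length(hits))$conf.int
  data.frame(n = length(hits), accuracy = mean(hits),
             lower = ci[1], upper = ci[2])
}))
# Mean reported probability per bin, for comparison with achieved accuracy.
calibration$mean_prob <- as.numeric(tapply(post_prob, bin, mean))
print(calibration)
```

With real data, `post_prob` would be the posterior probability of each cell's assigned type and `is_correct` would come from a trusted ground truth; bins where `accuracy` falls below `mean_prob` indicate overconfidence.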
 

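The three modes described in the FAQ diff above are selected by two settings: whether reference profiles are supplied and whether new clusters are requested. A small sketch of that decision follows. `describe_mode` is a hypothetical helper written for illustration and is not part of InSituType; the argument names simply follow the FAQ wording (`reference_profiles`, `nclust`).

```r
# Hypothetical helper (not an InSituType function): names the mode implied by
# the two settings mentioned in the FAQ.
describe_mode <- function(reference_profiles = NULL, nclust = 0) {
  has_reference <- !is.null(reference_profiles)
  if (has_reference && nclust == 0) {
    "supervised"        # reference cell types only
  } else if (!has_reference && nclust > 0) {
    "unsupervised"      # de novo clusters only
  } else if (has_reference && nclust > 0) {
    "semi-supervised"   # reference cell types plus new clusters
  } else {
    stop("Supply reference profiles, nclust > 0, or both.")
  }
}

# Toy reference matrix (genes x cell types), invented for the example.
toy_profiles <- matrix(1, nrow = 3, ncol = 2,
                       dimnames = list(paste0("gene", 1:3), c("T cell", "B cell")))

describe_mode(toy_profiles, nclust = 0)
describe_mode(NULL, nclust = 8)
describe_mode(toy_profiles, nclust = 8)
```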
diff --git a/InSituType.Rproj b/InSituType.Rproj
@@ -0,0 +1,17 @@
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+
+RnwWeave: Sweave
+LaTeX: pdfLaTeX
+
+BuildType: Package
+PackageUseDevtools: Yes
+PackageInstallArgs: --no-multiarch --with-keep.source
diff --git a/NAMESPACE b/NAMESPACE
@@ -5,15 +5,23 @@ export(Mstep)
 export(chooseClusterNumber)
 export(choose_anchors_from_stats)
 export(colorCellTypes)
+export(estimatePlatformEffects)
 export(fastCohorting)
-export(fast_lldist)
 export(find_anchor_cells)
 export(flightpath_layout)
 export(flightpath_plot)
+export(getProteinParameters)
+export(getRNAprofiles)
+export(getSpatialContext)
 export(get_anchor_stats)
 export(insitutype)
 export(insitutypeML)
+export(lls_protein)
+export(lls_rna)
+export(numCores)
+export(refineAnchors)
 export(refineClusters)
+export(spatialUpdate)
 export(updateProfilesFromAnchors)
 export(updateReferenceProfiles)
 exportMethods(insitutype)
@@ -23,28 +31,45 @@ import(ggplot2)
 importFrom(Matrix,colSums)
 importFrom(Matrix,rowMeans)
 importFrom(Matrix,rowSums)
+importFrom(Matrix,sparseMatrix)
 importFrom(Matrix,t)
 importFrom(Rcpp,evalCpp)
 importFrom(SingleCellExperiment,SingleCellExperiment)
 importFrom(SummarizedExperiment,assay)
+importFrom(data.table,data.table)
+importFrom(data.table,melt)
+importFrom(data.table,rbindlist)
+importFrom(dplyr,filter)
+importFrom(dplyr,group_by)
+importFrom(dplyr,summarise_all)
 importFrom(grDevices,col2rgb)
 importFrom(grDevices,colors)
 importFrom(graphics,lines)
 importFrom(graphics,par)
 importFrom(graphics,plot)
+importFrom(irlba,irlba)
 importFrom(irlba,prcomp_irlba)
 importFrom(lsa,cosine)
+importFrom(magrittr,"%>%")
 importFrom(mclust,Mclust)
 importFrom(mclust,mclustBIC)
 importFrom(mclust,predict.Mclust)
 importFrom(methods,as)
 importFrom(methods,is)
 importFrom(rlang,.data)
 importFrom(scales,alpha)
+importFrom(spatstat.geom,closepairs)
+importFrom(spatstat.geom,nncross)
+importFrom(spatstat.geom,nndist)
+importFrom(spatstat.geom,nnwhich)
+importFrom(spatstat.geom,ppp)
 importFrom(stats,dnbinom)
 importFrom(stats,lm)
 importFrom(stats,qnorm)
 importFrom(stats,rnorm)
+importFrom(tibble,column_to_rownames)
+importFrom(tibble,rownames_to_column)
 importFrom(umap,umap)
 importFrom(utils,data)
+importFrom(uwot,umap_transform)
 useDynLib(InSituType, .registration = TRUE)
diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,38 @@
+# InSituType 2.0.0
+
+* Enable use in protein datasets via the assay_type argument. This required a major overhaul under the hood, but has little impact on existing RNA workflows. 
+* More advanced methods for updating reference profiles via anchor cells, implemented in `updateReferenceProfiles`.
+* New function `spatialUpdate` for using alternative data types (e.g. space or immunofluorescence) and the Insitutype likelihood framework to update cell typing results from any method. 
+* New function `getSpatialContext` for conveniently calculating cells' spatial contexts / neighborhood expression. 
+* New functions `getRNAprofiles` and `getProteinParameters`, which serve as user-facing tools for getting profile matrices. 
+
+# InSituType 1.2.3
+
+* handle collinearity issues with fastCohorting:
+    1. Reduce to 2 PCs.
+    2. If this fails, then try successively smaller numbers of cohorts with the 2 PCs.
+
+# InSituType 1.2.2
+
+* Add compatibility between `assay_type` and platform effect correction 
+
+# InSituType 1.2.1
+
+* Create "undefined" profile for cells with zero counts
+
+# InSituType 1.2.0
+
+* Also cluster continuous data from protein assay
+
+# InSituType 1.1.1
+
+* Support platform effect correction
+* Support anchor refinement via UMAP projection 
+
+# InSituType 1.1.0
+
+* Support matrices with more than 4B elements
+
 # InSituType 1.0.0
 
 * License updated

diff --git a/R/RcppExports.R b/R/RcppExports.R
@@ -16,7 +16,25 @@
 #' @importFrom Rcpp evalCpp
 #' @exportPattern "^[[:alpha:]]+" 
 #' @export
-fast_lldist <- function(mat, bgsub, x, bg, size_dnb) {
-    .Call(`_InSituType_fast_lldist`, mat, bgsub, x, bg, size_dnb)
+lls_rna <- function(mat, bgsub, x, bg, size_dnb) {
+    .Call(`_InSituType_lls_rna`, mat, bgsub, x, bg, size_dnb)
+}
+
+#' sum from Gaussian density function
+#'
+#' Probability density function of the Gaussian distribution (written in C++)
+#'
+#' @param mat dgCMatrix expression matrix
+#' @param bgsub vector of background expression per cell
+#' @param x numeric expression for reference profiles
+#' @param xsd numeric expression for reference SD profiles
+#' 
+#' @return rowSums for matrix of densities
+#' @useDynLib InSituType, .registration = TRUE
+#' @importFrom Rcpp evalCpp
+#' @exportPattern "^[[:alpha:]]+" 
+#' @export
+lls_protein <- function(mat, bgsub, x, xsd) {
+    .Call(`_InSituType_lls_protein`, mat, bgsub, x, xsd)
 }
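
For readers unfamiliar with the quantity `lls_protein` is documented to return (row sums of Gaussian densities), here is a plain R sketch for a single reference profile. It is an assumption-laden illustration, not the C++ implementation: it sums log densities, ignores the `bgsub` argument and any scaling the real function applies, and `gaussian_loglik_sketch` is a hypothetical name.

```r
# Hypothetical plain-R sketch: per-cell Gaussian log-likelihood under one
# reference profile. mat is cells x proteins; x and xsd are the profile's
# per-protein means and SDs.
gaussian_loglik_sketch <- function(mat, x, xsd) {
  stopifnot(ncol(mat) == length(x), length(x) == length(xsd), all(xsd > 0))
  # Expand means and SDs so they line up with every entry of mat.
  mu <- matrix(x, nrow = nrow(mat), ncol = ncol(mat), byrow = TRUE)
  sigma <- matrix(xsd, nrow = nrow(mat), ncol = ncol(mat), byrow = TRUE)
  logdens <- dnorm(as.vector(mat), mean = as.vector(mu),
                   sd = as.vector(sigma), log = TRUE)
  dim(logdens) <- dim(mat)
  # One log-likelihood per cell: sum across proteins.
  rowSums(logdens)
}

# Toy data, invented for the example: cell1 matches the profile exactly.
toy_mat <- rbind(cell1 = c(1, 2, 3), cell2 = c(4, 0, 6))
ll <- gaussian_loglik_sketch(toy_mat, x = c(1, 2, 3), xsd = c(1, 1, 1))
ll
```

The cell whose expression matches the profile receives the higher log-likelihood, which is the property the clustering relies on when comparing profiles.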
 
diff --git a/R/chooseClusterNumber.R b/R/chooseClusterNumber.R
@@ -3,9 +3,11 @@
 #' For a subset of the data, perform clustering under a range of cluster numbers.
 #'  Report on loglikelihood vs. number of clusters, and suggest a best choice.
 #' @param counts Counts matrix, cells * genes. 
 #' @param neg Vector of mean negprobe counts per cell
+#' @param assay_type Assay type, either "rna" or "protein" (default = "rna")
 #' @param bg Expected background
 #' @param fixed_profiles Matrix of cluster profiles to hold unchanged throughout iterations.
+#' @param fixed_sds Matrix of expression SDs, genes x cell types, to hold unchanged throughout iterations. Only used when assay_type is "protein".
 #' @param cohort Vector of cells' cohort assignments. 
 #' @param init_clust Vector of initial cluster assignments.
 #' @param n_clusts Vector giving a range of cluster numbers to consider.
@@ -35,13 +37,16 @@
 #' }
 #' @examples
 #' data("mini_nsclc")
-#' chooseClusterNumber(mini_nsclc$counts, Matrix::rowMeans(mini_nsclc$neg),
+#' chooseClusterNumber(mini_nsclc$counts, Matrix::rowMeans(mini_nsclc$neg), assay_type="RNA",
 #'  n_clust = 2:5)
+
 chooseClusterNumber <-
   function(counts,
            neg,
+           assay_type = c("rna", "protein"),
            bg = NULL,
            fixed_profiles = NULL,
+           fixed_sds = NULL,
            cohort = NULL,
            init_clust = NULL,
            n_clusts = 2:12,
@@ -53,6 +58,7 @@ chooseClusterNumber <-
            pct_drop = 0.005,
            min_prob_increase = 0.05,
            ...) {
+    assay_type <- match.arg(tolower(assay_type), c("rna", "protein"))  
 
   # infer bg if not provided: assume background is proportional to the scaling factor s
   s <- rowSums(counts)
@@ -82,6 +88,7 @@ chooseClusterNumber <-
     sharedgenes <- intersect(rownames(fixed_profiles), colnames(counts))
     counts <- counts[, sharedgenes]
     fixed_profiles <- fixed_profiles[sharedgenes, ]
+    fixed_sds <- fixed_sds[sharedgenes, ]
   }  
   # cluster under each value of n_clusts, and save loglik:
   totallogliks <- sapply(n_clusts, function(x) {
@@ -97,18 +104,23 @@ chooseClusterNumber <-
       neg = neg, 
       bg = bg, 
       fixed_profiles = fixed_profiles,
+      fixed_sds = fixed_sds, 
       cohort = cohort,
       init_clust = tempinit,
       nb_size = nb_size,
+      assay_type=assay_type,
       pct_drop = pct_drop,
       min_prob_increase = min_prob_increase,
       max_iters = max_iters)  
 
     # get the loglik of the clustering result:
     loglik_thisclust <- lldist(x = tempclust$profiles,
                                mat = counts,
+                               xsd = tempclust$sds,
                                bg = bg,
-                               size = nb_size)
+                               size = nb_size,
+                               assay_type = assay_type)
+
     total_loglik_this_clust <- sum(apply(loglik_thisclust, 1, max))
     return(total_loglik_this_clust)
   })
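
The last lines of this hunk score each candidate cluster number by summing, over cells, the best log-likelihood any cluster achieves. A toy illustration of that scoring step follows; the log-likelihood matrices are invented numbers, and `total_loglik` is a hypothetical helper name, not a package function.

```r
# Toy per-cell log-likelihoods (rows = cells, columns = clusters); values are
# invented for illustration.
loglik_2 <- matrix(c(-10, -12,
                     -11,  -9,
                     -15, -14), ncol = 2, byrow = TRUE)
loglik_3 <- matrix(c(-10, -12,  -8,
                     -11,  -9, -13,
                     -15, -14, -10), ncol = 3, byrow = TRUE)

# Same reduction as in the diff: each cell contributes its best cluster's loglik.
total_loglik <- function(loglik) sum(apply(loglik, 1, max))

totals <- c(`2` = total_loglik(loglik_2), `3` = total_loglik(loglik_3))
totals
```

Because adding clusters can only raise each cell's maximum, the total tends to increase with cluster number, which is presumably why the function also takes thresholds such as `pct_drop` and `min_prob_increase` rather than simply picking the largest total.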

diff --git a/R/colorCellTypes.R b/R/colorCellTypes.R
@@ -26,10 +26,11 @@
 #'  n_phase2 = 500,
 #'  n_phase3 = 2000,
 #'  n_starts = 1,
-#'  max_iters = 5
+#'  max_iters = 5,
+#'  assay_type="RNA"
 #' ) # choosing inadvisably low numbers to speed the vignette; using the defaults is recommended.
 #' colorCellTypes(freqs = table(unsup$clust), palette = "brewers")
-#' 
+
 colorCellTypes <- function(names = NULL, freqs = NULL, init_colors = NULL, max_sum_rgb = 600, 
                            palette = "earthplus") {