vignettes/NSCLC-clustering-vignette.Rmd

---
title: "Clustering a small CosMx dataset with Insitutype"
output: 
  rmarkdown::html_vignette: 
    toc: true
fig_width: 10 
fig_height: 6 
vignette: >
  %\VignetteIndexEntry{Clustering a small CosMx dataset with Insitutype}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
<style>
p.caption {
  font-size: 1.5em;
}
</style>
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
### Installation
```{r installation, eval=FALSE}

# TBD

```   
### Overview
This vignette demonstrates the use of the insitutype algorithm to cluster [CosMx](https://www.nature.com/articles/s41587-022-01483-z) data. CosMx is a spatial molecular imager for sub-cellular resolution of mRNA transcripts and protein targets. We analyze a subset of the [CosMx NSCLC showcase dataset](https://nanostring.com/cosmx-dataset). 

### Data preparation
First, we load the package and access the example data:
```{r setup}
library(InSituType)
options(mc.cores = 1)
data("ioprofiles")
data("iocolors")
data("mini_nsclc")
str(mini_nsclc)
set.seed(0)
```
#### Necessary inputs
First, let us look at the data we need. 
1. A matrix of counts data, cells x genes:
```{r strcounts}
counts <- mini_nsclc$counts
str(counts)
counts[25:30, 9:14]

```
2. A vector giving each cell mean negative control value:
```{r strnegmean}
negmean <- Matrix::rowMeans(mini_nsclc$neg)
head(negmean)
```
3. Optionally, a vector of each cell expected background counts per gene. If not provided, insitutype will estimate it from the negative controls. 
#### Incorporating additional data types
Single-cell spatial transcriptomics datasets offer data types beyond simple gene expression, from single-cell images to spatial contexts. 
Insitutype digests this information in the form of "cohorts" of cells with similar data from the alternative data types. 
Then, when assigning cell types, Insitutype will considers the frequency of each cell type in each cohort. 
For example, if you have a cohort of mainly CD45- cells, Insitutype will see that immune cells are 
very uncommon within this cohort, and it will give cells in this cohort lower probabilities of 
falling in immune cell clusters. (Insitutype doe not know about CD45 staining; it just considers the 
frequency of each cluster in each cohort.)

Cohort definition is left to the user, and many approaches are reasonable. 
For example, you might:

* Define cohort by gating cells immunofluorescence stains, e.g. PanCK+/- or CD45+/-
* Cluster cells on their continuous immunofluorescence values
* Use an autoencoder to extract features from cells images, then cluster on those features. 
* Use the results of spatial clustering / "niches" to define cohorts
* Cluster on information from multiple data types together, such as the average gene expression 
 profile of cell neighbors AND immunofluorescence. 

Cohorting is a low-risk undertaking, so it is recommended whenever possible. 
A poorly-defined cohort will simply be uninformative; 
 it will not send the clustering algorithm astray. 
The more cohorts you can define, and the higher-quality those cohorts, the more information
 Insitutype will be able to gleam from the alternative data types. 

One warning: **do not define cohorts based on variables you want to study.**
For example, if you wish to study how cell populations shift across spatial context, 
 then spatial context must not be allowed to inform cell typing. To do so would introduce circularity. 

In the below example, we will define cohorts using the "fastCohorting" function. 
 This function is provided as a convenient way to derive cohorts from diverse data types; it makes no claim to being optimal. 

```{r cohorting, echo = TRUE}
# simulate immunofluorescence data: 
immunofluordata <- matrix(rpois(n = nrow(counts) * 4, lambda = 100), 
                          nrow(counts))
# perform automatic cohorting:
cohort <- fastCohorting(immunofluordata,
                        gaussian_transform = TRUE) 
# ("Gaussian_transform = TRUE" maps variables to gaussians in order to 
#  place dramatically different variables on the same scale.)
table(cohort)
```

### Unsupervised clustering:

The below command performs unsupervised clustering. 
We will let the algorithm try to select the optimal number of clusters from 
a range of plausible options using the n_clusts argument. 

Note: this should take a few minutes on a standard laptop. 

```{r unsup, echo=TRUE}
unsup <- insitutype(
  x = counts,
  neg = negmean,
  cohort = cohort,
  # Enter your own per-cell background estimates here if you have them;
  # otherwise insitutype will use the negprobes to estimate background for you.
  bg = NULL,
  # condensed to save time. n_clusts = 5:20 would be more optimal
  n_clusts = c(10, 12),
  # NULL value runs unsupervised clustering; entering a matrix here would run
  # semi-supervised clustering.
  reference_profiles = NULL,
  n_phase1 = 200,
  n_phase2 = 500,
  n_phase3 = 2000,
  n_starts = 1,
  max_iters = 5
) # choosing inadvisably low numbers to speed the vignette; using the defaults in recommended.
```

The function returns 4 useful outputs: a vector of cells cluster assignments, 
a vector of the posterior probability for each cell typing call,
a matrix of cells log-likelihoods under each cluster, 
and a matrix of the mean background-subtracted expression profile of each cluster. 
A fifth output, ignorable for unsupervised clustering, records log-likelihoods under clusters that ended up with no cells assigned to them.

```{r unsup_results, fig.width = 8, fig.height = 5}
str(unsup)
round(head(unsup$prob), 2)
heatmap(sweep(unsup$profiles, 1, pmax(apply(unsup$profiles, 1, max), .2), "/"), scale = "none",
        main = "Cluster mean expression profiles")

# define cluster colors:
cols <-
  c(
    '#8DD3C7',
    '#BEBADA',
    '#FB8072',
    '#80B1D3',
    '#FDB462',
    '#B3DE69',
    '#FCCDE5',
    '#D9D9D9',
    '#BC80BD',
    '#CCEBC5',
    '#FFED6F',
    '#E41A1C',
    '#377EB8',
    '#4DAF4A',
    '#984EA3',
    '#FF7F00',
    '#FFFF33',
    '#A65628',
    '#F781BF',
    '#999999'
  )
cols <- cols[seq_along(unique(unsup$clust))]
names(cols) <- unique(unsup$clust)

par(mfrow = c(1, 2))
par(mar = c(0, 0, 3, 0))

plot(mini_nsclc$x, mini_nsclc$y, pch = 16, cex = .75, asp = 1, cex.main = 0.75,
          main = "cells in physical space",
     col = cols[unsup$clust], xlab = "", ylab = "", xaxt = "n", yaxt = "n")

plot(mini_nsclc$umap, pch = 16, cex = .75, asp = 1, cex.main = 0.75,
     main = "cells in UMAP space",     
     col = cols[unsup$clust], xlab = "", ylab = "", xaxt = "n", yaxt = "n")
legend("bottomleft", pch = 16, col = cols, legend = names(cols), cex = 0.7)


```
### Visualizing clustering results
Insitutype calculates the probability of each cell belonging to each cluster. 
This information lets us understand how clusters relate to each other. 
The flightpath_plot function arrays cells according to their cluster probabilities:
```{r flightpath, fig.width = 5, fig.height = 5}
# define colors:
cols <-
  c(
    '#8DD3C7',
    '#BEBADA',
    '#FB8072',
    '#80B1D3',
    '#FDB462',
    '#B3DE69',
    '#FCCDE5',
    '#D9D9D9',
    '#BC80BD',
    '#CCEBC5',
    '#FFED6F',
    '#E41A1C',
    '#377EB8',
    '#4DAF4A',
    '#984EA3',
    '#FF7F00',
    '#FFFF33',
    '#A65628',
    '#F781BF',
    '#999999'
  )
cols <- cols[seq_along(unique(unsup$clust))]
names(cols) <- unique(unsup$clust)

# make the flightpath plot
fp <- flightpath_plot(flightpath_result = NULL, insitutype_result = unsup, col = cols[unsup$clust])
class(fp)
print(fp)


```
In the above plot, a cell atop a cluster centroid is near-certain, while cells 
 halfway between two centroids have 50% posterior probability of belonging to each of those two clusters. 
The number by each centroid gives the average posterior probability of cells in the cluster. 
Expect to see this value above 0.9 for clustering major cell type, and above 0.75 when clustering apart closely-related cell types. 
Clusters with average posterior probabilities are unreliable. You can either interpret them with caution,
 or you can dissolve them and reassign their cells to other clusters. (The next section covers this operation.)

Here we can see that clusters f is sometimes confused with clusters g and e (though at a very low rate).

You can also extract the flightpath layout for plotting on your own:

```{r fp_layout}
# compute the flightpath layout:
fp_layout <- flightpath_layout(logliks = unsup$logliks, profiles = unsup$profiles)
str(fp_layout)
# plot it:
#par(mar = c(0,0,0,0))
#plot(fp_layout$cellpos, pch = 16, cex = 0.5, col = cols[unsup$clust], xaxt = "n", yaxt = "n", xlab = "", ylab = "")
#text(fp_layout$clustpos, rownames(fp_layout$clustpos), cex = 0.5)
```

#### Updating clustering results 

After close review, you might want to update your clustering results. 

The function refineCells can modify your results in the following ways:

* merge closely-related or frequently-confused clusters 
* delete clusters (Sometimes a "catch-all" cluster arises that is frequently confused wiht multiple other clusters. It is best to delete these clusters and let their cells get reassigned to meaningful clusters.)
* sub-cluster large clusters

Here we use the refineClusters function to perform all these operations. 
We also use its "merges" argument to rename cluster c. 

```{r refineClusters, fig.width = 5, fig.height = 5}
# refine the clusters:
newclusts <- refineClusters(logliks = unsup$logliks,
                            merges = c("g" = "newcluster", "e" = "newcluster", "c" = "c_renamed"), 
                            to_delete = "a",
                            subcluster = c("f" = 2),
                            counts = counts,
                            neg = negmean) 
str(newclusts)

# plot the updated results:
cols <-
  c(
    '#8DD3C7',
    '#BEBADA',
    '#FB8072',
    '#80B1D3',
    '#FDB462',
    '#B3DE69',
    '#FCCDE5',
    '#D9D9D9',
    '#BC80BD',
    '#CCEBC5',
    '#FFED6F',
    '#E41A1C',
    '#377EB8',
    '#4DAF4A',
    '#984EA3',
    '#FF7F00',
    '#FFFF33',
    '#A65628',
    '#F781BF',
    '#999999'
  )
cols <- cols[seq_along(unique(newclusts$clust))]
names(cols) <- unique(newclusts$clust)
fp <- flightpath_plot(flightpath_result = NULL, insitutype_result = newclusts, col = cols[newclusts$clust])
class(fp)
print(fp)


```


### Coloring cell types

The colorCellTypes function attempts to choose good colors for your cell types. 
Some of its options use the frequency of cell types in this decision. 
Here are 3 versions:

```{r colorCellTypes, fig.width = 8, fig.height = 5}


par(mfrow = c(1, 3))
par(mar = c(0, 0, 3, 0))

plot(mini_nsclc$x, mini_nsclc$y, pch = 16, cex = .75, asp = 1, cex.main = 0.75,
          main = "brewers palette",
     col = colorCellTypes(freqs = table(unsup$clust), palette = "brewers")[unsup$clust], 
     xlab = "", ylab = "", xaxt = "n", yaxt = "n")

plot(mini_nsclc$x, mini_nsclc$y, pch = 16, cex = .75, asp = 1, cex.main = 0.75,
          main = "tableau20 palette",
     col = colorCellTypes(freqs = table(unsup$clust), palette = "tableau20")[unsup$clust], 
     xlab = "", ylab = "", xaxt = "n", yaxt = "n")

plot(mini_nsclc$x, mini_nsclc$y, pch = 16, cex = .75, asp = 1, cex.main = 0.75,
          main = "earthplus palette",
     col = colorCellTypes(freqs = table(unsup$clust), palette = "earthplus")[unsup$clust], 
     xlab = "", ylab = "", xaxt = "n", yaxt = "n")

```

### Session Info

```{r sessioninfo}
sessionInfo()
```