-
Notifications
You must be signed in to change notification settings - Fork 11
/
Copy pathNSCLC-clustering-vignette.Rmd
335 lines (286 loc) · 11.2 KB
/
NSCLC-clustering-vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
---
title: "Clustering a small CosMx dataset with Insitutype"
output:
rmarkdown::html_vignette:
toc: true
fig_width: 10
fig_height: 6
vignette: >
%\VignetteIndexEntry{Clustering a small CosMx dataset with Insitutype}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
<style>
p.caption {
font-size: 1.5em;
}
</style>
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
### Installation
```{r installation, eval=FALSE}
# TBD
```
### Overview
This vignette demonstrates the use of the insitutype algorithm to cluster [CosMx](https://www.nature.com/articles/s41587-022-01483-z) data. CosMx is a spatial molecular imager for sub-cellular resolution of mRNA transcripts and protein targets. We analyze a subset of the [CosMx NSCLC showcase dataset](https://nanostring.com/cosmx-dataset).
### Data preparation
First, we load the package and access the example data:
```{r setup}
library(InSituType)
options(mc.cores = 1)
data("ioprofiles")
data("iocolors")
data("mini_nsclc")
str(mini_nsclc)
set.seed(0)
```
#### Necessary inputs
First, let us look at the data we need.
1. A matrix of counts data, cells x genes:
```{r strcounts}
counts <- mini_nsclc$counts
str(counts)
counts[25:30, 9:14]
```
2. A vector giving each cell mean negative control value:
```{r strnegmean}
negmean <- Matrix::rowMeans(mini_nsclc$neg)
head(negmean)
```
3. Optionally, a vector of each cell expected background counts per gene. If not provided, insitutype will estimate it from the negative controls.
#### Incorporating additional data types
Single-cell spatial transcriptomics datasets offer data types beyond simple gene expression, from single-cell images to spatial contexts.
Insitutype digests this information in the form of "cohorts" of cells with similar data from the alternative data types.
Then, when assigning cell types, Insitutype will considers the frequency of each cell type in each cohort.
For example, if you have a cohort of mainly CD45- cells, Insitutype will see that immune cells are
very uncommon within this cohort, and it will give cells in this cohort lower probabilities of
falling in immune cell clusters. (Insitutype doe not know about CD45 staining; it just considers the
frequency of each cluster in each cohort.)
Cohort definition is left to the user, and many approaches are reasonable.
For example, you might:
* Define cohort by gating cells immunofluorescence stains, e.g. PanCK+/- or CD45+/-
* Cluster cells on their continuous immunofluorescence values
* Use an autoencoder to extract features from cells images, then cluster on those features.
* Use the results of spatial clustering / "niches" to define cohorts
* Cluster on information from multiple data types together, such as the average gene expression
profile of cell neighbors AND immunofluorescence.
Cohorting is a low-risk undertaking, so it is recommended whenever possible.
A poorly-defined cohort will simply be uninformative;
it will not send the clustering algorithm astray.
The more cohorts you can define, and the higher-quality those cohorts, the more information
Insitutype will be able to gleam from the alternative data types.
One warning: **do not define cohorts based on variables you want to study.**
For example, if you wish to study how cell populations shift across spatial context,
then spatial context must not be allowed to inform cell typing. To do so would introduce circularity.
In the below example, we will define cohorts using the "fastCohorting" function.
This function is provided as a convenient way to derive cohorts from diverse data types; it makes no claim to being optimal.
```{r cohorting, echo = TRUE}
# simulate immunofluorescence data:
immunofluordata <- matrix(rpois(n = nrow(counts) * 4, lambda = 100),
nrow(counts))
# perform automatic cohorting:
cohort <- fastCohorting(immunofluordata,
gaussian_transform = TRUE)
# ("Gaussian_transform = TRUE" maps variables to gaussians in order to
# place dramatically different variables on the same scale.)
table(cohort)
```
### Unsupervised clustering:
The below command performs unsupervised clustering.
We will let the algorithm try to select the optimal number of clusters from
a range of plausible options using the n_clusts argument.
Note: this should take a few minutes on a standard laptop.
```{r unsup, echo=TRUE}
unsup <- insitutype(
x = counts,
neg = negmean,
cohort = cohort,
# Enter your own per-cell background estimates here if you have them;
# otherwise insitutype will use the negprobes to estimate background for you.
bg = NULL,
# condensed to save time. n_clusts = 5:20 would be more optimal
n_clusts = c(10, 12),
# NULL value runs unsupervised clustering; entering a matrix here would run
# semi-supervised clustering.
reference_profiles = NULL,
n_phase1 = 200,
n_phase2 = 500,
n_phase3 = 2000,
n_starts = 1,
max_iters = 5
) # choosing inadvisably low numbers to speed the vignette; using the defaults in recommended.
```
The function returns 4 useful outputs: a vector of cells cluster assignments,
a vector of the posterior probability for each cell typing call,
a matrix of cells log-likelihoods under each cluster,
and a matrix of the mean background-subtracted expression profile of each cluster.
A fifth output, ignorable for unsupervised clustering, records log-likelihoods under clusters that ended up with no cells assigned to them.
```{r unsup_results, fig.width = 8, fig.height = 5}
str(unsup)
round(head(unsup$prob), 2)
heatmap(sweep(unsup$profiles, 1, pmax(apply(unsup$profiles, 1, max), .2), "/"), scale = "none",
main = "Cluster mean expression profiles")
# define cluster colors:
cols <-
c(
'#8DD3C7',
'#BEBADA',
'#FB8072',
'#80B1D3',
'#FDB462',
'#B3DE69',
'#FCCDE5',
'#D9D9D9',
'#BC80BD',
'#CCEBC5',
'#FFED6F',
'#E41A1C',
'#377EB8',
'#4DAF4A',
'#984EA3',
'#FF7F00',
'#FFFF33',
'#A65628',
'#F781BF',
'#999999'
)
cols <- cols[seq_along(unique(unsup$clust))]
names(cols) <- unique(unsup$clust)
par(mfrow = c(1, 2))
par(mar = c(0, 0, 3, 0))
plot(mini_nsclc$x, mini_nsclc$y, pch = 16, cex = .75, asp = 1, cex.main = 0.75,
main = "cells in physical space",
col = cols[unsup$clust], xlab = "", ylab = "", xaxt = "n", yaxt = "n")
plot(mini_nsclc$umap, pch = 16, cex = .75, asp = 1, cex.main = 0.75,
main = "cells in UMAP space",
col = cols[unsup$clust], xlab = "", ylab = "", xaxt = "n", yaxt = "n")
legend("bottomleft", pch = 16, col = cols, legend = names(cols), cex = 0.7)
```
### Visualizing clustering results
Insitutype calculates the probability of each cell belonging to each cluster.
This information lets us understand how clusters relate to each other.
The flightpath_plot function arrays cells according to their cluster probabilities:
```{r flightpath, fig.width = 5, fig.height = 5}
# define colors:
cols <-
c(
'#8DD3C7',
'#BEBADA',
'#FB8072',
'#80B1D3',
'#FDB462',
'#B3DE69',
'#FCCDE5',
'#D9D9D9',
'#BC80BD',
'#CCEBC5',
'#FFED6F',
'#E41A1C',
'#377EB8',
'#4DAF4A',
'#984EA3',
'#FF7F00',
'#FFFF33',
'#A65628',
'#F781BF',
'#999999'
)
cols <- cols[seq_along(unique(unsup$clust))]
names(cols) <- unique(unsup$clust)
# make the flightpath plot
fp <- flightpath_plot(flightpath_result = NULL, insitutype_result = unsup, col = cols[unsup$clust])
class(fp)
print(fp)
```
In the above plot, a cell atop a cluster centroid is near-certain, while cells
halfway between two centroids have 50% posterior probability of belonging to each of those two clusters.
The number by each centroid gives the average posterior probability of cells in the cluster.
Expect to see this value above 0.9 for clustering major cell type, and above 0.75 when clustering apart closely-related cell types.
Clusters with average posterior probabilities are unreliable. You can either interpret them with caution,
or you can dissolve them and reassign their cells to other clusters. (The next section covers this operation.)
Here we can see that clusters f is sometimes confused with clusters g and e (though at a very low rate).
You can also extract the flightpath layout for plotting on your own:
```{r fp_layout}
# compute the flightpath layout:
fp_layout <- flightpath_layout(logliks = unsup$logliks, profiles = unsup$profiles)
str(fp_layout)
# plot it:
#par(mar = c(0,0,0,0))
#plot(fp_layout$cellpos, pch = 16, cex = 0.5, col = cols[unsup$clust], xaxt = "n", yaxt = "n", xlab = "", ylab = "")
#text(fp_layout$clustpos, rownames(fp_layout$clustpos), cex = 0.5)
```
#### Updating clustering results
After close review, you might want to update your clustering results.
The function refineCells can modify your results in the following ways:
* merge closely-related or frequently-confused clusters
* delete clusters (Sometimes a "catch-all" cluster arises that is frequently confused wiht multiple other clusters. It is best to delete these clusters and let their cells get reassigned to meaningful clusters.)
* sub-cluster large clusters
Here we use the refineClusters function to perform all these operations.
We also use its "merges" argument to rename cluster c.
```{r refineClusters, fig.width = 5, fig.height = 5}
# refine the clusters:
newclusts <- refineClusters(logliks = unsup$logliks,
merges = c("g" = "newcluster", "e" = "newcluster", "c" = "c_renamed"),
to_delete = "a",
subcluster = c("f" = 2),
counts = counts,
neg = negmean)
str(newclusts)
# plot the updated results:
cols <-
c(
'#8DD3C7',
'#BEBADA',
'#FB8072',
'#80B1D3',
'#FDB462',
'#B3DE69',
'#FCCDE5',
'#D9D9D9',
'#BC80BD',
'#CCEBC5',
'#FFED6F',
'#E41A1C',
'#377EB8',
'#4DAF4A',
'#984EA3',
'#FF7F00',
'#FFFF33',
'#A65628',
'#F781BF',
'#999999'
)
cols <- cols[seq_along(unique(newclusts$clust))]
names(cols) <- unique(newclusts$clust)
fp <- flightpath_plot(flightpath_result = NULL, insitutype_result = newclusts, col = cols[newclusts$clust])
class(fp)
print(fp)
```
### Coloring cell types
The colorCellTypes function attempts to choose good colors for your cell types.
Some of its options use the frequency of cell types in this decision.
Here are 3 versions:
```{r colorCellTypes, fig.width = 8, fig.height = 5}
par(mfrow = c(1, 3))
par(mar = c(0, 0, 3, 0))
plot(mini_nsclc$x, mini_nsclc$y, pch = 16, cex = .75, asp = 1, cex.main = 0.75,
main = "brewers palette",
col = colorCellTypes(freqs = table(unsup$clust), palette = "brewers")[unsup$clust],
xlab = "", ylab = "", xaxt = "n", yaxt = "n")
plot(mini_nsclc$x, mini_nsclc$y, pch = 16, cex = .75, asp = 1, cex.main = 0.75,
main = "tableau20 palette",
col = colorCellTypes(freqs = table(unsup$clust), palette = "tableau20")[unsup$clust],
xlab = "", ylab = "", xaxt = "n", yaxt = "n")
plot(mini_nsclc$x, mini_nsclc$y, pch = 16, cex = .75, asp = 1, cex.main = 0.75,
main = "earthplus palette",
col = colorCellTypes(freqs = table(unsup$clust), palette = "earthplus")[unsup$clust],
xlab = "", ylab = "", xaxt = "n", yaxt = "n")
```
### Session Info
```{r sessioninfo}
sessionInfo()
```