What do cell profiles tell us about biology and disease?


By studying microscopic images of specimens of tissue, like skin or organ resections, pathologists and scientists draw inferences about the way that cells coordinate to set biological processes in motion and how these processes are disrupted in the course of disease.

The taxonomy of cell types and their functional states is surprisingly diverse, and modeling biological processes at the cellular level is consequently a rich source of new insights. Imaging methods are needed that capture some of this diversity, by measuring multiple channels of information at the same time for each cell, to provide empirical data that ensures this modeling makes sense in realistic scenarios.

Multiple-channel imaging technology capable of measuring several dozen protein targets is reaching maturity. Multiplexed immunofluorescence, imaging mass cytometry, and their variants yield data similar to that of flow cytometry or single-cell RNA-seq: measurements are made at the single-cell level and involve multiple quantitative features. The crucial advantage is that cell positions are also observed, providing spatial context.

The Spatial Multiomics Profiler (SMProfiler) project is about making the most of this informative data source. The guiding principles are:


⚡ High availability: Datasets should be available for analysis immediately with the widest range of tools. Preprocessing and indexing should be done in advance as much as possible.
🔁 Reproducible analysis: Results and findings should be based on analyses that others can easily recreate in their entirety.
💻 No code: The tools should be usable by investigators without doing any programming and without the need for specialized knowledge of computer systems.
✅ Uniform data management: Datasets should be organized with high semantic integrity, to ensure that analysis can be performed on them in a consistent way and that the conclusions drawn are valid.

SMProfiler is available to the public at smprofiler.io.

User tutorial

Example: Exploratory data analysis of immunotherapy response in melanoma

Select a study
Choose cell phenotypes
Aggregate cell population fractions and overlaps
Check per-sample values
Assess phenotype fractions between cohorts
Assess ratios between cohorts
Open slide viewer
Region selection and UMAP visualization

1. Select a study

On the main page, select Melanoma CyTOF ICI. This brings up a dataset that was collected and published by Moldoveanu et al.¹.

You'll see a summary of this dataset, including the numbers of samples, cells, and channels, links to relevant publications, classification of the samples, and highlighted findings that can be observed by using the SMProfiler application. In this case the study collected samples from patients treated with immune-checkpoint inhibitor therapy, and the patients either responded favorably or poorly to this treatment.

2. Choose cell phenotypes

On the next page you can choose which cell phenotypes you want to focus on. Click one of the pre-defined phenotypes, or define a custom phenotype by indicating positive and negative markers from among the channels which were imaged.

We select five custom phenotypes. The first phenotype, for example, was defined by clicking the + beside CD3+, then clicking Add to selection. This generally indicates the T cells. The second phenotype is CD3+ CD4+, the markers of T helper cells. We also include: CD3+ CD8A+, CD3+ CD4+ FOXP3+, and CD20+ CD3-. We are ascertaining the rough profile of lymphocytes in the dataset.
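As a rough illustration of what a phenotype definition means, the sketch below treats each cell as the set of channels for which it is positive and counts the cells matching each signature. The cell data and the `matches` helper are hypothetical, invented for this example; this is not the SMProfiler implementation.

```python
# Hypothetical toy data: each cell is the set of channels it is positive for.
cells = [
    {"CD3", "CD4"},
    {"CD3", "CD8A"},
    {"CD3", "CD4", "FOXP3"},
    {"CD20"},
    {"CD3", "CD20"},
    set(),
]


def matches(cell, positive, negative=()):
    """True if the cell has every positive marker and none of the negative ones."""
    return set(positive) <= cell and not (set(negative) & cell)


# The five custom phenotypes from the tutorial, as (positive, negative) marker lists.
phenotypes = {
    "CD3+": (["CD3"], []),
    "CD3+ CD4+": (["CD3", "CD4"], []),
    "CD3+ CD8A+": (["CD3", "CD8A"], []),
    "CD3+ CD4+ FOXP3+": (["CD3", "CD4", "FOXP3"], []),
    "CD20+ CD3-": (["CD20"], ["CD3"]),
}

counts = {
    name: sum(matches(cell, pos, neg) for cell in cells)
    for name, (pos, neg) in phenotypes.items()
}
print(counts)
```

Note that the cell positive for both CD3 and CD20 counts toward CD3+ but not toward CD20+ CD3-, because of the negative marker.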

3. Aggregate cell population fractions and overlaps

The next page shows the cell population breakdown with respect to the phenotypes we've just selected. Each phenotype is shown with the fraction of cells expressing that phenotype across all samples; for example, 54.02% of cells are indicated as T cells.

In the grid, each pair of phenotypes is shown with the fraction of cells expressing both phenotypes. For example, the fraction of cells that are both CD3+ CD4+ FOXP3+ and CD3+ is 16.53%, the same as the fraction of cells that are CD3+ CD4+ FOXP3+, as expected since CD3+ is part of the signature of this phenotype (the T regulatory cells).
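The relationship between single-phenotype fractions and pairwise overlaps can be sketched as follows. This is a minimal example with made-up cells and positive-marker-only signatures; the names `signatures`, `single`, and `pairs` are assumptions for illustration and do not reflect how the application computes its values.

```python
from itertools import combinations

# Hypothetical toy data: each cell is the set of channels it is positive for.
cells = [
    {"CD3", "CD4"},
    {"CD3", "CD8A"},
    {"CD3", "CD4", "FOXP3"},
    {"CD20"},
]

# Signatures with positive markers only, to keep the sketch short.
signatures = {
    "T": {"CD3"},
    "Th": {"CD3", "CD4"},
    "Treg": {"CD3", "CD4", "FOXP3"},
}


def fraction(required):
    """Fraction of all cells that are positive for every marker in `required`."""
    return sum(required <= cell for cell in cells) / len(cells)


single = {name: fraction(markers) for name, markers in signatures.items()}

# A cell expresses both phenotypes exactly when it has the union of their markers.
pairs = {
    (a, b): fraction(signatures[a] | signatures[b])
    for a, b in combinations(signatures, 2)
}
print(single)
print(pairs)
```

Because the T signature is contained in the Treg signature, the overlap of the two equals the Treg fraction, which is the same effect described above for CD3+ and CD3+ CD4+ FOXP3+.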

Note

📊 You could use this technique to make a standard heat map for assessment of clusters, by selecting all single-channel phenotypes. Since these metrics are computed live, depending on the size of the samples and the number of selected markers, this could take a few minutes.

4. Check per-sample values

To continue with a finer analysis, click one of the "tiles", either for one phenotype (the tiles on the left) or two phenotypes (the grid on the right).

We choose the tile at row CD3+ CD4+ FOXP3+ (Treg) and column CD3+ CD8A+ (Tc). The table below populates with the size of the population of cells expressing both signatures, broken down by sample. In reality, few cells are expected to express both of these suites of markers, and the few occurring here are probably the result of imperfect dichotomization of stain intensity (thresholding, gating). This tool can therefore be used for basic quality control when some marker combinations are known in advance to be plausible or implausible.

We also selected the single-phenotype tiles CD3+ CD4+ FOXP3+ and CD3+ CD8A+.

5. Assess phenotype fractions between cohorts

Click on the column header CD3+ CD8A+ (it becomes underlined to indicate that it is selected). Then select the two cohorts by clicking one of the 1 values and one of the 2 values. A "verbalization" appears which states that the trend, according to a t-test, is that the fraction of Tc cells is increased about 1.5 times in the non-responder cohort compared to the responders, with statistical significance value p=0.1.
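The kind of comparison behind such a verbalization can be sketched with per-sample fractions for two cohorts. The values below are invented, and the sketch assumes Welch's unequal-variance form of the t statistic; the tutorial does not say which variant of the t-test the application uses. A p-value would additionally require the t distribution, for example via `scipy.stats.ttest_ind`.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical per-sample Tc fractions for each cohort (not real study data).
responders = [0.10, 0.12, 0.08, 0.10]
non_responders = [0.15, 0.17, 0.13, 0.15]


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Ratio of cohort means: how many times larger the fraction is in non-responders.
fold_change = mean(non_responders) / mean(responders)
t_statistic = welch_t(non_responders, responders)
print(f"fold change {fold_change:.2f}, t = {t_statistic:.2f}")
```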

6. Assess ratios between cohorts

We click on column CD3+ CD4+ FOXP3+, in addition to the prior selection. A similar assessment appears, this time with respect to the ratio of the number of CD3+ CD8A+ (the first selection) to CD3+ CD4+ FOXP3+ (the second selection).

7. Open slide viewer

Let's focus our attention on one of the samples that exhibited a large fraction of Tc cells. Click 31RD.

The "virtual slide viewer" opens. Choose a few phenotypes, and the corresponding cells will become highlighted. The fraction and count of the cells for each phenotype are shown.

8. Region selection and UMAP visualization

A UMAP dimensional reduction of the cell set across the whole data collection is available in this case. Click UMAP.

Note

🔍 You can zoom and pan the view using scroll and click-and-drag.

We spot a region that looks "saturated" with Tc cells. Select it by clicking and dragging the mouse while holding either the Ctrl key or (on Mac) CMD.

The new cell count for each phenotype is now shown, together with the new percentage, relative to the selection. In this case the Tc fraction approximately doubled, to 5659 cells (shown in green). This increase is assessed using the Fisher test (the entire contingency table is also shown, for reference). The test verifies that the increase is highly statistically significant in this case, as expected.
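For reference, a one-sided Fisher exact test on a 2x2 contingency table can be computed directly from the hypergeometric distribution. The sketch below is a generic textbook version with a hypothetical function name and made-up counts; whether the application reports a one-sided or two-sided p-value is not stated here.

```python
from math import comb


def fisher_enrichment_p(in_pos, in_neg, out_pos, out_neg):
    """One-sided Fisher exact p-value for enrichment inside a selection.

    in_pos / in_neg: selected cells that do / do not have the phenotype.
    out_pos / out_neg: unselected cells that do / do not have the phenotype.
    Returns the probability of seeing at least `in_pos` positive cells in the
    selection if the selection were drawn at random from all cells.
    """
    selected = in_pos + in_neg
    positives = in_pos + out_pos
    total = selected + out_pos + out_neg
    upper = min(selected, positives)
    favorable = sum(
        comb(positives, k) * comb(total - positives, selected - k)
        for k in range(in_pos, upper + 1)
    )
    return favorable / comb(total, selected)


# Tiny worked table: 3 of 4 selected cells are positive, 1 of 4 unselected cells.
p_value = fisher_enrichment_p(3, 1, 1, 3)
print(round(p_value, 4))
```

For this small table the value is 17/70, since 17 of the 70 equally likely ways of choosing 4 cells out of 8 contain at least 3 positives.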

Note

By careful use of the selection tool, noting enrichments in each virtual region, you can account for most of the cell types present and hone the focus of study.

Example: Spatially-informed metrics

Compute a cell-set-to-cell-set proximity metric in realtime
Save and share results

1. Compute a cell-set-to-cell-set proximity metric in realtime

Let's see an example of quantification over samples that makes use of the spatial arrangement of cells.

Using the same dataset as the previous example, Melanoma CyTOF ICI, choose the phenotypes Naive cytotoxic T cell and T helper cell antigen-experienced. Select the tile with row T helper cell antigen-experienced and column Naive cytotoxic T cell, representing the pair of phenotypes.

In the column header that appears, click >. The spatial metrics dropdown appears. Click v to show the available metrics. Choose cell-to-cell proximity. After the metric is finished computing, click the column header cell-to-cell proximity and the two cohorts 1 and 2 to perform a univariate comparison.

This metric is the average number of Naive cytotoxic T cells appearing within a specified radius of given T helper antigen-experienced cells. It measures generally how common it is to find cells of one phenotype in close proximity to those of another phenotype. There are several other metrics available, of various degrees of statistical sophistication, many computed using the Squidpy package. These are explained in more detail in the API documentation.
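The definition above can be written out as a brute-force computation. This is a minimal sketch with hypothetical coordinates and a hypothetical `proximity` function; it assumes that cells exactly at the radius are counted, and it is not the optimized implementation used by the application.

```python
from math import dist


def proximity(source_cells, target_cells, radius):
    """Average number of target cells within `radius` of each source cell."""
    if not source_cells:
        return 0.0
    total = sum(
        sum(1 for target in target_cells if dist(source, target) <= radius)
        for source in source_cells
    )
    return total / len(source_cells)


# Hypothetical (x, y) positions for two phenotypes in one sample.
helper_cells = [(0.0, 0.0), (10.0, 0.0)]
naive_tc_cells = [(1.0, 0.0), (0.0, 2.0), (10.0, 1.0), (50.0, 50.0)]

value = proximity(helper_cells, naive_tc_cells, radius=3.0)
print(value)
```

Here the first helper cell has two neighbors within the radius and the second has one, so the metric is 1.5. Note that the metric is not symmetric in the two phenotypes.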

2. Save and share results

📋 You can share or save results like this for later by copying the URL in the address bar. In fact, this result is highlighted on the study summary page. Try reproducing it by following the first link as shown below.

Example: Submit significant results with attribution

Identify a specific result
Submit for attribution using ORCID
Review contribution on study page

1. Identify a specific result

SMProfiler supports extensive secondary analysis, with features computed on demand as you request them during exploratory data analysis. There is therefore a good chance that you will discover a significant new result relating specific cell phenotypes or spatial statistics to treatment response or prognosis.

Most of the time that you are using the application, you can use the URL as described above in Save and share results to share what you have found with others.

Note

However, for the precise, quantitative results found in the analysis tab, with statistical assessment, you can also submit your finding for official recognition on the summary page for the study.

2. Submit for attribution using ORCID

You can send your contribution using Submit significant result. ORCID researcher identifiers are used to attribute the result to you.

3. Review contribution on study page

Once your contribution has been checked by the SMProfiler team, it is posted on the summary page for the study as in the example below.

Example: B cell aggregation in colon cancer

Observe tissue geometry patterning
Assess region enrichment with Fisher test

1. Observe tissue geometry patterning

Select study HTAN Orion CRC and phenotypes:

T cytotoxic
Epithelium
B cell

Review slide C12 and observe the differing tissue localization.

2. Assess region enrichment with Fisher test

A cluster of B cells is apparent, which we can assess by selecting this region with the drawing tool.

The assessment shows 25% baseline prevalence of B cells in this slide, elevated to 76% in the selected region. The Fisher test contingency table is shown.
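The comparison of baseline prevalence with prevalence inside a drawn region can be sketched as follows. The cell positions, the rectangular region, and the helper names are hypothetical and chosen only so that the arithmetic is easy to follow; the drawing tool in the application is not restricted to rectangles.

```python
# Hypothetical cells as (x, y, is_b_cell); not data from the study.
cells = [
    (1, 1, True), (2, 1, True), (1, 2, True), (2, 2, False),
    (8, 8, False), (9, 8, False), (8, 9, False), (9, 9, False),
    (5, 5, False), (6, 5, False), (5, 6, False), (6, 6, False),
]


def in_rectangle(cell, x_min, y_min, x_max, y_max):
    """True if the cell position lies inside the axis-aligned rectangle."""
    x, y, _ = cell
    return x_min <= x <= x_max and y_min <= y <= y_max


def prevalence(subset):
    """Fraction of cells in `subset` flagged as B cells."""
    return sum(is_b for _, _, is_b in subset) / len(subset)


selected = [cell for cell in cells if in_rectangle(cell, 0, 0, 3, 3)]
baseline = prevalence(cells)
in_region = prevalence(selected)
print(baseline, in_region)
```

The counts of B cells and other cells inside and outside the selection form the 2x2 contingency table to which the Fisher test is applied.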

Example: Channel intensity for phenotyping in bone marrow

Select the Bone marrow aging study², channel CD61 with additional phenotype Megakaryocyte, and in the Slide Viewer select sample WCM10. In this study, a detailed model was trained to detect specific cell types from a number of imaging features. Megakaryocytes were associated with elevated CD61 levels, and in this example we can compare the Megakaryocyte assignments with the CD61 expression levels by using the channel intensity threshold adjustment.

Example: Graph Neural Network detects motifs in immunotherapy non-responders

For this example, select study Melanoma intralesional IL2. The study assesses responders and non-responders to interleukin-2 injection immunotherapy.

1. Train the GNN

The information provided by spatial cell position can be used to define graphs or networks of neighboring cells. These cell graphs together with the multiple marker quantification, and response indication per-patient, are used to train a Graph Neural Network to predict this response. For more information on creating these models, see the documentation.
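One simple way to turn cell positions into a graph is to connect every pair of cells closer than a fixed distance. The sketch below does this by brute force; the function name, the distance threshold, and the choice of a radius graph (rather than, say, k nearest neighbors) are assumptions for illustration and may differ from how the project builds its cell graphs.

```python
from itertools import combinations
from math import dist


def build_cell_graph(positions, max_distance):
    """Adjacency sets linking cells whose centers are within `max_distance`."""
    neighbors = {index: set() for index in range(len(positions))}
    for i, j in combinations(range(len(positions)), 2):
        if dist(positions[i], positions[j]) <= max_distance:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


# Hypothetical (x, y) cell centers: three clustered cells and one distant cell.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
graph = build_cell_graph(positions, max_distance=1.5)
print(graph)
```

In a training setup, each node would also carry the cell's marker quantification as features, and each graph would carry the response label of the patient it came from.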

2. Profile of pertinent sub-networks by cohort

The 100 cells in each slide that are most important to the classification, according to the GNN model, were assessed for phenotype composition. In the plot below, the results of a Fisher test for over-representation are shown, for each phenotype and each slide. The circle size indicates the p-value, ranging from p=0 at the largest radius to p=0.05 at the smallest (radius 0). The red color indicates the fraction of the 100 most-important cells in the sample which belong to the given phenotype, with 100% corresponding to greatest saturation and 0% corresponding to white.

Some of the most significant patterns are high over-representation of Adipocytes or Langerhans cells (defined largely by S100B), among the non-responders.

Whole-database assessments for outcome associations

Using all single marker cell phenotypes, frequency
Using all marker pairs, spatial proximity

The datasets transformed and curated for the SMProfiler database are well-harmonized with each other, so that cross-cutting queries and whole-database surveys are readily performed.

1. Using all single marker cell phenotypes, frequency

In total, around 50 markers were imaged across the 12 studies currently available. The all-markers overview assesses each marker for its utility in discriminating between key outcome cohorts within each dataset, using the fraction of cells expressing the marker. A t-test provides a sense of the overall strength and statistical significance of any association found. In the plot (previewed below), each color corresponds to one sample cohort, and the relative size of the circles in a pair indicates the effective differential between the two cohorts for the given marker.

For example, in the row for antigen-experience-indicating marker CD45RO for the Head and neck mpIF study column, the circles plot (green and gray) shows that about 50% more experienced lymphocytes are found in the samples from patients who will clear the disease (cancer) compared with patients who will not.

2. Using all marker pairs, spatial proximity

We can involve the spatial context in our whole-database assessment by computing the cell-set-to-cell-set proximity metric for each pair of markers. The cohort discriminations provided by such marker pairs are expected to augment the results identified using dissociated cell sets defined by single markers (the fractions features), since they are based on an independent source of information.

As an example, in the row for CD31 (typically indicating endothelium) and column FOXP3 (indicating T regulatory cells), the circle plot (light and dark red) suggests that T regs are found near endothelial cells about twice as frequently in glioblastomas of patients who will not survive to 1 year post-surgery, compared with those surviving 1-3 years.

Data management

To support this project's semantic integrity goals, we designed a general data model and ontology for cell-resolved measurement studies, using a schema-authoring system we call the Application Data Interface (ADI) framework.

The schema is called scstudies and it is documented in detail here.

In our implementation, we sought to strike an effective balance between the completeness of annotation demanded by accurate record-keeping, on the one hand, and practicality and computational efficiency on the other. Much of the application is organized around a SQL database with a schema that conforms tightly to the formal scstudies data model, but we also make liberal use of derivative data artifacts to improve speed and performance. For example, a highly-compressed binary format is adopted for transmission of a given sample's cell-feature matrix.
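To give a sense of why a binary encoding can be compact, the sketch below packs dichotomized (0/1) marker values into one 64-bit word per cell. This layout is purely hypothetical and is not the format SMProfiler uses; the function names and the 64-channel limit are assumptions made for the example.

```python
import struct


def pack_cells(rows):
    """Pack rows of 0/1 marker values into 8 bytes per cell (bit i = channel i)."""
    blob = bytearray()
    for row in rows:
        if len(row) > 64:
            raise ValueError("this sketch supports at most 64 channels")
        word = 0
        for bit, value in enumerate(row):
            if value:
                word |= 1 << bit
        blob += struct.pack("<Q", word)  # little-endian unsigned 64-bit integer
    return bytes(blob)


def unpack_cells(blob, channel_count):
    """Inverse of pack_cells for a known number of channels."""
    return [
        [(word >> bit) & 1 for bit in range(channel_count)]
        for (word,) in struct.iter_unpack("<Q", blob)
    ]


# Hypothetical matrix: 3 cells by 4 channels.
matrix = [[1, 0, 1, 0], [0, 0, 0, 1], [1, 1, 1, 1]]
packed = pack_cells(matrix)
restored = unpack_cells(packed, channel_count=4)
print(len(packed), restored == matrix)
```

With this layout a sample of one million cells would occupy 8 MB regardless of how many of the (up to 64) channels are used, before any further compression.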

Similarly, datasets that we have curated for uniform data import are stored in a simple tabular file format which does not generally support all the features of the scstudies model. This intermediary format is designed for ease of creation and it is not entirely formalized. For an example, see data_curation/.

CLI command reference

The Python package smprofiler is released on PyPI, so it can be installed with

python -m pip install smprofiler

Installation makes several commands available in the shell. List them with smprofiler:

$ smprofiler

smprofiler apiserver dump-schema

smprofiler graphs create-specimen-graphs
smprofiler graphs explore-classes
smprofiler graphs extract
smprofiler graphs finalize-graphs
smprofiler graphs generate-graphs
smprofiler graphs plot-importance-fractions
smprofiler graphs plot-interactives
smprofiler graphs prepare-graph-creation
smprofiler graphs upload-importances

smprofiler db cache-subsample
smprofiler db collection
smprofiler db count-cells
smprofiler db delete-feature
smprofiler db do-fractions-tests
smprofiler db drop
smprofiler db drop-ondemand-computations
smprofiler db guess-channels-from-object-files
smprofiler db interactive-uploader
smprofiler db list-studies
smprofiler db load-testing
smprofiler db retrieve-feature-matrices
smprofiler db review-submissions
smprofiler db status
smprofiler db sync-annotations
smprofiler db upload-sync-small

smprofiler ondemand assess-recreate-cache
smprofiler ondemand start

smprofiler workflow aggregate-core-results
smprofiler workflow configure
smprofiler workflow core-job
smprofiler workflow generate-run-information
smprofiler workflow initialize
smprofiler workflow merge-performance-reports
smprofiler workflow report-run-configuration
smprofiler workflow tail-logs

Each command prints its documentation when given the --help option.

Several commands are mainly for use internal to the application components.

Some others are TUIs (Terminal User Interfaces) meant to make common tasks, like uploading datasets or inspecting cache or metadata, more reliable.

Dataset uploader

smprofiler db interactive-uploader is a TUI that automatically determines available data sources and targets after you have created or located source datasets (format: data_curation/). It looks for database configuration files named ~/.smprofiler_db.config.*, checks the environment variable SMPROFILER_S3_BUCKET, and searches recursively for datasets in the current working directory named generated_artifacts. It presents available options and initiates the upload process.
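The discovery steps described above can be approximated in a few lines. This is a sketch only: the function `discover_upload_options` is hypothetical, it assumes that `generated_artifacts` entries are directories, and it does not reproduce the actual logic of the uploader. The demonstration uses a temporary directory standing in for the home and working directories.

```python
import tempfile
from pathlib import Path


def discover_upload_options(home, workdir, environ):
    """Collect candidate database configs, the S3 bucket, and dataset directories."""
    config_files = sorted(Path(home).glob(".smprofiler_db.config.*"))
    bucket = environ.get("SMPROFILER_S3_BUCKET")
    dataset_dirs = sorted(
        path for path in Path(workdir).rglob("generated_artifacts") if path.is_dir()
    )
    return config_files, bucket, dataset_dirs


# Demonstration in a throwaway directory tree; in real use, pass Path.home(),
# Path.cwd(), and os.environ instead.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp) / "home"
    work = Path(tmp) / "work"
    home.mkdir()
    (home / ".smprofiler_db.config.local").write_text("")
    (work / "study_a" / "generated_artifacts").mkdir(parents=True)

    configs, bucket, datasets = discover_upload_options(
        home, work, {"SMPROFILER_S3_BUCKET": "example-bucket"}
    )
    found = (
        [path.name for path in configs],
        bucket,
        [path.parent.name for path in datasets],
    )

print(found)
```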

Example usage is shown below.

The ETL (Extract/Transform/Load) process includes a number of data integrity checks and the creation of several intermediate data artifacts.

API reference

The SMProfiler application is supported by a web API, which provides fine-grained access to specific components of a given dataset. The API is documented here.

Testing, development, and maintenance

See docs/maintenance.md.

Deployment options

For assistance setting up a deployment of the SMProfiler application for your institution or lab, send us an email at nadeems@mskcc.org.

The application can be deployed in several ways:

As manually-managed services on a single server
Using Docker compose
On a Kubernetes cluster using a cloud provider

License

Funding

This work is funded by the 7-year NIH/NCI R37 MERIT Award (R37CA295658).

¹ Moldoveanu et al. Spatially mapping the immune landscape of melanoma using imaging mass cytometry.
² Sarachakov et al. Spatial mapping of human hematopoiesis at single-cell resolution reveals aging-associated topographic remodeling.
