Dimensionality Reduction for Visualization

Reduction of high-dimensional datasets to 2D for visualization & interpretation.

Repository: openproblems-bio/task_dimensionality_reduction

Description

Data visualisation is an important part of all stages of single-cell analysis, from initial quality control to interpretation and presentation of final results. For bulk RNA-seq studies, linear dimensionality reduction techniques such as PCA and MDS are commonly used to visualise the variation between samples. While these methods are highly effective, they can only show the first few components of variation, which cannot fully represent the increased complexity and number of observations in single-cell datasets. For this reason, non-linear techniques (most notably t-SNE and UMAP) have become the standard for visualising single-cell studies. These methods compress a dataset into a two-dimensional space while trying to capture as much of the variance between observations as possible. Many methods for solving this problem now exist. In general, they try to preserve distances, while some additionally consider aspects such as density within the embedded space or conservation of continuous trajectories. Although almost every single-cell study uses one of these visualisations, there has been debate as to whether they can effectively capture the variation in single-cell datasets [@chari2023speciousart].

The dimensionality reduction task attempts to quantify the ability of methods to embed the information present in complex single-cell studies into a two-dimensional space. This task is therefore designed specifically for dimensionality reduction for visualisation and does not consider other uses of dimensionality reduction in standard single-cell workflows, such as improving the signal-to-noise ratio (in fact, several of the methods use PCA as a pre-processing step for this reason). Unlike most tasks, methods for the dimensionality reduction task must accept a matrix containing expression values normalised to 10,000 counts per cell and log-transformed (log-10k) and produce a two-dimensional coordinate for each cell. Pre-normalised matrices are required to enforce consistency between the metric evaluation (which generally requires normalised data) and the method runs. When these are not consistent, methods that use the same normalisation as the metric tend to score more highly. For some methods we also evaluate the pre-processing recommended by the method.
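
As an illustration of the log-10k normalisation described above, the following is a minimal sketch rather than the task's own implementation. It assumes a dense cells-by-genes count matrix and uses the natural logarithm of one plus the scaled value; the function name `log10k_normalise` and the choice of `log1p` are assumptions made for this example.

```python
import numpy as np


def log10k_normalise(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x).

    Hypothetical helper: the use of log1p is an assumption, the text above
    only states that values are log transformed.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Cells with zero total counts are left as all-zero rows.
    safe_totals = np.where(totals == 0, 1.0, totals)
    return np.log1p(counts / safe_totals * target_sum)


# Toy matrix: 3 cells x 3 genes, including one empty cell.
counts = np.array([[1, 3, 0], [10, 0, 10], [0, 0, 0]])
normalised = log10k_normalise(counts)
```

Undoing the log transform with `np.expm1` should give rows that sum to 10,000 for every non-empty cell, which is a quick way to check that a matrix has been normalised this way.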

Authors & contributors

name	roles
Luke Zappia	maintainer, author
Michael Vinyard	author
Michal Klein	author
Scott Gigante	author
Ben DeMeo	author
Robrecht Cannoodt	author
Kai Waldrant	contributor
Sai Nirmayi Yasa	contributor
Juan A. Cordero Varela	contributor

API

flowchart TB
  file_common_dataset("<a href='https://github.com/openproblems-bio/task_dimensionality_reduction#file-format-dataset'>Dataset</a>")
  comp_process_dataset[/"<a href='https://github.com/openproblems-bio/task_dimensionality_reduction#component-type-data-processor'>Data processor</a>"/]
  file_dataset("<a href='https://github.com/openproblems-bio/task_dimensionality_reduction#file-format-dataset'>Dataset</a>")
  file_solution("<a href='https://github.com/openproblems-bio/task_dimensionality_reduction#file-format-solution-data'>Solution data</a>")
  comp_control_method[/"<a href='https://github.com/openproblems-bio/task_dimensionality_reduction#component-type-control-method'>Control method</a>"/]
  comp_method[/"<a href='https://github.com/openproblems-bio/task_dimensionality_reduction#component-type-method'>Method</a>"/]
  comp_metric[/"<a href='https://github.com/openproblems-bio/task_dimensionality_reduction#component-type-metric'>Metric</a>"/]
  comp_process_embedding[/"<a href='https://github.com/openproblems-bio/task_dimensionality_reduction#component-type-process-embedding'>Process embedding</a>"/]
  file_embedding("<a href='https://github.com/openproblems-bio/task_dimensionality_reduction#file-format-embedding'>Embedding</a>")
  file_score("<a href='https://github.com/openproblems-bio/task_dimensionality_reduction#file-format-score'>Score</a>")
  file_processed_embedding("<a href='https://github.com/openproblems-bio/task_dimensionality_reduction#file-format-processed-embedding'>Processed Embedding</a>")
  file_common_dataset---comp_process_dataset
  comp_process_dataset-->file_dataset
  comp_process_dataset-->file_solution
  file_dataset---comp_control_method
  file_dataset---comp_method
  file_solution---comp_control_method
  file_solution---comp_metric
  file_solution---comp_process_embedding
  comp_control_method-->file_embedding
  comp_method-->file_embedding
  comp_metric-->file_score
  comp_process_embedding-->file_processed_embedding
  file_embedding---comp_process_embedding
  file_processed_embedding---comp_metric

File format: Dataset

The dataset to pass to a method.

Example file: resources_test/common/cxg_mouse_pancreas_atlas/dataset.h5ad

Format:

AnnData object
 obs: 'cell_type'
 var: 'hvg_score'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id'

Data structure:

Slot	Type	Description
`obs["cell_type"]`	`string`	Classification of the cell type based on its characteristics and function within the tissue or organism.
`var["hvg_score"]`	`double`	High variability gene score (normalized dispersion). The greater, the more variable.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["dataset_name"]`	`string`	Nicely formatted name.
`uns["dataset_url"]`	`string`	(Optional) Link to the original source of the dataset.
`uns["dataset_reference"]`	`string`	(Optional) BibTeX reference of the paper in which the dataset was published.
`uns["dataset_summary"]`	`string`	Short description of the dataset.
`uns["dataset_description"]`	`string`	Long description of the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["normalization_id"]`	`string`	Which normalization was used.

Component type: Data processor

A dimensionality reduction dataset processor.

Arguments:

Name	Type	Description
`--input`	`file`	The dataset to pass to a method.
`--output_dataset`	`file`	(Output) The dataset to pass to a method.
`--output_solution`	`file`	(Output) The data for evaluating a dimensionality reduction.

File format: Dataset

The dataset to pass to a method.

Example file: resources_test/task_dimensionality_reduction/cxg_mouse_pancreas_atlas/dataset.h5ad

Format:

AnnData object
 var: 'hvg_score'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'normalization_id'

Data structure:

Slot	Type	Description
`var["hvg_score"]`	`double`	High variability gene score (normalized dispersion). The greater, the more variable.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["normalization_id"]`	`string`	Which normalization was used.

File format: Solution data

The data for evaluating a dimensionality reduction.

Example file: resources_test/task_dimensionality_reduction/cxg_mouse_pancreas_atlas/solution.h5ad

Format:

AnnData object
 obs: 'cell_type', 'is_waypoint'
 var: 'hvg_score'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'between_waypoint_distances', 'label_centroids', 'waypoint_centroid_distances', 'between_centroid_distances'

Data structure:

Slot	Type	Description
`obs["cell_type"]`	`string`	Ground truth cell type based on a cell's characteristics and function within the tissue or organism.
`obs["is_waypoint"]`	`boolean`	Whether or not this cell is a waypoint used for some metric calculations.
`var["hvg_score"]`	`double`	High variability gene score (normalized dispersion). The greater, the more variable.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["dataset_name"]`	`string`	Nicely formatted name.
`uns["dataset_url"]`	`string`	(Optional) Link to the original source of the dataset.
`uns["dataset_reference"]`	`string`	(Optional) BibTeX reference of the paper in which the dataset was published.
`uns["dataset_summary"]`	`string`	Short description of the dataset.
`uns["dataset_description"]`	`string`	Long description of the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["normalization_id"]`	`string`	Which normalization was used.
`uns["between_waypoint_distances"]`	`double`	Euclidean distances between waypoint cells.
`uns["label_centroids"]`	`double`	Centroid positions of each label in the normalized expression space.
`uns["waypoint_centroid_distances"]`	`double`	Euclidean distances from waypoint cells to label centroids.
`uns["between_centroid_distances"]`	`double`	Euclidean distances between label centroids.
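
To make the four distance-related slots above more concrete, here is a rough sketch of how such quantities could be computed from a normalised matrix, cell type labels and a waypoint mask. It is not the repository's data processor: the function names, the dense-array layout and the ordering of labels (sorted alphabetically) are assumptions, and how waypoints are chosen is not covered.

```python
import numpy as np


def pairwise_euclidean(a, b):
    """Euclidean distance between every row of `a` and every row of `b`."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def solution_distances(normalized, cell_type, is_waypoint):
    """Hypothetical helper mirroring the slot names of the solution file."""
    labels = np.unique(cell_type)  # sorted unique labels (an assumption)
    # One centroid per label: the mean expression profile of its cells.
    centroids = np.vstack(
        [normalized[cell_type == label].mean(axis=0) for label in labels]
    )
    waypoints = normalized[is_waypoint]
    return {
        "between_waypoint_distances": pairwise_euclidean(waypoints, waypoints),
        "label_centroids": centroids,
        "waypoint_centroid_distances": pairwise_euclidean(waypoints, centroids),
        "between_centroid_distances": pairwise_euclidean(centroids, centroids),
    }


# Toy data: 4 cells x 2 genes, two labels, two waypoint cells.
normalized = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
cell_type = np.array(["a", "a", "b", "b"])
is_waypoint = np.array([True, False, True, False])
distances = solution_distances(normalized, cell_type, is_waypoint)
```

Restricting the cell-to-cell distances to waypoints keeps the matrices at waypoints-by-waypoints and waypoints-by-labels in size rather than cells-by-cells.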

Component type: Control method

Quality control methods for verifying the pipeline.

Arguments:

Name	Type	Description
`--input`	`file`	The dataset to pass to a method.
`--input_solution`	`file`	The data for evaluating a dimensionality reduction.
`--output`	`file`	(Output) A dataset with dimensionality reduction embedding.

Component type: Method

A dimensionality reduction method.

Arguments:

Name	Type	Description
`--input`	`file`	The dataset to pass to a method.
`--output`	`file`	(Output) A dataset with dimensionality reduction embedding.
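
The core of a method component can be sketched as a function that maps the normalised matrix to two coordinates per cell. The example below uses a plain PCA via singular value decomposition purely as an illustration; it is not one of the task's registered methods. The names `pca_embedding`, `run_method` and the `"pca_sketch"` identifier are hypothetical, and the result is returned as plain dictionaries that mirror the `obsm` and `uns` slots of the Embedding file format instead of a real AnnData object.

```python
import numpy as np


def pca_embedding(normalized, n_components=2):
    """Project cells onto the first `n_components` principal components."""
    x = np.asarray(normalized, dtype=float)
    centred = x - x.mean(axis=0)
    # Rows of u scaled by the singular values are the PCA scores.
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def run_method(normalized, dataset_id, normalization_id, method_id="pca_sketch"):
    """Hypothetical wrapper returning the slots expected in an embedding file."""
    return {
        "obsm": {"X_emb": pca_embedding(normalized)},
        "uns": {
            "dataset_id": dataset_id,
            "method_id": method_id,
            "normalization_id": normalization_id,
        },
    }


# Toy input: 20 cells x 5 genes of random values standing in for layers["normalized"].
rng = np.random.default_rng(0)
toy_normalized = rng.normal(size=(20, 5))
result = run_method(toy_normalized, "toy_dataset", "example_normalization")
```

In an actual component, the input would be read from the file given by `--input` and the result written to the file given by `--output`; the file handling is left out here to keep the sketch self-contained.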

Component type: Metric

A dimensionality reduction metric.

Arguments:

Name	Type	Description
`--input_embedding`	`file`	A dataset with dimensionality reduction embedding that has been processed to add information required by metrics.
`--input_solution`	`file`	The data for evaluating a dimensionality reduction.
`--output`	`file`	(Output) Metric score file.

Component type: Process embedding

A dimensionality reduction embedding processor.

Arguments:

Name	Type	Description
`--input_embedding`	`file`	A dataset with dimensionality reduction embedding.
`--input_solution`	`file`	The data for evaluating a dimensionality reduction.
`--output`	`file`	(Output) A dataset with dimensionality reduction embedding that has been processed to add information required by metrics.

File format: Embedding

A dataset with dimensionality reduction embedding.

Example file: resources_test/task_dimensionality_reduction/cxg_mouse_pancreas_atlas/embedding.h5ad

Format:

AnnData object
 obsm: 'X_emb'
 uns: 'dataset_id', 'method_id', 'normalization_id'

Data structure:

Slot	Type	Description
`obsm["X_emb"]`	`double`	The dimensionally reduced embedding.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["method_id"]`	`string`	A unique identifier for the method.
`uns["normalization_id"]`	`string`	Which normalization was used.

File format: Score

Metric score file.

Example file: resources_test/task_dimensionality_reduction/cxg_mouse_pancreas_atlas/score.h5ad

Format:

AnnData object
 uns: 'dataset_id', 'normalization_id', 'method_id', 'metric_ids', 'metric_values'

Data structure:

Slot	Type	Description
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["normalization_id"]`	`string`	Which normalization was used.
`uns["method_id"]`	`string`	A unique identifier for the method.
`uns["metric_ids"]`	`string`	One or more unique metric identifiers.
`uns["metric_values"]`	`double`	The metric values obtained for the given prediction. Must be the same length as `uns["metric_ids"]`.
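
The pairing between `metric_ids` and `metric_values` can be illustrated with a small sketch. It builds the `uns` content of a score file as a plain dictionary and checks the length constraint; the helper names and the metric identifiers `metric_a` and `metric_b` are placeholders invented for this example, not metrics defined by the task.

```python
def make_score_uns(dataset_id, normalization_id, method_id, metrics):
    """Build the `uns` slots of a score file from a {metric_id: value} mapping."""
    metric_ids = list(metrics)
    # Keep values in the same order as the identifiers.
    metric_values = [float(metrics[metric_id]) for metric_id in metric_ids]
    return {
        "dataset_id": dataset_id,
        "normalization_id": normalization_id,
        "method_id": method_id,
        "metric_ids": metric_ids,
        "metric_values": metric_values,
    }


def check_score_uns(uns):
    """Raise ValueError if identifiers and values do not line up."""
    if len(uns["metric_ids"]) != len(uns["metric_values"]):
        raise ValueError("metric_ids and metric_values must have the same length")


# Placeholder identifiers and values, for illustration only.
score_uns = make_score_uns(
    "toy_dataset",
    "example_normalization",
    "pca_sketch",
    {"metric_a": 0.5, "metric_b": 1},
)
check_score_uns(score_uns)
```

Building both lists from a single mapping makes it hard for the two to drift out of step, which is the failure the length requirement guards against.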

File format: Processed Embedding

A dataset with dimensionality reduction embedding that has been processed to add information required by metrics.

Example file: resources_test/task_dimensionality_reduction/cxg_mouse_pancreas_atlas/processed_embedding.h5ad

Format:

AnnData object
 obsm: 'X_emb'
 uns: 'dataset_id', 'method_id', 'normalization_id', 'between_waypoint_distances', 'label_centroids', 'waypoint_centroid_distances', 'between_centroid_distances'

Data structure:

Slot	Type	Description
`obsm["X_emb"]`	`double`	The dimensionally reduced embedding.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset.
`uns["method_id"]`	`string`	A unique identifier for the method.
`uns["normalization_id"]`	`string`	Which normalization was used.
`uns["between_waypoint_distances"]`	`double`	Euclidean distances between waypoint cells.
`uns["label_centroids"]`	`double`	Centroid positions of each label in the normalized expression space.
`uns["waypoint_centroid_distances"]`	`double`	Euclidean distances from waypoint cells to label centroids.
`uns["between_centroid_distances"]`	`double`	Euclidean distances between label centroids.
