Denoising

Removing noise in sparse single-cell RNA-sequencing count data

Repository: openproblems-bio/task_denoising

Description

A key challenge in evaluating denoising methods is the general lack of a ground truth. A recent benchmark study (Hou et al., 2020) relied on flow-sorted datasets, mixture control experiments (Tian et al., 2019), and comparisons with bulk RNA-Seq data. Each of these approaches has specific limitations, which makes it difficult to combine them into a single quantitative measure of denoising accuracy. Here, we instead rely on molecular cross-validation (MCV), an approach developed specifically to quantify denoising accuracy in the absence of a ground truth (Batson et al., 2019). In MCV, the observed molecules in a given scRNA-Seq dataset are first partitioned into a training and a test dataset. Next, a denoising method is applied to the training dataset. Finally, denoising accuracy is measured by comparing the denoised result to the test dataset. The authors show, both in theory and in practice, that the accuracy measured this way is representative of the accuracy that would be obtained on a ground truth dataset.
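
The partitioning step of MCV can be illustrated with a minimal sketch. It assumes that each observed molecule is assigned to the training set independently with a fixed probability; the function name `split_molecules`, the `train_frac` value of 0.9, and the toy matrix are all hypothetical and are not taken from this repository.

```python
import random


def split_molecules(counts, train_frac=0.9, seed=0):
    """Partition the molecules behind each count into train and test counts."""
    rng = random.Random(seed)
    train, test = [], []
    for row in counts:
        # Each of the `c` molecules goes to train with probability train_frac.
        train_row = [sum(rng.random() < train_frac for _ in range(c)) for c in row]
        train.append(train_row)
        # Whatever was not assigned to train ends up in test.
        test.append([c - t for c, t in zip(row, train_row)])
    return train, test


counts = [[3, 0, 5], [1, 2, 0]]  # toy cells x genes count matrix
train, test = split_molecules(counts)
train_sum = sum(map(sum, train))  # total number of counts in the training split
```

Because every molecule lands in exactly one of the two splits, `train` and `test` always add up to the original counts entry by entry.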

Authors & contributors

| Name | Roles |
| --- | --- |
| Wesley Lewis | author, maintainer |
| Scott Gigante | author, maintainer |
| Robrecht Cannoodt | author |
| Kai Waldrant | contributor |

API

flowchart TB
  file_common_dataset("<a href='https://github.com/openproblems-bio/task_denoising#file-format-common-dataset'>Common Dataset</a>")
  comp_data_processor[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-data-processor'>Data processor</a>"/]
  file_test("<a href='https://github.com/openproblems-bio/task_denoising#file-format-test-data'>Test data</a>")
  file_train("<a href='https://github.com/openproblems-bio/task_denoising#file-format-training-data'>Training data</a>")
  comp_control_method[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-control-method'>Control Method</a>"/]
  comp_metric[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-metric'>Metric</a>"/]
  comp_method[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-method'>Method</a>"/]
  file_prediction("<a href='https://github.com/openproblems-bio/task_denoising#file-format-denoised-data'>Denoised data</a>")
  file_score("<a href='https://github.com/openproblems-bio/task_denoising#file-format-score'>Score</a>")
  file_common_dataset---comp_data_processor
  comp_data_processor-->file_test
  comp_data_processor-->file_train
  file_test---comp_control_method
  file_test---comp_metric
  file_train---comp_control_method
  file_train---comp_method
  comp_control_method-->file_prediction
  comp_metric-->file_score
  comp_method-->file_prediction
  file_prediction---comp_metric

File format: Common Dataset

A subset of the common dataset.

Example file: resources_test/common/cxg_immune_cell_atlas/dataset.h5ad

Format:

AnnData object
 obs: 'batch'
 layers: 'counts'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'

Data structure:

| Slot | Type | Description |
| --- | --- | --- |
| `obs["batch"]` | string | (Optional) Batch information. |
| `layers["counts"]` | integer | Raw counts. |
| `uns["dataset_id"]` | string | A unique identifier for the dataset. |
| `uns["dataset_name"]` | string | Nicely formatted name. |
| `uns["dataset_url"]` | string | (Optional) Link to the original source of the dataset. |
| `uns["dataset_reference"]` | string | (Optional) BibTeX reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | string | Short description of the dataset. |
| `uns["dataset_description"]` | string | Long description of the dataset. |
| `uns["dataset_organism"]` | string | (Optional) The organism of the sample in the dataset. |
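
As a rough illustration of how the slot listing above could be checked, the sketch below reports which required slots are missing from an object. The helper `missing_slots` and the `REQUIRED_UNS` list are hypothetical; the list is transcribed from the slots not marked as optional. The check only relies on `.layers` and `.uns` supporting the `in` operator, so a `SimpleNamespace` stands in for an AnnData object here.

```python
from types import SimpleNamespace

# Required `uns` keys, i.e. those not marked "(Optional)" in the listing above.
REQUIRED_UNS = ["dataset_id", "dataset_name", "dataset_summary", "dataset_description"]


def missing_slots(adata):
    """Return the names of required slots that are absent from `adata`."""
    missing = []
    if "counts" not in adata.layers:
        missing.append('layers["counts"]')
    for key in REQUIRED_UNS:
        if key not in adata.uns:
            missing.append(f'uns["{key}"]')
    return missing


# Stand-in object for demonstration purposes only.
toy = SimpleNamespace(
    layers={"counts": [[1, 0], [0, 2]]},
    uns={"dataset_id": "toy", "dataset_name": "Toy dataset"},
)
print(missing_slots(toy))  # ['uns["dataset_summary"]', 'uns["dataset_description"]']
```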

Component type: Data processor

A denoising dataset processor.

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| `--input` | file | A subset of the common dataset. |
| `--output_train` | file | (Output) The subset of molecules used for the training dataset. |
| `--output_test` | file | (Output) The subset of molecules used for the test dataset. |
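
To make the argument contract concrete, here is a minimal sketch of a command-line parser that accepts the three arguments listed above. It uses Python's standard `argparse` module purely as a stand-in; how the components in this repository actually receive their arguments is not stated in this section, and the output file names are placeholders.

```python
import argparse


def build_parser():
    """Build a parser mirroring the data processor arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="Denoising data processor (sketch)")
    parser.add_argument("--input", required=True, help="A subset of the common dataset.")
    parser.add_argument("--output_train", required=True, help="Output training dataset.")
    parser.add_argument("--output_test", required=True, help="Output test dataset.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "--input", "resources_test/common/cxg_immune_cell_atlas/dataset.h5ad",
    "--output_train", "train.h5ad",
    "--output_test", "test.h5ad",
])
print(args.output_train)  # train.h5ad
```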

File format: Test data

The subset of molecules used for the test dataset.

Example file: resources_test/task_denoising/cxg_immune_cell_atlas/test.h5ad

Format:

AnnData object
 layers: 'counts'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'train_sum'

Data structure:

| Slot | Type | Description |
| --- | --- | --- |
| `layers["counts"]` | integer | Raw counts. |
| `uns["dataset_id"]` | string | A unique identifier for the dataset. |
| `uns["dataset_name"]` | string | Nicely formatted name. |
| `uns["dataset_url"]` | string | (Optional) Link to the original source of the dataset. |
| `uns["dataset_reference"]` | string | (Optional) BibTeX reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | string | Short description of the dataset. |
| `uns["dataset_description"]` | string | Long description of the dataset. |
| `uns["dataset_organism"]` | string | (Optional) The organism of the sample in the dataset. |
| `uns["train_sum"]` | integer | The total number of counts in the training dataset. |

File format: Training data

The subset of molecules used for the training dataset.

Example file: resources_test/task_denoising/cxg_immune_cell_atlas/train.h5ad

Format:

AnnData object
 layers: 'counts'
 uns: 'dataset_id'

Data structure:

| Slot | Type | Description |
| --- | --- | --- |
| `layers["counts"]` | integer | Raw counts. |
| `uns["dataset_id"]` | string | A unique identifier for the dataset. |

Component type: Control Method

A control method.

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| `--input_train` | file | The subset of molecules used for the training dataset. |
| `--input_test` | file | The subset of molecules used for the test dataset. |
| `--output` | file | (Output) A denoised dataset as output by a method. |
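
Unlike a regular method, a control method receives both the training and the test data. The sketch below shows two hypothetical controls that make use of this: one passes the training counts through unchanged, the other returns the held-out test counts, which no real method could see. The function names, the `method_id` values, and the use of plain dictionaries in place of the file formats are assumptions made for illustration.

```python
def no_denoising(train, test):
    """Hypothetical negative control: pass the training counts through unchanged."""
    return {
        "denoised": [row[:] for row in train["counts"]],
        "dataset_id": train["dataset_id"],
        "method_id": "no_denoising",
    }


def perfect_denoising(train, test):
    """Hypothetical positive control: return the held-out test counts."""
    return {
        "denoised": [row[:] for row in test["counts"]],
        "dataset_id": train["dataset_id"],
        "method_id": "perfect_denoising",
    }


# Toy stand-ins for the training and test data.
train = {"counts": [[2, 0], [1, 3]], "dataset_id": "toy"}
test = {"counts": [[1, 0], [0, 1]], "dataset_id": "toy"}
print(perfect_denoising(train, test)["denoised"])  # [[1, 0], [0, 1]]
```

Controls of this kind bracket the scores that real methods can be expected to reach on a given metric.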

Component type: Metric

A metric.

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| `--input_test` | file | The subset of molecules used for the test dataset. |
| `--input_prediction` | file | A denoised dataset as output by a method. |
| `--output` | file | (Output) File indicating the score of a metric. |
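
A metric compares a prediction made on the training split against the test split, and the two splits generally contain different total numbers of counts. The sketch below shows one way the `train_sum` value stored with the test data could be used to bring the prediction onto the scale of the test data before computing a mean squared error. The function `rescaled_mse` and its rescaling rule are assumptions for illustration, not necessarily how the metrics in this repository are defined.

```python
def rescaled_mse(test_counts, denoised, train_sum):
    """Mean squared error after rescaling `denoised` to the test data's total counts."""
    test_sum = sum(map(sum, test_counts))
    scale = test_sum / train_sum  # ratio of test depth to training depth
    total, n = 0.0, 0
    for test_row, denoised_row in zip(test_counts, denoised):
        for t, d in zip(test_row, denoised_row):
            total += (d * scale - t) ** 2
            n += 1
    return total / n


test_counts = [[1, 0], [0, 1]]  # toy test split, 2 counts in total
denoised = [[4, 0], [0, 4]]     # toy prediction made from a training split
print(rescaled_mse(test_counts, denoised, train_sum=8))  # 0.0
```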

Component type: Method

A method.

Arguments:

| Name | Type | Description |
| --- | --- | --- |
| `--input_train` | file | The subset of molecules used for the training dataset. |
| `--output` | file | (Output) A denoised dataset as output by a method. |

File format: Denoised data

A denoised dataset as output by a method.

Example file: resources_test/task_denoising/cxg_immune_cell_atlas/denoised.h5ad

Format:

AnnData object
 layers: 'denoised'
 uns: 'dataset_id', 'method_id'

Data structure:

| Slot | Type | Description |
| --- | --- | --- |
| `layers["denoised"]` | integer | Denoised data. |
| `uns["dataset_id"]` | string | A unique identifier for the dataset. |
| `uns["method_id"]` | string | A unique identifier for the method. |

File format: Score

File indicating the score of a metric.

Example file: resources_test/task_denoising/cxg_immune_cell_atlas/score.h5ad

Format:

AnnData object
 uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'

Data structure:

| Slot | Type | Description |
| --- | --- | --- |
| `uns["dataset_id"]` | string | A unique identifier for the dataset. |
| `uns["method_id"]` | string | A unique identifier for the method. |
| `uns["metric_ids"]` | string | One or more unique metric identifiers. |
| `uns["metric_values"]` | double | The metric values obtained for the given prediction. Must be the same length as `metric_ids`. |
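
The length constraint between `metric_ids` and `metric_values` can be enforced when a score record is assembled. The sketch below uses a plain dictionary in place of the `uns` slot; the helper `make_score` and the example identifiers are hypothetical.

```python
def make_score(dataset_id, method_id, metric_ids, metric_values):
    """Assemble the score fields, rejecting mismatched ids and values."""
    if len(metric_ids) != len(metric_values):
        raise ValueError("metric_ids and metric_values must have the same length")
    return {
        "dataset_id": dataset_id,
        "method_id": method_id,
        "metric_ids": list(metric_ids),
        # Values are stored as doubles, matching the declared slot type.
        "metric_values": [float(v) for v in metric_values],
    }


score = make_score("toy", "no_denoising", ["example_metric"], [0.25])
print(score["metric_values"])  # [0.25]
```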