Removing noise in sparse single-cell RNA-sequencing count data
Repository: openproblems-bio/task_denoising
A key challenge in evaluating denoising methods is the general lack of a ground truth. A recent benchmark study (Hou et al., 2020) relied on flow-sorted datasets, mixture control experiments (Tian et al., 2019), and comparisons with bulk RNA-Seq data. Since each of these approaches suffers from specific limitations, it is difficult to combine these different approaches into a single quantitative measure of denoising accuracy. Here, we instead rely on an approach termed molecular cross-validation (MCV), which was specifically developed to quantify denoising accuracy in the absence of a ground truth (Batson et al., 2019). In MCV, the observed molecules in a given scRNA-Seq dataset are first partitioned between a training and a test dataset. Next, a denoising method is applied to the training dataset. Finally, denoising accuracy is measured by comparing the result to the test dataset. The authors show that both in theory and in practice, the measured denoising accuracy is representative of the accuracy that would be obtained on a ground truth dataset.
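The partitioning step described above can be sketched as binomial thinning of the count matrix. This is a minimal illustration, not the repository's implementation: the function name `split_molecules`, the 0.9 training fraction, and the toy Poisson counts are all assumptions made for the example.

```python
import numpy as np

def split_molecules(counts, train_frac=0.9, seed=0):
    """Partition each observed count into a training part and a test part.

    Hypothetical helper: every molecule is assigned to the training set
    independently with probability `train_frac` (an assumed value).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=np.int64)
    # Binomial thinning: train[i, j] ~ Binomial(counts[i, j], train_frac)
    train = rng.binomial(counts, train_frac)
    # The remaining molecules form the held-out test set
    test = counts - train
    return train, test

# Toy cells-by-genes count matrix (illustrative data only)
counts = np.random.default_rng(1).poisson(2.0, size=(5, 4))
train, test = split_molecules(counts)
```

Because the two parts always sum to the original counts, no molecule is used for both training and evaluation.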
name | roles |
---|---|
Wesley Lewis | author, maintainer |
Scott Gigante | author, maintainer |
Robrecht Cannoodt | author |
Kai Waldrant | contributor |
```mermaid
flowchart TB
  file_common_dataset("<a href='https://github.com/openproblems-bio/task_denoising#file-format-common-dataset'>Common Dataset</a>")
  comp_data_processor[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-data-processor'>Data processor</a>"/]
  file_test("<a href='https://github.com/openproblems-bio/task_denoising#file-format-test-data'>Test data</a>")
  file_train("<a href='https://github.com/openproblems-bio/task_denoising#file-format-training-data'>Training data</a>")
  comp_control_method[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-control-method'>Control Method</a>"/]
  comp_metric[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-metric'>Metric</a>"/]
  comp_method[/"<a href='https://github.com/openproblems-bio/task_denoising#component-type-method'>Method</a>"/]
  file_prediction("<a href='https://github.com/openproblems-bio/task_denoising#file-format-denoised-data'>Denoised data</a>")
  file_score("<a href='https://github.com/openproblems-bio/task_denoising#file-format-score'>Score</a>")
  file_common_dataset---comp_data_processor
  comp_data_processor-->file_test
  comp_data_processor-->file_train
  file_test---comp_control_method
  file_test---comp_metric
  file_train---comp_control_method
  file_train---comp_method
  comp_control_method-->file_prediction
  comp_metric-->file_score
  comp_method-->file_prediction
  file_prediction---comp_metric
```
A subset of the common dataset.
Example file: resources_test/common/cxg_immune_cell_atlas/dataset.h5ad
Format:
AnnData object
obs: 'batch'
layers: 'counts'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'
Data structure:
Slot | Type | Description |
---|---|---|
obs["batch"] | string | (Optional) Batch information. |
layers["counts"] | integer | Raw counts. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["dataset_name"] | string | Nicely formatted name. |
uns["dataset_url"] | string | (Optional) Link to the original source of the dataset. |
uns["dataset_reference"] | string | (Optional) Bibtex reference of the paper in which the dataset was published. |
uns["dataset_summary"] | string | Short description of the dataset. |
uns["dataset_description"] | string | Long description of the dataset. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
A denoising dataset processor.
Arguments:
Name | Type | Description |
---|---|---|
--input | file | A subset of the common dataset. |
--output_train | file | (Output) The subset of molecules used for the training dataset. |
--output_test | file | (Output) The subset of molecules used for the test dataset. |
The subset of molecules used for the test dataset.
Example file:
resources_test/task_denoising/cxg_immune_cell_atlas/test.h5ad
Format:
AnnData object
layers: 'counts'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'train_sum'
Data structure:
Slot | Type | Description |
---|---|---|
layers["counts"] | integer | Raw counts. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["dataset_name"] | string | Nicely formatted name. |
uns["dataset_url"] | string | (Optional) Link to the original source of the dataset. |
uns["dataset_reference"] | string | (Optional) Bibtex reference of the paper in which the dataset was published. |
uns["dataset_summary"] | string | Short description of the dataset. |
uns["dataset_description"] | string | Long description of the dataset. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["train_sum"] | integer | The total number of counts in the training dataset. |
The subset of molecules used for the training dataset.
Example file:
resources_test/task_denoising/cxg_immune_cell_atlas/train.h5ad
Format:
AnnData object
layers: 'counts'
uns: 'dataset_id'
Data structure:
Slot | Type | Description |
---|---|---|
layers["counts"] | integer | Raw counts. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
A control method.
Arguments:
Name | Type | Description |
---|---|---|
--input_train | file | The subset of molecules used for the training dataset. |
--input_test | file | The subset of molecules used for the test dataset. |
--output | file | (Output) A denoised dataset as output by a method. |
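Unlike a regular method, a control method receives both the training and the test data, so it can serve as a reference point for the metrics. The sketch below shows two plausible controls under that interface; the function names `no_denoising` and `perfect_denoising` and their behaviour are assumptions for illustration, not a description of the controls shipped in the repository.

```python
import numpy as np

def no_denoising(train_counts, test_counts):
    """Hypothetical negative control: return the training counts unchanged."""
    return np.asarray(train_counts, dtype=float).copy()

def perfect_denoising(train_counts, test_counts):
    """Hypothetical positive control: return the held-out test counts.

    This uses information that a real method never sees, so it only
    marks the best score a metric can report.
    """
    return np.asarray(test_counts, dtype=float).copy()

# Toy train/test split of a 2-cell, 3-gene matrix (illustrative data only)
train_counts = np.array([[4, 0, 2], [1, 3, 0]])
test_counts = np.array([[1, 0, 0], [0, 1, 0]])

baseline = no_denoising(train_counts, test_counts)
oracle = perfect_denoising(train_counts, test_counts)
```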
A metric.
Arguments:
Name | Type | Description |
---|---|---|
--input_test | file | The subset of molecules used for the test dataset. |
--input_prediction | file | A denoised dataset as output by a method. |
--output | file | (Output) File indicating the score of a metric. |
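As a rough illustration of what a metric component might compute, the sketch below compares a denoised matrix with the test counts using mean squared error. It assumes that both matrices are first scaled to a common per-cell total and log-transformed, since the training and test sets differ in sequencing depth; the helper names, the `target_sum` of 10,000, and the choice of MSE are assumptions, not the repository's definition.

```python
import numpy as np

def log_normalize(x, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p."""
    x = np.asarray(x, dtype=float)
    totals = x.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(x / totals * target_sum)

def mse_score(test_counts, denoised):
    """Mean squared error between log-normalized test and denoised data."""
    diff = log_normalize(test_counts) - log_normalize(denoised)
    return float(np.mean(diff ** 2))

# Toy data (illustrative only)
test_counts = np.array([[0, 3, 1], [2, 0, 2]])
same_profile = test_counts * 9.0  # same proportions, higher depth
flat_profile = np.ones_like(test_counts, dtype=float)

score_same = mse_score(test_counts, same_profile)
score_flat = mse_score(test_counts, flat_profile)
```

Because of the per-cell scaling, a prediction with the right proportions but a different depth scores close to zero, while a flat profile is penalized.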
A method.
Arguments:
Name | Type | Description |
---|---|---|
--input_train | file | The subset of molecules used for the training dataset. |
--output | file | (Output) A denoised dataset as output by a method. |
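A method sees only the training counts and must return a matrix of the same shape. The sketch below uses a truncated SVD on square-root-transformed counts as a stand-in denoiser; this technique, the function name `low_rank_denoise`, and the default rank are chosen purely for illustration and are not one of the methods benchmarked in the repository.

```python
import numpy as np

def low_rank_denoise(train_counts, rank=2):
    """Hypothetical denoiser: rank-`rank` SVD approximation of sqrt(counts)."""
    # The square root roughly stabilizes the variance of count data
    x = np.sqrt(np.asarray(train_counts, dtype=float))
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    k = min(rank, s.size)
    # Keep only the top-k singular components
    approx = (u[:, :k] * s[:k]) @ vt[:k, :]
    # Clip negative values and undo the square-root transform
    return np.clip(approx, 0, None) ** 2

# Toy cells-by-genes training counts (illustrative data only)
train_counts = np.random.default_rng(0).poisson(3.0, size=(6, 5))
denoised = low_rank_denoise(train_counts, rank=2)
```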
A denoised dataset as output by a method.
Example file:
resources_test/task_denoising/cxg_immune_cell_atlas/denoised.h5ad
Format:
AnnData object
layers: 'denoised'
uns: 'dataset_id', 'method_id'
Data structure:
Slot | Type | Description |
---|---|---|
layers["denoised"] | integer | Denoised data. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["method_id"] | string | A unique identifier for the method. |
File indicating the score of a metric.
Example file:
resources_test/task_denoising/cxg_immune_cell_atlas/score.h5ad
Format:
AnnData object
uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'
Data structure:
Slot | Type | Description |
---|---|---|
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["method_id"] | string | A unique identifier for the method. |
uns["metric_ids"] | string | One or more unique metric identifiers. |
uns["metric_values"] | double | The metric values obtained for the given prediction. Must be the same length as 'metric_ids'. |
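The length constraint between 'metric_ids' and 'metric_values' can be checked before a score file is written. The sketch below validates a plain dictionary standing in for the `uns` slot; the function name `check_score` and the example identifiers and values are hypothetical.

```python
def check_score(uns):
    """Validate a score `uns` mapping and return {metric_id: value}."""
    required = ("dataset_id", "method_id", "metric_ids", "metric_values")
    missing = [key for key in required if key not in uns]
    if missing:
        raise ValueError(f"missing slots: {missing}")

    ids, values = uns["metric_ids"], uns["metric_values"]
    # Allow a single metric to be given as a scalar rather than a list
    if isinstance(ids, str):
        ids = [ids]
    if isinstance(values, (int, float)):
        values = [values]
    if len(ids) != len(values):
        raise ValueError("metric_ids and metric_values differ in length")
    return {metric: float(value) for metric, value in zip(ids, values)}

# Illustrative values only; the identifiers are made up for this example
example_uns = {
    "dataset_id": "cxg_immune_cell_atlas",
    "method_id": "example_method",
    "metric_ids": ["mse", "poisson"],
    "metric_values": [0.25, 0.1],
}
scores = check_score(example_uns)
```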