Remove unwanted batch effects from scRNA-seq data while retaining biologically meaningful variation.
Repository: openproblems-bio/task_batch_integration
As single-cell technologies advance, single-cell datasets are growing in both size and complexity. Especially in consortia such as the Human Cell Atlas, individual studies combine data from multiple labs, each sequencing multiple individuals, possibly with different technologies. This gives rise to complex batch effects that must be removed computationally before a joint analysis can be performed. Batch integration methods must remove these batch effects without removing relevant biological information. Currently, over 200 tools exist that aim to remove batch effects from scRNA-seq datasets [@zappia2018exploring]. These methods balance the removal of batch effects against the conservation of nuanced biological information in different ways. This abundance of tools has complicated the choice of batch integration method, leading to several benchmarks on this topic [@luecken2020benchmarking; @tran2020benchmark; @chazarragil2021flexible; @mereu2020benchmarking]. Yet these benchmarks use different metrics, method implementations, and datasets. Here we build a living benchmarking task for batch integration methods with the vision of improving the consistency of method evaluation.
In this task we evaluate batch integration methods on their ability to remove batch effects in the data while conserving variation attributed to biological effects. As input, methods require either normalised or unnormalised data with multiple batches and consistent cell type labels. The batch-integrated output can be a feature matrix, a low-dimensional embedding, and/or a neighbourhood graph. The respective batch-integrated representation is then evaluated using sets of metrics that capture how well batch effects are removed and whether biological variance is conserved. We have based this task on the latest and most extensive benchmark of single-cell data integration methods.
name | roles |
---|---|
Michaela Mueller | maintainer, author |
Malte Luecken | author |
Daniel Strobl | author |
Robrecht Cannoodt | contributor |
Scott Gigante | contributor |
Kai Waldrant | contributor |
Martin Kim | contributor |
flowchart TB
file_common_dataset("<a href='https://github.com/openproblems-bio/task_batch_integration#file-format-common-dataset'>Common Dataset</a>")
comp_process_dataset[/"<a href='https://github.com/openproblems-bio/task_batch_integration#component-type-data-processor'>Data processor</a>"/]
file_dataset("<a href='https://github.com/openproblems-bio/task_batch_integration#file-format-dataset'>Dataset</a>")
file_solution("<a href='https://github.com/openproblems-bio/task_batch_integration#file-format-solution'>Solution</a>")
comp_control_method[/"<a href='https://github.com/openproblems-bio/task_batch_integration#component-type-control-method'>Control method</a>"/]
comp_method[/"<a href='https://github.com/openproblems-bio/task_batch_integration#component-type-method'>Method</a>"/]
comp_process_integration[/"<a href='https://github.com/openproblems-bio/task_batch_integration#component-type-process-integration'>Process integration</a>"/]
comp_metric[/"<a href='https://github.com/openproblems-bio/task_batch_integration#component-type-metric'>Metric</a>"/]
file_integrated("<a href='https://github.com/openproblems-bio/task_batch_integration#file-format-integration'>Integration</a>")
file_integrated_processed("<a href='https://github.com/openproblems-bio/task_batch_integration#file-format-processed-integration-output'>Processed integration output</a>")
file_score("<a href='https://github.com/openproblems-bio/task_batch_integration#file-format-score'>Score</a>")
file_common_dataset---comp_process_dataset
comp_process_dataset-->file_dataset
comp_process_dataset-->file_solution
file_dataset---comp_control_method
file_dataset---comp_method
file_dataset---comp_process_integration
file_solution---comp_control_method
file_solution---comp_metric
comp_control_method-->file_integrated
comp_method-->file_integrated
comp_process_integration-->file_integrated_processed
comp_metric-->file_score
file_integrated---comp_process_integration
file_integrated_processed---comp_metric
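The flowchart above implies an order in which files are produced and components are run. As a rough illustration, the sketch below encodes the same edges as a dependency mapping and derives one valid order with Python's standard `graphlib` module. The node names are copied from the diagram; the `DEPENDS_ON` mapping and the `execution_order` helper are hypothetical and not part of the repository. In practice either a control method or a method produces the integration, so listing both as predecessors of `file_integrated` is a simplification made only for ordering purposes.

```python
from graphlib import TopologicalSorter

# Each key depends on the nodes in its set (edges copied from the flowchart).
DEPENDS_ON = {
    "comp_process_dataset": {"file_common_dataset"},
    "file_dataset": {"comp_process_dataset"},
    "file_solution": {"comp_process_dataset"},
    "comp_control_method": {"file_dataset", "file_solution"},
    "comp_method": {"file_dataset"},
    "file_integrated": {"comp_control_method", "comp_method"},
    "comp_process_integration": {"file_dataset", "file_integrated"},
    "file_integrated_processed": {"comp_process_integration"},
    "comp_metric": {"file_integrated_processed", "file_solution"},
    "file_score": {"comp_metric"},
}


def execution_order(depends_on):
    """Return one valid topological order of files and components."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
# Keep only the components to see the order in which they would run.
components = [node for node in order if node.startswith("comp_")]
print(components)
```

Several orders can be valid (for example, the control method and the method are independent of each other), so the exact sequence printed may differ between runs or Python versions.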
A subset of the common dataset.
Example file: resources_test/common/cxg_immune_cell_atlas/dataset.h5ad
Format:
AnnData object
obs: 'cell_type', 'batch'
var: 'hvg', 'hvg_score', 'feature_name', 'feature_id'
obsm: 'X_pca'
obsp: 'knn_distances', 'knn_connectivities'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'knn'
Data structure:
Slot | Type | Description |
---|---|---|
obs["cell_type"] | string | Cell type information. |
obs["batch"] | string | Batch information. |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A ranking of the features by hvg. |
var["feature_name"] | string | A human-readable name for the feature, usually a gene symbol. |
var["feature_id"] | string | A database identifier for the feature, usually an ENSEMBL ID. |
obsm["X_pca"] | double | The resulting PCA embedding. |
obsp["knn_distances"] | double | K nearest neighbors distance matrix. |
obsp["knn_connectivities"] | double | K nearest neighbors connectivities matrix. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized expression values. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["dataset_name"] | string | Nicely formatted name. |
uns["dataset_url"] | string | (Optional) Link to the original source of the dataset. |
uns["dataset_reference"] | string | (Optional) Bibtex reference of the paper in which the dataset was published. |
uns["dataset_summary"] | string | Short description of the dataset. |
uns["dataset_description"] | string | Long description of the dataset. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
uns["knn"] | object | (Optional) Supplementary K nearest neighbors data. |
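To make the slot table above concrete, the following is a minimal sketch of how a file's contents could be checked against the required (non-optional) slots. It works on plain sets of slot names rather than on an AnnData object, so it runs without extra dependencies; `REQUIRED_COMMON_SLOTS` and `missing_slots` are hypothetical names introduced for illustration, not part of the repository.

```python
# Required slots of the common dataset, copied from the table above
# (slots marked "(Optional)" are left out).
REQUIRED_COMMON_SLOTS = {
    "obs": {"cell_type", "batch"},
    "var": {"hvg", "hvg_score", "feature_name", "feature_id"},
    "obsm": {"X_pca"},
    "obsp": {"knn_distances", "knn_connectivities"},
    "layers": {"counts", "normalized"},
    "uns": {
        "dataset_id",
        "dataset_name",
        "dataset_summary",
        "dataset_description",
        "normalization_id",
    },
}


def missing_slots(present, required=REQUIRED_COMMON_SLOTS):
    """Return {slot_group: sorted missing keys} for every incomplete group."""
    missing = {}
    for group, keys in required.items():
        absent = keys - set(present.get(group, ()))
        if absent:
            missing[group] = sorted(absent)
    return missing


# Toy description of a file that lacks the 'normalized' layer.
example = {group: set(keys) for group, keys in REQUIRED_COMMON_SLOTS.items()}
example["layers"] = {"counts"}
print(missing_slots(example))  # → {'layers': ['normalized']}
```

With a real AnnData object `adata`, the `present` mapping could be built from `adata.obs.columns`, `adata.var.columns`, and the keys of `adata.obsm`, `adata.obsp`, `adata.layers`, and `adata.uns`.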
A batch integration dataset processor.
Arguments:
Name | Type | Description |
---|---|---|
--input | file | A subset of the common dataset. |
--output_dataset | file | (Output) Unintegrated AnnData HDF5 file. |
--output_solution | file | (Output) Uncensored dataset containing the true labels. |
--hvgs | integer | (Optional) NA. Default: 2000. |
Unintegrated AnnData HDF5 file.
Example file: resources_test/task_batch_integration/cxg_immune_cell_atlas/dataset.h5ad
Format:
AnnData object
obs: 'cell_type', 'batch'
var: 'hvg', 'hvg_score', 'feature_name', 'feature_id'
obsm: 'X_pca'
obsp: 'knn_distances', 'knn_connectivities'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'normalization_id', 'dataset_organism', 'knn'
Data structure:
Slot | Type | Description |
---|---|---|
obs["cell_type"] | string | Cell type information. |
obs["batch"] | string | Batch information. |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A ranking of the features by hvg. |
var["feature_name"] | string | A human-readable name for the feature, usually a gene symbol. |
var["feature_id"] | string | A database identifier for the feature, usually an ENSEMBL ID. |
obsm["X_pca"] | double | The resulting PCA embedding. |
obsp["knn_distances"] | double | K nearest neighbors distance matrix. |
obsp["knn_connectivities"] | double | K nearest neighbors connectivities matrix. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized expression values. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["knn"] | object | Supplementary K nearest neighbors data. |
Uncensored dataset containing the true labels.
Example file: resources_test/task_batch_integration/cxg_immune_cell_atlas/solution.h5ad
Format:
AnnData object
obs: 'cell_type', 'batch'
var: 'feature_name', 'feature_id', 'hvg', 'hvg_score', 'batch_hvg'
obsm: 'X_pca'
obsp: 'knn_distances', 'knn_connectivities'
layers: 'counts', 'normalized'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'knn'
Data structure:
Slot | Type | Description |
---|---|---|
obs["cell_type"] | string | Cell type information. |
obs["batch"] | string | Batch information. |
var["feature_name"] | string | A human-readable name for the feature, usually a gene symbol. |
var["feature_id"] | string | A database identifier for the feature, usually an ENSEMBL ID. |
var["hvg"] | boolean | Whether or not the feature is considered to be a ‘highly variable gene’. |
var["hvg_score"] | double | A ranking of the features by hvg. |
var["batch_hvg"] | boolean | Whether or not the feature is considered to be a batch-aware ‘highly variable gene’. |
obsm["X_pca"] | double | The resulting PCA embedding. |
obsp["knn_distances"] | double | K nearest neighbors distance matrix. |
obsp["knn_connectivities"] | double | K nearest neighbors connectivities matrix. |
layers["counts"] | integer | Raw counts. |
layers["normalized"] | double | Normalized expression values. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["dataset_name"] | string | Nicely formatted name. |
uns["dataset_url"] | string | (Optional) Link to the original source of the dataset. |
uns["dataset_reference"] | string | (Optional) Bibtex reference of the paper in which the dataset was published. |
uns["dataset_summary"] | string | Short description of the dataset. |
uns["dataset_description"] | string | Long description of the dataset. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
uns["knn"] | object | Supplementary K nearest neighbors data. |
A control method for the batch integration task.
Arguments:
Name | Type | Description |
---|---|---|
--input_dataset | file | Unintegrated AnnData HDF5 file. |
--input_solution | file | Uncensored dataset containing the true labels. |
--output | file | (Output) An integrated AnnData dataset. |
A method for the batch integration task.
Arguments:
Name | Type | Description |
---|---|---|
--input | file | Unintegrated AnnData HDF5 file. |
--output | file | (Output) An integrated AnnData dataset. |
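As an illustration of the method interface in the table above, the following sketch builds a command-line parser with the same two arguments using Python's standard `argparse`. It is a hypothetical standalone script skeleton, not how components in the repository are necessarily defined; the `build_parser` helper and the file names passed to it are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser mirroring the method arguments in the table above."""
    parser = argparse.ArgumentParser(
        description="Hypothetical batch integration method."
    )
    parser.add_argument(
        "--input", required=True, help="Unintegrated AnnData HDF5 file."
    )
    parser.add_argument(
        "--output", required=True, help="Integrated AnnData dataset to write."
    )
    return parser


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_parser().parse_args(
    ["--input", "dataset.h5ad", "--output", "integrated.h5ad"]
)
print(args.input, "->", args.output)
```

A real method would read the file given by `--input`, compute its integration, and write the result to `--output` in the integration format described in this document.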
Process the output of an integration method into the format expected by the metrics.
Arguments:
Name | Type | Description |
---|---|---|
--input_dataset | file | Unintegrated AnnData HDF5 file. |
--input_integrated | file | An integrated AnnData dataset. |
--expected_method_types | string | NA. |
--output | file | (Output) An integrated AnnData dataset with additional outputs. |
A metric for evaluating batch integration methods.
Arguments:
Name | Type | Description |
---|---|---|
--input_integrated | file | An integrated AnnData dataset with additional outputs. |
--input_solution | file | Uncensored dataset containing the true labels. |
--output | file | (Output) Metric score file. |
An integrated AnnData dataset.
Example file: resources_test/task_batch_integration/cxg_immune_cell_atlas/integrated.h5ad
Description:
Must contain at least one of:
- Feature: the corrected_counts layer
- Embedding: the X_emb obsm
- Graph: the connectivities and distances obsp
Format:
AnnData object
obsm: 'X_emb'
obsp: 'connectivities', 'distances'
layers: 'corrected_counts'
uns: 'dataset_id', 'normalization_id', 'dataset_organism', 'method_id', 'neighbors'
Data structure:
Slot | Type | Description |
---|---|---|
obsm["X_emb"] | double | (Optional) Embedding output - 2D coordinate matrix. |
obsp["connectivities"] | double | (Optional) Graph output - neighbor connectivities matrix. |
obsp["distances"] | double | (Optional) Graph output - neighbor distances matrix. |
layers["corrected_counts"] | double | (Optional) Feature output - corrected counts. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["method_id"] | string | A unique identifier for the method. |
uns["neighbors"] | object | (Optional) Supplementary K nearest neighbors data. |
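The "at least one of" rule for the integration file can be sketched as a small check on slot names. The example below is a minimal illustration that operates on plain collections of names instead of an AnnData object; `detect_output_types` and `check_integration` are hypothetical helpers, and the lowercase type labels they return are chosen for this example only.

```python
def detect_output_types(obsm=(), obsp=(), layers=()):
    """Return which integration output types are present, by slot name."""
    types = []
    if "corrected_counts" in layers:
        types.append("feature")
    if "X_emb" in obsm:
        types.append("embedding")
    # A graph output needs both neighbor matrices.
    if {"connectivities", "distances"} <= set(obsp):
        types.append("graph")
    return types


def check_integration(obsm=(), obsp=(), layers=()):
    """Raise if none of the three output types is present."""
    types = detect_output_types(obsm, obsp, layers)
    if not types:
        raise ValueError(
            "Expected at least one of: feature, embedding, graph output."
        )
    return types


print(check_integration(obsm=["X_emb"]))  # → ['embedding']
```

For a real AnnData object `adata`, the arguments could be taken from the keys of `adata.obsm`, `adata.obsp`, and `adata.layers`.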
An integrated AnnData dataset with additional outputs.
Example file: resources_test/task_batch_integration/cxg_immune_cell_atlas/integrated_processed.h5ad
Description:
Must contain at least one of:
- Feature: the corrected_counts layer
- Embedding: the X_emb obsm
- Graph: the connectivities and distances obsp
The Graph should always be present, but the Feature and Embedding are optional.
Format:
AnnData object
obsm: 'X_emb', 'clustering'
obsp: 'connectivities', 'distances'
layers: 'corrected_counts'
uns: 'dataset_id', 'normalization_id', 'dataset_organism', 'method_id', 'neighbors'
Data structure:
Slot | Type | Description |
---|---|---|
obsm["X_emb"] | double | (Optional) Embedding output - 2D coordinate matrix. |
obsm["clustering"] | integer | Leiden clustering results at different resolutions. |
obsp["connectivities"] | double | Graph output - neighbor connectivities matrix. |
obsp["distances"] | double | Graph output - neighbor distances matrix. |
layers["corrected_counts"] | double | (Optional) Feature output - corrected counts. |
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
uns["dataset_organism"] | string | (Optional) The organism of the sample in the dataset. |
uns["method_id"] | string | A unique identifier for the method. |
uns["neighbors"] | object | Supplementary K nearest neighbors data. |
Metric score file.
Example file: score.h5ad
Format:
AnnData object
uns: 'dataset_id', 'normalization_id', 'method_id', 'metric_ids', 'metric_values'
Data structure:
Slot | Type | Description |
---|---|---|
uns["dataset_id"] | string | A unique identifier for the dataset. |
uns["normalization_id"] | string | Which normalization was used. |
uns["method_id"] | string | A unique identifier for the method. |
uns["metric_ids"] | string | One or more unique metric identifiers. |
uns["metric_values"] | double | The metric values obtained for the given prediction. Must be of same length as ‘metric_ids’. |
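The length constraint between `metric_ids` and `metric_values` can be illustrated with a short sketch that pairs the two fields. It takes a plain dictionary standing in for the `uns` slot; `scores_by_metric` is a hypothetical helper and the metric names `metric_a` and `metric_b` are placeholders, not metrics defined by this task.

```python
def scores_by_metric(uns):
    """Pair metric_ids with metric_values, accepting scalars or lists."""
    ids = uns["metric_ids"]
    values = uns["metric_values"]
    # A single metric may be stored as a bare string and a bare number.
    if isinstance(ids, str):
        ids = [ids]
    if isinstance(values, (int, float)):
        values = [values]
    if len(ids) != len(values):
        raise ValueError(
            f"metric_ids has {len(ids)} entries but "
            f"metric_values has {len(values)}"
        )
    return dict(zip(ids, map(float, values)))


# Placeholder content for a score file's uns slot.
example_uns = {
    "dataset_id": "cxg_immune_cell_atlas",
    "metric_ids": ["metric_a", "metric_b"],
    "metric_values": [0.5, 1],
}
print(scores_by_metric(example_uns))  # → {'metric_a': 0.5, 'metric_b': 1.0}
```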