Skip to content

openproblems-bio/task_perturbation_prediction

Repository files navigation

Perturbation Prediction

Predicting how small molecules change gene expression in different cell types.

Path to source: src

README

Installation

You need to have Docker, Java, and Viash installed. Follow these instructions to install the required dependencies.

Add a method

To add a method to the repository, follow the instructions in the scripts/add_a_method.sh script.

Frequently used commands

To get started, you can run the following commands:

git clone git@github.com:openproblems-bio/task_perturbation_prediction.git

cd task_perturbation_prediction

# download resources
scripts/download_resources.sh

To run the benchmark, you first need to build the components. Afterwards, you can run the benchmark:

viash ns build --parallel --setup cachedbuild

scripts/run_benchmark.sh

After adding a component, it is recommended to run the tests to ensure that the component is working correctly:

viash ns test --parallel

Optionally, you can provide the --query argument to test only a subset of components:

viash ns test --parallel --query "component_name"

Motivation

Human biology can be complex, in part due to the function and interplay of the body’s approximately 37 trillion cells, which are organized into tissues, organs, and systems. However, recent advances in single-cell technologies have provided unparalleled insight into the function of cells and tissues at the level of DNA, RNA, and proteins. Yet leveraging single-cell methods to develop medicines requires mapping causal links between chemical perturbations and the downstream impact on cell state. These experiments are costly and labor intensive, and not all cells and tissues are amenable to high-throughput transcriptomic screening. If data science could help accurately predict chemical perturbations in new cell types, it could accelerate and expand the development of new medicines.

Several methods have been developed for drug perturbation prediction, most of which are variations on the autoencoder architecture (Dr.VAE, scGEN, and ChemCPA). However, these methods lack proper benchmarking datasets with diverse cell types to determine how well they generalize. The largest available training dataset is the NIH-funded Connectivity Map (CMap), which comprises over 1.3M small molecule perturbation measurements. However, the CMap includes observations of only 978 genes, less than 5% of all genes. Furthermore, the CMap data is comprised almost entirely of measurements in cancer cell lines, which may not accurately represent human biology.

Description

This task aims to predict how small molecules change gene expression in different cell types. This task was a Kaggle competition as part of the NeurIPS 2023 competition track.

The task is to predict the gene expression profile of a cell after a small molecule perturbation. For this competition, we designed and generated a novel single-cell perturbational dataset in human peripheral blood mononuclear cells (PBMCs). We selected 144 compounds from the Library of Integrated Network-Based Cellular Signatures (LINCS) Connectivity Map dataset (PMID: 29195078) and measured single-cell gene expression profiles after 24 hours of treatment. The experiment was repeated in three healthy human donors, and the compounds were selected based on diverse transcriptional signatures observed in CD34+ hematopoietic stem cells (data not released). We performed this experiment in human PBMCs because the cells are commercially available with pre-obtained consent for public release and PBMCs are a primary, disease-relevant tissue that contains multiple mature cell types (including T-cells, B-cells, myeloid cells, and NK cells) with established markers for annotation of cell types. To supplement this dataset, we also measured cells from each donor at baseline with joint scRNA and single-cell chromatin accessibility measurements using the 10x Multiome assay. We hope that the addition of rich multi-omic data for each donor and cell type at baseline will help establish biological priors that explain the susceptibility of particular genes to exhibit perturbation responses in difference biological contexts.

Authors & contributors

name roles
Artur Szałata author
Robrecht Cannoodt author
Daniel Burkhardt author
Malte D. Luecken author
Tin M. Tunjic contributor
Mengbo Wang contributor
Andrew Benz author
Tianyu Liu contributor
Jalil Nourisa contributor
Rico Meinl contributor

API

flowchart LR
  file_sc_counts("Single Cell Counts")
  comp_process_dataset[/"Process dataset"/]
  file_de_train("DE train")
  file_de_test("DE test")
  file_id_map("ID Map")
  comp_control_method[/"Control Method"/]
  comp_method[/"Method"/]
  comp_metric[/"Metric"/]
  file_prediction("Prediction")
  file_model("Model")
  file_score("Score")
  file_sc_counts---comp_process_dataset
  comp_process_dataset-->file_de_train
  comp_process_dataset-->file_de_test
  comp_process_dataset-->file_id_map
  file_de_train---comp_control_method
  file_de_train---comp_method
  file_de_test---comp_control_method
  file_de_test---comp_metric
  file_id_map---comp_control_method
  file_id_map---comp_method
  comp_control_method-->file_prediction
  comp_method-->file_prediction
  comp_method-->file_model
  comp_metric-->file_score
  file_prediction---comp_metric
Loading

File format: Single Cell Counts

Anndata with the counts of the whole dataset.

Example file: resources/neurips-2023-raw/sc_counts.h5ad

Format:

AnnData object
 obs: 'dose_uM', 'timepoint_hr', 'raw_cell_id', 'hashtag_id', 'well', 'container_format', 'row', 'col', 'plate_name', 'cell_id', 'cell_type', 'split', 'donor_id', 'sm_name'
 obsm: 'HTO_clr', 'X_pca', 'X_umap', 'protein_counts'
 layers: 'counts'

Slot description:

Slot Type Description
obs["dose_uM"] integer Dose in micromolar.
obs["timepoint_hr"] float Time point measured in hours.
obs["raw_cell_id"] string Original cell identifier.
obs["hashtag_id"] string Identifier for hashtag oligo.
obs["well"] string Well location in the plate.
obs["container_format"] string Format of the container (e.g., 96-well plate).
obs["row"] string Row in the plate.
obs["col"] integer Column in the plate.
obs["plate_name"] string Name of the plate.
obs["cell_id"] string Unique cell identifier.
obs["cell_type"] string Type of cell (e.g., B cells, T cells CD4+).
obs["split"] string Dataset split type (e.g., control, treated).
obs["donor_id"] string Identifier for the donor.
obs["sm_name"] string Name of the small molecule used for treatment.
obsm["HTO_clr"] matrix Corrected counts for hashing tags.
obsm["X_pca"] matrix Principal component analysis results.
obsm["X_umap"] matrix UMAP dimensionality reduction results.
obsm["protein_counts"] matrix Count data for proteins.
layers["counts"] matrix Raw count data for each gene across cells.

Component type: Process dataset

Path: src/process_dataset

Process the raw dataset

Arguments:

Name Type Description
--sc_counts file Anndata with the counts of the whole dataset.
--de_train file (Output) Differential expression results for training. Default: de_train.h5ad.
--de_test file (Output) Differential expression results for testing. Default: de_test.h5ad.
--id_map file (Output) File indicates the order of de_test, the cell types and the small molecule names. Default: id_map.csv.

File format: DE train

Differential expression results for training.

Example file: resources/datasets/neurips-2023-data/de_train.h5ad

Format:

AnnData object
 obs: 'cell_type', 'sm_name', 'sm_lincs_id', 'SMILES', 'split', 'control'
 layers: 'logFC', 'AveExpr', 't', 'P.Value', 'adj.P.Value', 'B', 'is_de', 'is_de_adj', 'sign_log10_pval', 'clipped_sign_log10_pval'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'single_cell_obs'

Slot description:

Slot Type Description
obs["cell_type"] string The annotated cell type of each cell based on RNA expression.
obs["sm_name"] string The primary name for the (parent) compound (in a standardized representation) as chosen by LINCS. This is provided to map the data in this experiment to the LINCS Connectivity Map data.
obs["sm_lincs_id"] string The global LINCS ID (parent) compound (in a standardized representation). This is provided to map the data in this experiment to the LINCS Connectivity Map data.
obs["SMILES"] string Simplified molecular-input line-entry system (SMILES) representations of the compounds used in the experiment. This is a 1D representation of molecular structure. These SMILES are provided by Cellarity based on the specific compounds ordered for this experiment.
obs["split"] string Split. Must be one of ‘control’, ‘train’, ‘public_test’, or ‘private_test’.
obs["control"] boolean Boolean indicating whether this instance was used as a control.
layers["logFC"] double Log fold change of the differential expression test.
layers["AveExpr"] double (Optional) Average expression of the differential expression test.
layers["t"] double (Optional) T-statistic of the differential expression test.
layers["P.Value"] double P-value of the differential expression test.
layers["adj.P.Value"] double Adjusted P-value of the differential expression test.
layers["B"] double (Optional) B-statistic of the differential expression test.
layers["is_de"] boolean Whether the gene is differentially expressed.
layers["is_de_adj"] boolean Whether the gene is differentially expressed after adjustment.
layers["sign_log10_pval"] double Differential expression value (-log10(p-value) * sign(LFC)) for each gene. Here, LFC is the estimated log-fold change in expression between the treatment and control condition after shrinkage as calculated by Limma. Positive LFC means the gene goes up in the treatment condition relative to the control.
layers["clipped_sign_log10_pval"] double A clipped version of the sign_log10_pval layer. Values are clipped to be between -4 and 4 (i.e. -log10(0.0001) and -log10(0.0001)).
uns["dataset_id"] string A unique identifier for the dataset. This is different from the obs.dataset_id field, which is the identifier for the dataset from which the cell data is derived.
uns["dataset_name"] string A human-readable name for the dataset.
uns["dataset_url"] string (Optional) Link to the original source of the dataset.
uns["dataset_reference"] string (Optional) Bibtex reference of the paper in which the dataset was published.
uns["dataset_summary"] string Short description of the dataset.
uns["dataset_description"] string Long description of the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.
uns["single_cell_obs"] dataframe A dataframe with the cell-level metadata for the training set.

File format: DE test

Differential expression results for testing.

Example file: resources/datasets/neurips-2023-data/de_test.h5ad

Format:

AnnData object
 obs: 'cell_type', 'sm_name', 'sm_lincs_id', 'SMILES', 'split', 'control'
 layers: 'logFC', 'AveExpr', 't', 'P.Value', 'adj.P.Value', 'B', 'is_de', 'is_de_adj', 'sign_log10_pval', 'clipped_sign_log10_pval'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'single_cell_obs'

Slot description:

Slot Type Description
obs["cell_type"] string The annotated cell type of each cell based on RNA expression.
obs["sm_name"] string The primary name for the (parent) compound (in a standardized representation) as chosen by LINCS. This is provided to map the data in this experiment to the LINCS Connectivity Map data.
obs["sm_lincs_id"] string The global LINCS ID (parent) compound (in a standardized representation). This is provided to map the data in this experiment to the LINCS Connectivity Map data.
obs["SMILES"] string Simplified molecular-input line-entry system (SMILES) representations of the compounds used in the experiment. This is a 1D representation of molecular structure. These SMILES are provided by Cellarity based on the specific compounds ordered for this experiment.
obs["split"] string Split. Must be one of ‘control’, ‘train’, ‘public_test’, or ‘private_test’.
obs["control"] boolean Boolean indicating whether this instance was used as a control.
layers["logFC"] double Log fold change of the differential expression test.
layers["AveExpr"] double (Optional) Average expression of the differential expression test.
layers["t"] double (Optional) T-statistic of the differential expression test.
layers["P.Value"] double P-value of the differential expression test.
layers["adj.P.Value"] double Adjusted P-value of the differential expression test.
layers["B"] double (Optional) B-statistic of the differential expression test.
layers["is_de"] boolean Whether the gene is differentially expressed.
layers["is_de_adj"] boolean Whether the gene is differentially expressed after adjustment.
layers["sign_log10_pval"] double Differential expression value (-log10(p-value) * sign(LFC)) for each gene. Here, LFC is the estimated log-fold change in expression between the treatment and control condition after shrinkage as calculated by Limma. Positive LFC means the gene goes up in the treatment condition relative to the control.
layers["clipped_sign_log10_pval"] double A clipped version of the sign_log10_pval layer. Values are clipped to be between -4 and 4 (i.e. -log10(0.0001) and -log10(0.0001)).
uns["dataset_id"] string A unique identifier for the dataset. This is different from the obs.dataset_id field, which is the identifier for the dataset from which the cell data is derived.
uns["dataset_name"] string A human-readable name for the dataset.
uns["dataset_url"] string (Optional) Link to the original source of the dataset.
uns["dataset_reference"] string (Optional) Bibtex reference of the paper in which the dataset was published.
uns["dataset_summary"] string Short description of the dataset.
uns["dataset_description"] string Long description of the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.
uns["single_cell_obs"] dataframe A dataframe with the cell-level metadata.

File format: ID Map

File indicates the order of de_test, the cell types and the small molecule names.

Example file: resources/datasets/neurips-2023-data/id_map.csv

Format:

Tabular data
 'id', 'cell_type', 'sm_name'

Slot description:

Column Type Description
id integer Index of the test observation.
cell_type string Cell type name.
sm_name string Small molecule name.

Component type: Control Method

Path: src/control_methods

A control method.

Arguments:

Name Type Description
--de_train file (Optional) Differential expression results for training.
--de_test file Differential expression results for testing.
--id_map file File indicates the order of de_test, the cell types and the small molecule names.
--layer string (Optional) Which layer to use for prediction. Default: clipped_sign_log10_pval.
--output file (Output) Differential Gene Expression prediction.

Component type: Method

Path: src/methods

A perturbation prediction method

Arguments:

Name Type Description
--de_train file (Optional) Differential expression results for training.
--id_map file File indicates the order of de_test, the cell types and the small molecule names.
--layer string (Optional) Which layer to use for prediction. Default: clipped_sign_log10_pval.
--output file (Output) Differential Gene Expression prediction.
--output_model file (Optional, Output) Optional model output. If no value is passed, the model will be removed at the end of the run.

Component type: Metric

Path: src/metrics

A perturbation prediction metric

Arguments:

Name Type Description
--de_test file Differential expression results for testing.
--de_test_layer string (Optional) In which layer to find the DE data. Default: clipped_sign_log10_pval.
--prediction file Differential Gene Expression prediction.
--prediction_layer string (Optional) In which layer to find the predicted DE data. Default: prediction.
--output file (Output) File indicating the score of a metric.
--resolve_genes string (Optional) How to resolve difference in genes between the two datasets. Default: de_test.
--resolve_genes string (Optional) How to resolve difference in genes between the two datasets. Default: de_test.

File format: Prediction

Differential Gene Expression prediction

Example file: resources/datasets/neurips-2023-data/prediction.h5ad

Format:

AnnData object
 layers: 'prediction'
 uns: 'dataset_id', 'method_id'

Slot description:

Slot Type Description
layers["prediction"] double Predicted differential gene expression.
uns["dataset_id"] string A unique identifier for the dataset. This is different from the obs.dataset_id field, which is the identifier for the dataset from which the cell data is derived.
uns["method_id"] string A unique identifier for the method used to generate the prediction.

File format: Model

Optional model output. If no value is passed, the model will be removed at the end of the run.

Example file: resources/datasets/neurips-2023-data/model/

Format:

Slot description:

File format: Score

File indicating the score of a metric.

Example file: resources/datasets/neurips-2023-data/score.h5ad

Format:

AnnData object
 uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'

Slot description:

Slot Type Description
uns["dataset_id"] string A unique identifier for the dataset.
uns["method_id"] string A unique identifier for the method.
uns["metric_ids"] string One or more unique metric identifiers.
uns["metric_values"] double The metric values obtained for the given prediction. Must be of same length as ‘metric_ids’.