Kaggle competition project--Single cell perturbations
144 compounds were selected from LINCS and applied to PBMCs from 3 donors (2 compounds, dabrafenib and belinostat, serve as positive controls, and DMSO as the negative control)
Each plate contains 96 wells, and each well contains PBMCs from one donor (so each well contains cells of all cell types). Of the 96 wells, 72 hold compounds, 16 hold positive controls, and 8 hold negative controls. The full dataset comprises 2 different compound plates per donor, for a total of 6 plates, with 350 cells per well.
Why include two positive controls and a negative control? One reason is cell multiplexing: all samples in each row of the plate are pooled into a single pool for sequencing. Placing two positive controls and one negative control in each row of the plate allows us to account for this source of noise when we calculate differential expression.
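The well counts above can be checked with a few lines of arithmetic. This is only a sketch: the 8 x 12 row/column layout is an assumption (the notes only say 96 wells with controls placed per row), and `plate_summary` is a hypothetical helper name.

```python
# Plate arithmetic from the notes. ROWS and COLS are an ASSUMED 8 x 12 layout;
# the notes only state 96 wells, with 2 positive and 1 negative control per row.
ROWS = 8
COLS = 12
POS_PER_ROW = 2        # positive controls per row (from the notes)
NEG_PER_ROW = 1        # negative controls per row (from the notes)
DONORS = 3
PLATES_PER_DONOR = 2
CELLS_PER_WELL = 350   # nominal figure quoted in the notes


def plate_summary():
    """Return the derived well, plate, and cell counts as a dict."""
    wells = ROWS * COLS
    positive = ROWS * POS_PER_ROW
    negative = ROWS * NEG_PER_ROW
    compound = wells - positive - negative
    plates = DONORS * PLATES_PER_DONOR
    return {
        "wells": wells,
        "positive_wells": positive,
        "negative_wells": negative,
        "compound_wells": compound,
        # 2 different compound plates, each with `compound` compound wells
        "distinct_compounds": compound * PLATES_PER_DONOR,
        "plates": plates,
        "nominal_cells": plates * wells * CELLS_PER_WELL,
    }


if __name__ == "__main__":
    for key, value in plate_summary().items():
        print(f"{key}: {value}")
```

Under these assumptions the per-row controls reproduce the 16/8/72 split, and two compound plates of 72 wells each give the 144 compounds.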
There is no differential expression (DE) data for the DMSO samples, because DMSO is the negative control. All DE output is calculated in reference to DMSO, i.e. the DE analysis asks "how confident am I that each gene increased or decreased relative to DMSO due to the compound treatment".
- Training dataset: all compounds in T cells and NK cells, and 10% of the compounds in B and Myeloid cells
- Testing dataset: randomly chosen compounds in B and Myeloid cells
- de_train.parquet: 614 rows (one per cell type/compound pair) and 18211 genes. The first 5 columns identify the cell type/compound pair and include a Boolean control indicator.
- adata_train.parquet: uses a different format, a COO (coordinate list) sparse-array format. Other fields: obs_id...
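To make the COO idea concrete: each record stores one non-zero entry as (row, column, value), and the dense matrix is rebuilt by scattering the values into a zero matrix. The sketch below uses plain Python and toy data. Only `obs_id` comes from the notes; the gene names, the value field, and the `coo_to_dense` helper are hypothetical and not the actual file schema.

```python
def coo_to_dense(records):
    """Convert (obs_id, gene, value) triples into a dense row-major matrix.

    Returns (obs_ids, genes, matrix), where matrix[i][j] is the value for
    obs_ids[i] and genes[j]. Entries absent from `records` stay 0.
    Duplicate (obs_id, gene) pairs are summed, the usual COO convention.
    """
    records = list(records)  # allow generators; we iterate more than once
    obs_ids = sorted({obs for obs, _, _ in records})
    genes = sorted({gene for _, gene, _ in records})
    row_of = {obs: i for i, obs in enumerate(obs_ids)}
    col_of = {gene: j for j, gene in enumerate(genes)}

    matrix = [[0] * len(genes) for _ in obs_ids]
    for obs, gene, value in records:
        matrix[row_of[obs]][col_of[gene]] += value
    return obs_ids, genes, matrix


if __name__ == "__main__":
    # Toy records with made-up identifiers, not real data.
    toy = [
        ("cell_a", "GENE1", 3),
        ("cell_a", "GENE3", 1),
        ("cell_b", "GENE2", 5),
    ]
    obs_ids, genes, matrix = coo_to_dense(toy)
    print(obs_ids, genes)
    for row in matrix:
        print(row)
```

For the real file, a sparse-matrix library would be a better fit than nested lists, since the dense cells x genes matrix is mostly zeros.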
Modelling differential expression: predict the differential expression of each gene in reference to the negative control (DMSO)
Evaluation metric: Mean Rowwise Root Mean Squared Error (MRRMSE)
i indexes the cells and j indexes the genes
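Reading the metric name literally, it is an RMSE over the genes j of each row i, averaged over rows: MRRMSE = (1/R) * sum_i sqrt((1/n) * sum_j (y_ij - yhat_ij)^2), with R rows and n genes. The formula is reconstructed from the name rather than copied from the competition page, so treat it and the sketch below as an assumption. The `mrrmse` function name is illustrative.

```python
import math


def mrrmse(y_true, y_pred):
    """Mean Rowwise RMSE for two equally shaped lists of rows.

    Row i is a cell (cell type/compound pair), column j is a gene.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred need the same, non-zero row count")

    row_rmses = []
    for true_row, pred_row in zip(y_true, y_pred):
        if not true_row or len(true_row) != len(pred_row):
            raise ValueError("each row pair needs the same, non-zero length")
        # Mean squared error over the genes j of this row
        mse = sum((t - p) ** 2 for t, p in zip(true_row, pred_row)) / len(true_row)
        row_rmses.append(math.sqrt(mse))

    # Average the per-row RMSE values over all rows i
    return sum(row_rmses) / len(row_rmses)


if __name__ == "__main__":
    y_true = [[0.0, 0.0], [3.0, 4.0]]
    y_pred = [[0.0, 0.0], [0.0, 0.0]]
    print(mrrmse(y_true, y_pred))
```

Because the square root is taken per row before averaging, a few badly predicted rows weigh less here than they would in a single global RMSE.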
Several methods have been developed for drug perturbation prediction, most of which are variations on the autoencoder architecture (Dr.VAE, scGEN, and ChemCPA).