CARDAMOM is an executable gene regulatory network (GRN) inference method adapted to time-course scRNA-seq datasets. The algorithm consists of calibrating the parameters of a mechanistic model of gene expression: the calibrated model can then be simulated, which makes it possible to reproduce the dataset used for inference. The inference method was introduced in [1]. It has been benchmarked along with other GRN inference tools and applied to a real dataset in [2]. The simulation part is based on the Harissa package.
The package depends on the standard scientific libraries numpy and scipy. It also depends on numba, which accelerates the inference method as well as the simulations from harissa. The alive_progress package is used to show the progress of simulate_data, while umap-learn and matplotlib are used by the visualize_data function. They can be installed using pip:
pip install umap-learn alive_progress harissa==3.0.7
For each analysis, the user must create a dedicated directory myproject containing the following folders (a setup sketch is given after the list):
- cardamom/ can be empty before inference; it will contain the results of the infer_network.py script;
- Data/ must contain two files, one named panel_real.txt containing the dataset, and one named panel_genes.txt containing the gene names;
- Rates/ must contain a file degradation_rates.txt that will be used by the simulate_data.py script;
- Results/ can be empty before inference; it will contain the results of the visualize_data.py script.
An example is tests/Network4, which corresponds to one of the networks used in [2].
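As a minimal sketch (assuming the directory name myproject and the file names listed above), this layout can be created with a few lines of Python:

```python
from pathlib import Path

# Hypothetical project name; adapt to your own analysis.
project = Path("myproject")

# Create the four folders expected by CARDAMOM.
for folder in ("cardamom", "Data", "Rates", "Results"):
    (project / folder).mkdir(parents=True, exist_ok=True)

# The input files must then be placed as follows:
#   myproject/Data/panel_real.txt          (count table)
#   myproject/Data/panel_genes.txt         (gene names, "Stimulus" first)
#   myproject/Rates/degradation_rates.txt  (degradation rates)
```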
Prior to analyzing an experimental scRNA-seq dataset with CARDAMOM, we recommend that you perform the following steps:
- Select a list of relevant genes;
- Define degradation rates for the corresponding mRNAs and proteins;
- Generate the count table.
Selecting relevant genes is a critical and not-so-easy task. You should aim for a list on the order of a hundred genes. CARDAMOM can handle more, but the resulting GRN will be less and less interpretable as the number of genes grows. For the time being, we recommend starting with, say, 50 genes, and adding more if needed (i.e., if this improves the overall reproduction of the data).
The selection step can be performed using any combination of the following: (1) genes known in the literature to be important for the process being studied; (2) highly variable genes (HVG) from algorithms such as scran or Seurat; (3) "dynamically important" genes; (4) differentially expressed genes using a basic t-test between two time points; (5) the most delta-entropic genes between two time points. A sketch of option (4) is given below.
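For illustration of option (4) only, a basic per-gene t-test between two time points might look like the following sketch (the count matrix, time labels, and number of genes kept are synthetic placeholders):

```python
import numpy as np
from scipy import stats

# Synthetic placeholder data: a (genes x cells) count matrix and, for each
# cell, the time point at which it was sampled.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(200, 300)).astype(float)
times = rng.choice([0.0, 24.0], size=300)

# Welch t-test between the two time points, gene by gene.
group0 = counts[:, times == 0.0]
group1 = counts[:, times == 24.0]
pvals = np.array([stats.ttest_ind(a, b, equal_var=False).pvalue
                  for a, b in zip(group0, group1)])

# Keep, for instance, the 50 genes with the smallest p-values.
selected = np.argsort(pvals)[:50]
```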
This selected list of genes will have to be augmented by adding, on the first line, a "gene" called "Stimulus". It should take the value 0 for the 0h time point and 1 for all the others (see below). The list of gene names should be saved as myproject/Data/panel_genes.txt.
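As a hedged sketch, assuming a plain one-name-per-line format (check tests/Network4 for the exact layout expected by the scripts), the gene list could be written as follows:

```python
# Hypothetical selected genes; replace with your own list.
selected_genes = ["Gata1", "Spi1", "Tal1"]

# "Stimulus" must appear on the first line, before the selected genes.
with open("myproject/Data/panel_genes.txt", "w") as f:
    for name in ["Stimulus"] + selected_genes:
        f.write(name + "\n")
```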
You can find half-lives for mRNAs and proteins in human in two papers: Blumberg et al. (2021) and Li et al. (2021). Only "short-lived" protein half-lives are estimated in the Li paper. In any case, the "maximum" half-life of proteins is effectively set by the cell cycle duration, so make sure that no protein has a half-life larger than the cell cycle duration. A compiled table can be found here.
You will then have to generate the degradation rates from the half-lives using the following formula: degradation_rate = log(2) / half_life. The list of degradation rates should be saved as myproject/Rates/degradation_rates.txt.
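A minimal sketch of this conversion (the half-life values, the time unit, and the two-column layout are assumptions; check tests/Network4 for the exact format expected by simulate_data.py):

```python
import numpy as np

# Placeholder half-lives (here in hours), one entry per gene,
# taken for instance from the papers cited above.
mrna_half_lives = np.array([2.0, 5.0, 3.5])
protein_half_lives = np.array([20.0, 46.0, 30.0])

# degradation_rate = log(2) / half_life
d0 = np.log(2) / mrna_half_lives      # mRNA degradation rates
d1 = np.log(2) / protein_half_lives   # protein degradation rates

np.savetxt("myproject/Rates/degradation_rates.txt",
           np.column_stack((d0, d1)))
```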
CARDAMOM will need a count table in .txt format with cells as columns and genes as rows. The first row must correspond to the time points at which the cells are sampled, and the first column to the index of each gene. Each subsequent line then gives the mRNA counts associated with a gene for each cell at each time point. Note that the second line corresponds to the stimulus, which is set to 0 at t = 0h and to 1 at t > 0h (see [1], section 5.1, for details). CARDAMOM expects integer values (mRNA counts).
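The following sketch simply illustrates this layout with a few sanity checks (it assumes a whitespace-delimited file named as in the project layout above):

```python
import numpy as np

# First row: sampling time points; first column: gene index;
# second row: the stimulus (0 at t = 0h, 1 afterwards).
data = np.loadtxt("myproject/Data/panel_real.txt")

times = data[0, 1:]
stimulus = data[1, 1:]
counts = data[2:, 1:]

assert np.all(stimulus[times == 0] == 0)
assert np.all(stimulus[times > 0] == 1)
assert np.all(counts == np.round(counts)), "counts must be integers"
```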
Run the following script to calibrate the model from myproject/Data/panel_real.txt:
python infer_network.py -i [myproject]
The output consists of the following files saved in myproject/cardamom/, where G is the number of genes (including the stimulus) and T is the number of time points in the data:
- basal_t.npy: matrix of size T × G × 1 containing the GRN basal parameters inferred at each time point;
- inter_t.npy: matrix of size T × G × G containing the GRN interaction parameters inferred at each time point;
- basal.npy: matrix of size G × 1 with basal = basal_t[-1], containing the final basal parameters;
- inter.npy: matrix of size G × G with inter = inter_t[-1], containing the final interaction parameters;
- kmin.npy: vector of size G × 1 containing the minimal burst frequency for each gene;
- kmax.npy: vector of size G × 1 containing the maximal burst frequency for each gene;
- bet.npy: vector of size G × 1 containing the inverse burst size parameter for each gene;
- data_bool.npy: matrix of the same size as the data, used for initializing simulations.
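A quick, hedged way to inspect these outputs (only the array shapes documented above are assumed):

```python
import numpy as np

# Load the GRN parameters inferred by infer_network.py.
inter = np.load("myproject/cardamom/inter.npy")      # G x G
inter_t = np.load("myproject/cardamom/inter_t.npy")  # T x G x G

G, T = inter.shape[0], inter_t.shape[0]
print(f"{G} genes (including stimulus), {T} time points")

# Entry of largest magnitude in the final interaction matrix.
i, j = np.unravel_index(np.argmax(np.abs(inter)), inter.shape)
print(f"largest interaction entry: inter[{i}, {j}] = {inter[i, j]:.3f}")
```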
Run the following script to simulate the model from the parameters stored in myproject/cardamom:
python simulate_data.py -i [myproject]
The output is saved as myproject/Data/panel_simulated.txt.
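A small sanity check before visualization (a sketch, assuming the file names and table layout above; whether the number of simulated cells matches the real one depends on the simulation settings):

```python
import numpy as np

real = np.loadtxt("myproject/Data/panel_real.txt")
sim = np.loadtxt("myproject/Data/panel_simulated.txt")

# Both tables should describe the same genes (rows).
assert real.shape[0] == sim.shape[0], "gene rows should match"
print("real cells:", real.shape[1] - 1, "| simulated cells:", sim.shape[1] - 1)
```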
Run the following script to compare UMAP representations of myproject/Data/panel_real.txt and myproject/Data/panel_simulated.txt:
python visualize_data.py -i [myproject]
The output images are saved in myproject/Results/.
[1] E. Ventre. Reverse engineering of a mechanistic model of gene expression using metastability and temporal dynamics. In Silico Biology, 2021.
[2] E. Ventre, U. Herbach et al. One model fits all: Combining inference and simulation of gene regulatory networks. PLOS Computational Biology, 2023.