DynPerturb is an advanced deep learning model designed to infer gene regulatory networks (GRNs) and analyze the effects of perturbations on cellular states using single-cell RNA-seq data. By incorporating both temporal and spatial information, DynPerturb enhances the understanding of gene interactions during cellular development, disease progression, and response to perturbations, making it an invaluable tool for biologists and researchers in drug discovery, genetic studies, and disease modeling.
All training data and model parameters used in this study are available at https://bgipan.genomics.cn/#/link/t2YuR3VHmS0Jaozwqlvk (access code: p2Qv).
Benchmark gene pairs for the mESC and hESC datasets are available from https://github.com/xiaoyeye/TDL.
- **Adult Human Kidney Single-Cell RNA-seq (Version 1.5)**
  - Source: CellxGene Single-Cell Data
  - This dataset includes single-cell gene expression profiles from different cell types of the human kidney.
- **Human Bone Marrow Hematopoietic Development (Balanced Reference Map)**
  - Source: CellxGene Bone Marrow Data
  - This dataset helps explore the differentiation process of blood cells from human bone marrow.
- **Murine Cardiac Development Spatiotemporal Transcriptome Sequencing**
  - Source: GigaScience article
  - Provides a detailed spatial transcriptomic map of murine heart development, useful for understanding heart tissue differentiation and development.
| Component | Version |
|---|---|
| Operating System | Kylin Linux Advanced Server V10 (Sword) |
| Python | 3.10.16 |
| CUDA | 12.2 |
| NVIDIA Driver | 535.104.12 |
| Core Dependencies | Refer to requirements.txt |
| Hardware Item | Specification/Model |
|---|---|
| CPU Architecture | aarch64 (ARM) |
| CPU Model | HiSilicon Kunpeng-920 |
| Total RAM | 256 GB (266414208 kB) |
| GPU | NVIDIA A100-PCIE-40GB |
The Python dependencies for this project are listed in the requirements.txt file.
- Download the requirements.txt file from this repository.
- Create a conda environment (if you don't have one already):

  ```bash
  conda create --name DynPerturb python=3.10
  ```

- Activate the conda environment:

  ```bash
  conda activate DynPerturb
  ```

- Install the dependencies by running the following command:

  ```bash
  pip install -r requirements.txt
  ```
This will install all the Python dependencies needed for the project.
**Installation Time:** Approximately 45 minutes.
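To confirm the environment was set up correctly, a quick sanity check such as the following can help (a minimal sketch; it assumes only that PyTorch is among the installed dependencies):

```python
# Sanity check: verify that PyTorch installed correctly and can see the GPU.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```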
Training Command
This script is used to train a self-supervised model for link prediction in graph-based data. The training process is designed to handle large-scale datasets and support distributed training using PyTorch's DistributedDataParallel (DDP).
```bash
python train_main_link.py -d aPT-B --use_memory --memory_updater rnn --message_function mlp > log.log 2>&1
```

- `--use_memory`: Enables memory augmentation for nodes during training. This can enhance the model's ability to remember historical interactions or patterns in the data, which is particularly useful for temporal graph models.
- `--memory_updater rnn`: Specifies the memory update mechanism. The `rnn` option uses a Recurrent Neural Network (RNN) to update and manage node memory over time, making it suitable for tasks that require temporal memory updates.
- `--message_function mlp`: Sets the message function used to process information between nodes. The `mlp` option uses a Multi-Layer Perceptron (MLP) to aggregate and transform messages exchanged between nodes during computation, allowing the model to learn complex relationships between nodes.
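For intuition, an RNN memory updater treats each node's stored memory as the hidden state of a recurrent cell, with the aggregated message as its input. The sketch below illustrates this idea only; it is not the code behind `train_main_link.py`, and all names and dimensions in it are hypothetical:

```python
# Illustrative sketch of an RNN-style node memory update with an MLP message
# function (hypothetical names and dimensions, not DynPerturb's implementation).
import torch
import torch.nn as nn

num_nodes, memory_dim, message_dim = 1000, 100, 100

memory = torch.zeros(num_nodes, memory_dim)    # one memory vector per node
updater = nn.GRUCell(message_dim, memory_dim)  # the "rnn" memory updater
message_fn = nn.Sequential(                    # the "mlp" message function
    nn.Linear(2 * memory_dim, message_dim),
    nn.ReLU(),
    nn.Linear(message_dim, message_dim),
)

# For a batch of interacting node pairs, build a message from both endpoints'
# memories, then update the source nodes' memories with the recurrent cell.
src = torch.randint(0, num_nodes, (64,))
dst = torch.randint(0, num_nodes, (64,))
messages = message_fn(torch.cat([memory[src], memory[dst]], dim=1))
# Detach to keep the stored memory out of the autograd graph, as is typical
# for temporal-graph-style memory modules.
memory[src] = updater(messages, memory[src]).detach()
```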
Perturbation and Extraction of Node Features
This script is designed to perform perturbation and extraction of node features in a link prediction task. Specifically, it involves the process of generating embeddings for nodes, extracting their features over time, and saving these embeddings for future use.
```bash
python train_ChangeNodeFeat_SaveEmbeddings_link.py --data HumanBone --bs 64 --n_epoch 100 --n_layer 1
```

Parameters:

- `--data`: Dataset name, e.g., "HumanBone".
- `--bs`: Batch size for training.
- `--n_epoch`: Number of epochs.
- `--n_layer`: Number of network layers.
- `--lr`: Learning rate.
**Runtime:** The total computational runtime for training and extracting embeddings using 9 clusters is 184 hours (160 hours for training plus 24 hours for embedding extraction).
Expected Results:
- Node Temporal Embeddings File
  - `embeddings_.json`: The file contains a chronologically ordered series of high-dimensional state vectors. Each vector documents the state of a specific biological entity (such as a gene or cell type) at a distinct point in time during hematopoietic development in the bone marrow.
- Execution Log
  - `train.log`: A text file for logging and debugging.
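To take a first look at the saved embeddings, the file can be loaded with the standard library. This is a minimal sketch: the exact file name comes from your run, and the assumed JSON layout (node IDs mapped to per-timestep vectors) is an assumption, not a documented format:

```python
# Sketch: inspect the saved embeddings (assumed layout: a JSON object mapping
# node IDs to lists of per-timestep embedding vectors).
import json

with open("embeddings_.json") as f:  # use the file name produced by your run
    embeddings = json.load(f)

node_id, vectors = next(iter(embeddings.items()))
print(f"{len(embeddings)} nodes; node {node_id} has {len(vectors)} time points, "
      f"each of dimension {len(vectors[0])}")
```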
Training Command
This script is used for self-supervised node classification training with distributed data parallelism (DDP) using PyTorch. The code supports multi-GPU and multi-node training environments to scale efficiently.
```bash
python train_main_ddp.py -d HumanBone --memory_dim 1000 --use_memory --num_classes > log.log 2>&1
```

- `--memory_dim`: Sets the dimension of the memory space for the model. The `memory_dim` controls how much memory each node holds, which can influence model performance.
- `--use_memory`: Enables node memory augmentation, which helps the model retain and utilize information from previous steps or nodes. This is particularly helpful for tasks requiring historical context.
- `--num_classes`: Specifies the number of classes for node classification. It defines the total number of distinct categories or labels each node can be classified into during training, which is essential for multi-class classification tasks, where the model predicts a class label for each node in the graph.
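The DDP setup follows the standard PyTorch pattern: one process per GPU, each wrapping the model in `DistributedDataParallel`. The skeleton below illustrates that pattern in isolation (placeholder model, launched via `torchrun`); it is not DynPerturb's actual training loop:

```python
# Standard single-node DDP skeleton (illustrative only; launch with
# `torchrun --nproc_per_node=<num_gpus> this_script.py`).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")  # torchrun provides rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1000, 10).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])  # syncs gradients across GPUs

    # ... training loop: each rank processes its own shard of the data ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```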
Perturbation and Extraction of Node Features
This script is designed for distributed inference on a temporal graph, focusing on perturbation and extraction of node features (i.e. embeddings), using a pretrained model. It computes temporal node embeddings across time and saves them for downstream tasks such as analysis or visualization.
```bash
python train_ChangeNodeFeat_SaveEmbeddings_ddp.py --data HumanBone --bs 64 --n_epoch 100 --n_layer 1
```

Parameters:

- `--data`: The dataset name, for example, "HumanBone".
- `--bs`: Batch size used during training.
- `--n_epoch`: Number of epochs to train the model.
- `--n_layer`: Number of layers in the neural network.
- `--lr`: Learning rate for optimization.
**Runtime:** The total computational runtime for training and extracting embeddings is 13 hours (8 hours for training plus 5 hours for embedding extraction).
Expected Results:
- Node Temporal Embeddings File
  - `embeddings_.json`: The file contains a chronologically ordered series of high-dimensional state vectors, which quantify the impact of lineage-specific transcription factor perturbations on hematopoietic trajectories by documenting the state of each biological entity (e.g., a transcription factor or cell type) at distinct points in developmental time.
- Execution Log
  - `train.log`: A text file for logging and debugging.
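A common way to summarize these perturbation outputs is to compare the perturbed embeddings with a baseline run at each time point, for example by cosine distance. The sketch below assumes two embedding files with the layout described above; the file names and layout are assumptions:

```python
# Sketch: per-timestep cosine distance between baseline and perturbed
# embeddings for one node (file names and JSON layout are assumptions).
import json

import numpy as np

def load(path):
    with open(path) as f:
        return json.load(f)

baseline = load("embeddings_baseline.json")    # hypothetical file names
perturbed = load("embeddings_perturbed.json")

node = next(iter(baseline))
for t, (b, p) in enumerate(zip(baseline[node], perturbed[node])):
    b, p = np.asarray(b), np.asarray(p)
    cosine = b @ p / (np.linalg.norm(b) * np.linalg.norm(p))
    print(f"node {node}, t={t}: cosine distance = {1.0 - cosine:.4f}")
```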
Training Command
This script is used for self-supervised node classification training with distributed data parallelism (DDP) using PyTorch. The code supports multi-GPU and multi-node training environments to scale efficiently.
```bash
python train_main_ddp.py -d mouse --memory_dim 1000 --use_memory > log.log 2>&1
```

- `--memory_dim`: Sets the dimension of the memory space for the model. The `memory_dim` controls how much memory each node holds, which can influence model performance.
- `--use_memory`: Enables node memory augmentation, which helps the model retain and utilize information from previous steps or nodes. This is particularly helpful for tasks requiring historical context.
- `--num_classes`: Specifies the number of classes for node classification. It defines the total number of distinct categories or labels each node can be classified into during training.
Perturbation and Extraction of Node Features
This script is designed for distributed inference on a temporal graph, focusing on perturbation and extraction of node features (i.e. embeddings), using a pretrained model. It computes temporal node embeddings across time and saves them for downstream tasks such as analysis or visualization.
```bash
python train_ChangeNodeFeat_SaveEmbeddings_ddp.py --data mouse --bs 64 --n_epoch 100 --n_layer 1
```

Parameters:

- `--data`: The dataset name, for example, "mouse".
- `--bs`: Batch size used during training.
- `--n_epoch`: Number of epochs to train the model.
- `--n_layer`: Number of layers in the neural network.
- `--lr`: Learning rate for optimization.
**Runtime:** The total computational runtime for training and extracting embeddings is 6 hours (4 hours for training plus 2 hours for embedding extraction).
Expected Results:
- Node Spatiotemporal Embeddings File
  - `embeddings_.json`: The file provides a quantitative, spatiotemporal atlas delineating cardiac development at the molecular level.
- Execution Log
  - `train.log`: A text file for logging and debugging.
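For a first look at such a spatiotemporal atlas, projecting all per-timestep vectors into two dimensions is a common starting point. A minimal sketch with scikit-learn follows; the file name, the JSON layout, and the choice of PCA are all assumptions rather than part of the pipeline:

```python
# Sketch: project every (node, timestep) embedding into 2D with PCA
# (file name, layout, and the use of PCA are assumptions).
import json

import numpy as np
from sklearn.decomposition import PCA

with open("embeddings_.json") as f:  # use the file name produced by your run
    embeddings = json.load(f)

# Stack every per-timestep vector of every node into one matrix, then fit PCA.
matrix = np.vstack([np.asarray(vecs) for vecs in embeddings.values()])
coords = PCA(n_components=2).fit_transform(matrix)
print("2D projection shape:", coords.shape)  # (num_nodes * num_timesteps, 2)
```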
This project is licensed under the MIT License. See the LICENSE file for details.
