Code repository for the paper GLDM: Hit Molecule Generation with Constrained Graph Latent Diffusion Model.
Create the GLDM conda environment with the config file GLDM.yml:

```
conda env create --file=GLDM.yml
```
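Then activate it (assuming the environment name defined in GLDM.yml is `GLDM`):

```
conda activate GLDM
```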
To directly use the trained models, please skip this part and refer to the next section.
We provide molecule data from the GuacaMol and L1000 datasets, processed by the MoLeR algorithm, at this share point. If other datasets are required, please refer to MoLeR for preprocessing.
Our gene expression data is downloaded from the BiAAE repository. It is also available here.
When executing the training scripts for the first time, additional preprocessing will be performed to convert the data samples into pytorch_geometric data. This process will take some time.
In addition, remember to change the following variables in the scripts:
In autoencoder/train_guacamol.py and autoencoder/train_l1000.py, `raw_moler_trace_dataset_parent_folder` and `output_pyg_trace_dataset_parent_folder` refer to the MoLeR-processed data folder and the pytorch_geometric-compatible data folder, respectively. Please unzip the trace directory downloaded from the share point and put it under `raw_moler_trace_dataset_parent_folder`. The pytorch_geometric data will be automatically stored in `output_pyg_trace_dataset_parent_folder`.
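For illustration, a hedged sketch of how these variables might be set (the paths are placeholders, and the exact assignment style in the scripts may differ):

```python
# Placeholder paths for illustration only — adapt to your setup.
# Unzip the downloaded trace directory under this folder:
raw_moler_trace_dataset_parent_folder = "data/guacamol/moler_traces"
# pytorch_geometric data is written here automatically on first run:
output_pyg_trace_dataset_parent_folder = "data/guacamol/pyg_traces"
```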
In autoencoder/train_l1000.py, `gene_exp_controls_file_path` and `gene_exp_tumor_file_path` store the gene expression profiles of the control and treated cell lines, and `lincs_csv_file_path` stores the experiment indices. Please put robust_normalized_controls.npz under `gene_exp_controls_file_path`, robust_normalized_tumors.npz under `gene_exp_tumor_file_path`, and experiments_filtered.csv under `lincs_csv_file_path`.
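Likewise, a hedged sketch of the L1000 path variables (placeholder paths):

```python
# Placeholder paths for illustration only — point these at the downloaded files.
gene_exp_controls_file_path = "data/l1000/robust_normalized_controls.npz"  # control cell lines
gene_exp_tumor_file_path = "data/l1000/robust_normalized_tumors.npz"      # treated cell lines
lincs_csv_file_path = "data/l1000/experiments_filtered.csv"               # experiment indices
```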
Train GLDM with the WAE loss:

```
cd autoencoder
python train_guacamol.py \
    --layer_type=FiLMConv \
    --model_architecture=aae \
    --gradient_clip_val=0.0 \
    --max_lr=1e-4 \
    --gen_step_drop_probability=0.5 \
    --using_wasserstein_loss --using_gp
```
- To use another GNN layer type, change the `--layer_type` value to `GATConv` or `GCNConv`.
- For GLDM with other regularization losses (see the example command after this list):
  - GLDM + GAN loss: remove `--using_wasserstein_loss --using_gp`.
  - GLDM + VAE loss: remove `--using_wasserstein_loss --using_gp` and change `--model_architecture=vae`.
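For example, a minimal sketch of the VAE variant, reusing the remaining flags from the WAE command above:

```
python train_guacamol.py \
    --layer_type=FiLMConv \
    --model_architecture=vae \
    --gradient_clip_val=0.0 \
    --max_lr=1e-4 \
    --gen_step_drop_probability=0.5
```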
Model checkpoints will automatically be saved under the current folder.
Similarly, train the constrained model on the L1000 dataset:

```
python train_l1000.py \
    --layer_type=FiLMConv \
    --model_architecture=aae \
    --gradient_clip_val=0.0 \
    --use_oclr_scheduler \
    --max_lr=1e-4 \
    --using_wasserstein_loss --using_gp \
    --gen_step_drop_probability=0.0
```
- To try another GNN layer type or regularization loss, refer to the previous section for how to modify the parameters.
- To finetune a pretrained unconstrained model, configure `--pretrained_ckpt` with the pretrained model checkpoint and set `--pretrained_ckpt_model_type` to `vae` or `aae` accordingly, as in the sketch after this list.
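For instance, a hypothetical finetuning run starting from an unconstrained WAE checkpoint (the checkpoint path is a placeholder):

```
python train_l1000.py \
    --layer_type=FiLMConv \
    --model_architecture=aae \
    --gradient_clip_val=0.0 \
    --use_oclr_scheduler \
    --max_lr=1e-4 \
    --using_wasserstein_loss --using_gp \
    --gen_step_drop_probability=0.0 \
    --pretrained_ckpt=path/to/unconstrained_wae.ckpt \
    --pretrained_ckpt_model_type=aae
```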
Make sure the encoder model has been trained before proceeding to train the latent diffusion model. Configure the path to the encoder model checkpoint in the file passed to `--config_file`. Also remember to change the path variables, similarly to training the encoder and decoder.
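As a sketch, the checkpoint entry might look like the following; the key name here is hypothetical, so match it against the keys actually used in the shipped config files under ldm/config/:

```yaml
# Hypothetical key name for illustration — check ldm/config/*.yml for the real one.
first_stage_ckpt: /path/to/trained/autoencoder.ckpt
```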
```
cd ldm
python train_ldm_guacamol.py \
    --layer_type=FiLMConv \
    --model_architecture=aae \
    --use_oclr_scheduler \
    --gradient_clip_val=1.0 \
    --max_lr=1e-4 \
    --using_wasserstein_loss --using_gp \
    --gen_step_drop_probability=0.9 \
    --config_file=config/ldm_uncon+wae_uncon.yml
```
For the constrained latent diffusion model on the L1000 dataset:

```
python train_ldm_l1000.py \
    --layer_type=FiLMConv \
    --model_architecture=aae \
    --use_oclr_scheduler \
    --gradient_clip_val=1.0 \
    --max_lr=1e-4 \
    --using_wasserstein_loss --using_gp \
    --gen_step_drop_probability=0.9 \
    --config_file=config/ldm_con+wae_con.yml
```
Please download the trained models from Zenodo and refer to ldm/sample_ldm.ipynb for sampling from them.
To run the GuacaMol distribution-learning benchmarks:

```
python distribution_learning.py \
    --using_ldm \
    --ldm_ckpt=model_ckpt/GLDM_WAE_uncond.ckpt \
    --ldm_config=ldm/config/ldm_uncon+wae_uncon.yml \
    --output_fp=guacamol_latest_ldm_wae_10000.json \
    --smiles_file=guacamol_latest_ldm_wae_10000_smiles.pkl \
    --number_samples=10000
```
The scores of the benchmarks will be shown at the end of the output file (stored at `--output_fp`).
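For convenience, a small sketch for pulling the scores out of the report, assuming the standard GuacaMol results schema (a top-level "results" list of per-benchmark records):

```python
import json

# Load the report written by distribution_learning.py (--output_fp).
with open("guacamol_latest_ldm_wae_10000.json") as f:
    report = json.load(f)

# Each entry is assumed to carry "benchmark_name" and "score",
# as in the standard GuacaMol output; adjust if your version differs.
for result in report["results"]:
    print(f"{result['benchmark_name']}: {result['score']:.4f}")
```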
First, generate the hit candidates constrained by the gene expression changes in the test dataset:

```
python evaluate_ldm_l1000_metrics.py -d cuda -m wae
```
Then refer to ldm/testset_rediscovery.ipynb for evaluations.
For the SA and QED score evaluations, refer to ldm/SA_QED.ipynb for details.
20 protein-ligand complex structures are used as baselines. Meta information is recorded in binding_affinity/protein/metadata.csv.
- Download the binding complex structures from PDB:

  ```
  cd binding_affinity
  python download_pdb.py
  ```

- Generate the molecules targeted at the selected proteins:

  ```
  cd ldm
  python evaluate_ldm_l1000_metrics.py -d cuda -m wae -b
  ```

- Save the 3D conformers of the generated molecules:

  ```
  cd binding_affinity
  python save_ligand.py -m wae
  ```

- Molecular docking with Gnina:

  ```
  python baseline_docking.py
  python molecule_docking.py -m wae
  ```
`baseline_docking.py` performs docking for the original molecules, and `molecule_docking.py` deals with the generated molecules. Change the `-m` parameter accordingly if you want to check other models, as in the example below.
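For example, a hypothetical invocation for the VAE model (the accepted `-m` values are assumed to mirror the model tags used above; check the script's argument parser to confirm):

```
python molecule_docking.py -m vae
```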
Refer to binding_affinity/evaluate_gnina.ipynb for analyzing the results.
To cite this work:

```bibtex
@article{10.1093/bib/bbae142,
    author = {Wang, Conghao and Ong, Hiok Hian and Chiba, Shunsuke and Rajapakse, Jagath C},
    title = "{GLDM: hit molecule generation with constrained graph latent diffusion model}",
    journal = {Briefings in Bioinformatics},
    volume = {25},
    number = {3},
    pages = {bbae142},
    year = {2024},
    month = {04},
    issn = {1477-4054},
    doi = {10.1093/bib/bbae142},
    url = {https://doi.org/10.1093/bib/bbae142},
    eprint = {https://academic.oup.com/bib/article-pdf/25/3/bbae142/57160111/bbae142.pdf},
}
```