Skip to content

SCSE-Biomedical-Computing-Group/GLDM

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GLDM: Hit Molecule Generation with Constrained Graph Latent Diffusion Model

Code repository for the paper GLDM: Hit Molecule Generation with Constrained Graph Latent Diffusion Model.

Environment setup

Create the GLDM conda environment with the config file GLDM.yml

conda env create --file=GLDM.yml

Training GLDM from scratch

To directly use trained models, please skip and refer to next section.

Data preprocessing

We provide molecule data from GuacaMol and L1000 datasets processed by the MoLeR algorithm at this share point. If other datasets is required, please refer to MoLeR for preprocessing.

Our gene expression data is downloaded from the BiAAE repository. It is also available here.

Training the encoder and decoder

When excuting the training scripts for the first time, additional preprocessing will be done to convert data samples into pytorch_geometric data. This process will take some time.

In addition, remember to change the following variables in the scripts:

In autoencoder/train_guacamol.py and autoencoder/train_l1000.py, raw_moler_trace_dataset_parent_folder and output_pyg_trace_dataset_parent_folder refer to the MoLeR processed data folder and the pytorch_geometric acceptable data folder. Please unzip the trace directory downloaded from the share point, and put it under raw_moler_trace_dataset_parent_folder. The pytorch_geometric data will be automatically stored in output_pyg_trace_dataset_parent_folder.

In autoencoder/train_l1000.py, gene_exp_controls_file_path and gene_exp_tumor_file_path stores the gene expression profiles of control and treated cell lines, and lincs_csv_file_path stores the experiment idx. Please put robust_normalized_controls.npz under gene_exp_controls_file_path, put robust_normalized_tumors.npz under gene_exp_tumor_file_path, and put experiments_filtered.csv under lincs_csv_file_path.

Training unconstrained model on GuacaMol dataset

Train GLDM with WAE loss

cd autoencoder
python train_guacamol.py \
        --layer_type=FiLMConv \
        --model_architecture=aae \
        --gradient_clip_val=0.0 \
        --max_lr=1e-4 --gen_step_drop_probability=0.5 \
        --using_wasserstein_loss --using_gp
  • To use other GNN layer type, change --layer_type values into GATConv or GCNConv.
  • For GLDM with other regularization losses:
    • GLDM + GAN loss: remove --using_wasserstein_loss --using_gp
    • GLDM + VAE loss: remove --using_wasserstein_loss --using_gp and change --model_architecture=vae

Model checkpoints will automatically be saved under the current folder.

Training constrained model on L1000 dataset

python train_l1000.py \
    --layer_type=FiLMConv \
    --model_architecture=aae \
    --gradient_clip_val=0.0 \
    --use_oclr_scheduler \
    --max_lr=1e-4 \
    --using_wasserstein_loss --using_gp \
    --gen_step_drop_probability=0.0
  • To try other GNN layer type or regularization loss, refer to last section to modify the parameters
  • To finetune a pretrained unconstrained model, configure --pretrained_ckpt with the pretrained model checkpoint and --pretrained_ckpt_model_type to be vae or aae accordingly

Training the latent diffusion model

Make sure the encoder model is developed before proceeding to train the latent diffusion model. Configure the path to the encoder model checkpoint in config_file. Also remember to change the path variables likely to training the encoder and decoder.

Training unconstrained model on GuacaMol dataset

cd ldm
python train_ldm_guacamol.py \
    --layer_type=FiLMConv \
    --model_architecture=aae \
    --use_oclr_scheduler \
    --gradient_clip_val=1.0 \
    --max_lr=1e-4 \
    --using_wasserstein_loss --using_gp \
    --gen_step_drop_probability=0.9 \
    --config_file=config/ldm_uncon+wae_uncon.yml

Training constrained model on L1000 dataset

python train_ldm_l1000.py \
    --layer_type=FiLMConv \
    --model_architecture=aae \
    --use_oclr_scheduler \
    --gradient_clip_val=1.0 \
    --max_lr=1e-4 \
    --using_wasserstein_loss --using_gp \
    --gen_step_drop_probability=0.9 \
    --config_file=config/ldm_con+wae_con.yml

Sample hit molecules

Please download the trained models from Zenodo and refer to ldm/sample_ldm.ipynb for sampling from developed models.


Evaluations

GuacaMol benchmarks

python distribution_learning.py \
    --using_ldm \
    --ldm_ckpt=model_ckpt/GLDM_WAE_uncond.ckpt \
    --ldm_config=ldm/config/ldm_uncon+wae_uncon.yml \
    --output_fp=guacamol_latest_ldm_wae_10000.json \
    --smiles_file=guacamol_latest_ldm_wae_10000_smiles.pkl \
    --number_samples=10000

The scores of the benchmarks will be shown at the end of the output file (stored at --output_fp).

Structural similarity

First generate the hit candidates constrained by gene expression changes in the test dataset. python evaluate_ldm_l1000_metrics.py -d cuda -m wae

Then refer to ldm/testset_rediscovery.ipynb for evaluations.

Synthetic accessibility and quantitative estimate of drug-likeness

Refer to ldm/SA_QED.ipynb for details.

Binding affinity

20 protein-ligand complex structures are used as baselines. Meta information is recorded in binding_affinity/protein/metadata.csv.

  • Download the binding complex structures from PDB:

    cd binding_affinity
    python download_pdb.py
    
  • Generate the molecules targeted at the selected proteins.

    cd ldm
    python evaluate_ldm_l1000_metrics.py -d cuda -m wae -b
    
  • Save the 3D conformers of the generated molecules

    cd binding_affinity
    python save_ligand.py -m wae
    
  • Molecular docking with Gnina

    python baseline_docking.py
    python molecule_docking.py -m wae
    

    baseline_docking performs docking for the original molecules and molecule_docking deals with the generated molecules.

Change the parameter -m accordingly if you want to check other models. Refer to binding_affinity/evaluate_gnina.ipynb for analyzing the results.


Citation

@article{10.1093/bib/bbae142,
    author = {Wang, Conghao and Ong, Hiok Hian and Chiba, Shunsuke and Rajapakse, Jagath C},
    title = "{GLDM: hit molecule generation with constrained graph latent diffusion model}",
    journal = {Briefings in Bioinformatics},
    volume = {25},
    number = {3},
    pages = {bbae142},
    year = {2024},
    month = {04},
    issn = {1477-4054},
    doi = {10.1093/bib/bbae142},
    url = {https://doi.org/10.1093/bib/bbae142},
    eprint = {https://academic.oup.com/bib/article-pdf/25/3/bbae142/57160111/bbae142.pdf},
}

About

Repository of the GLDM paper

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%