1anj/ChemHyperMag

ChemHyperMag

Official PyTorch Implementation of "Magnetic Laplacian-based Hypergraph Contrastive Learning for Molecular ADMET Predictions"

ChemHyperMag is a deep learning framework for molecular ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) property prediction. It combines functional-group hypergraph representations, chemically biased Magnetic Laplacian graph convolutions, and multi-task contrastive learning, and is reported to outperform standard baselines on 24 ADMET and TDC-ADMET classification and regression benchmarks.


Abstract

Accurate prediction of Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties is crucial for lowering attrition rates in drug discovery. However, standard molecular graph neural networks often struggle to represent complex functional motifs, directed charge/electronegativity flow, and multi-task dependencies.

We present ChemHyperMag, a multi-task graph neural network framework built on Magnetic Laplacian hypergraph contrastive learning. Each small molecule is mapped to a star-expansion hypergraph whose functional hyperedges are SSSR rings, BRICS retrosynthetic fragments, and Bemis-Murcko scaffolds. To model directional chemical flows, we introduce a Markov transition operator biased by atomic electronegativity and Gasteiger charges, which produces an asymmetric flow matrix. We then compute the normalized complex Magnetic Laplacian of this flow matrix, which encodes directionality in its complex phases. ChemHyperMag applies spectral Chebyshev convolutions on this Laplacian, aggregates node representations with task-specific attention readouts, fuses auxiliary-task features through a primary-task gating network, and trains with molecule-level contrastive learning over phase and feature perturbations. In our evaluations, ChemHyperMag outperforms standard baselines across 24 classification and regression benchmarks from the ADMET and TDC-ADMET datasets.
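The step from an asymmetric flow matrix to a complex Laplacian can be illustrated with a small NumPy sketch. It assumes the commonly used magnetic Laplacian definition (symmetrized weights, with a phase equal to 2πq times the antisymmetric part of the flow). The function name `magnetic_laplacian`, the default `q=0.25`, and the toy flow matrix are illustrative assumptions and are not taken from this repository's `modules.py`.

```python
import numpy as np

def magnetic_laplacian(flow, q=0.25):
    """Normalized magnetic Laplacian of a non-negative, possibly asymmetric flow matrix.

    Assumed (standard) definition, not necessarily the repository's exact one:
        A_s   = (P + P^T) / 2           symmetrized weights
        Theta = 2*pi*q * (P - P^T)      antisymmetric phase matrix
        L     = I - D^{-1/2} (A_s * exp(i*Theta)) D^{-1/2}
    """
    flow = np.asarray(flow, dtype=float)
    sym = 0.5 * (flow + flow.T)
    phase = 2.0 * np.pi * q * (flow - flow.T)
    deg = sym.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5   # isolated nodes keep a zero scale
    hermitian = sym * np.exp(1j * phase)       # direction lives in the complex phase
    normalized = inv_sqrt[:, None] * hermitian * inv_sqrt[None, :]
    return np.eye(len(flow)) - normalized

# Hypothetical 3-node flow matrix (rows sum to 1, like a Markov transition matrix).
flow = np.array([[0.0, 0.7, 0.3],
                 [0.2, 0.0, 0.8],
                 [0.5, 0.5, 0.0]])
lap = magnetic_laplacian(flow)
```

The result is Hermitian, so its eigenvalues are real, which is what allows spectral filters such as Chebyshev polynomials to be applied. Setting `q=0` removes the phase and recovers the ordinary symmetric normalized Laplacian.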


Key Contributions

  1. Functional-Group Hypergraph Representation: Encodes molecules beyond simple pairwise bonds, treating rings, scaffolds, and BRICS synthetic fragments as hyperedges to preserve local chemical context.
  2. Chemically Biased Magnetic Laplacian: Biases graph transition probabilities using physical descriptors (electronegativity and charge), generating a complex-valued Magnetic Laplacian that represents asymmetric chemical fields.
  3. Primary Task-Centered Gating & Multi-Task Contrastive Learning: Implements dynamic auxiliary task selection and feature gating to optimize multi-task predictions while aligning molecular views via phase-perturbed contrastive learning (InfoNCE).
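The contrastive objective named in the third contribution, InfoNCE, can be sketched as follows. This is a NumPy illustration of the standard loss only: the repository's own version lives in `experiments/losses.py` and is written in PyTorch, and the function name `info_nce`, the default temperature, and the random embeddings below are assumptions made for the example.

```python
import numpy as np

def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE loss between two batches of embeddings.

    Row i of view_a and row i of view_b form the positive pair;
    every other row of view_b acts as a negative for row i.
    """
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                       # cosine similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Hypothetical molecule embeddings: a lightly perturbed view versus a mismatched one.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))
mismatched = info_nce(z, z[::-1])
```

The loss is low when the two views of each molecule agree and high when the pairing is wrong, which is the signal used to align perturbed views.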

Project Structure

ChemHyperMag/
├── train.py                    # Main training script with DDP support
├── train.sh                    # Bash script for running training across 24 ADMET tasks
├── create_graph_data.py        # Preprocessing script to parse CSV data into cached DGL binary format
├── benchmark_runtime.py        # Preprocessing, training, and end-to-end benchmarking script
│
├── datasets/                   # Dataset construction and utilities
│   ├── __init__.py
│   ├── data_prepare.py         # Graph construction and split management
│   ├── hyperedge_constructor.py # Functional group hyperedge constructors
│   ├── utils.py                # Electronegativity lookup and atom features
│   ├── admet.csv               # ADMET dataset
│   └── tdc_admet_all.csv       # TDC-ADMET dataset
│
├── experiments/                # GNN architectures, losses, and utilities
│   ├── __init__.py
│   ├── model.py                # Main ChemHyperMag model
│   ├── modules.py              # Magnetic Laplacian and GNN layers
│   ├── chebnet.py              # ChebNetII layers
│   ├── losses.py               # InfoNCE and Automatic Weighted Loss
│   ├── parameters.py           # Evaluation meters and training steps
│   ├── utils.py                # Parameter logging utilities
│   └── visualization.py        # Phase flows and subgraph attributions
│
├── result/                     # Logs and training results
└── checkpoints/                # Model checkpoint directory

Environments

1. Requirements

  • CUDA 11.6+
  • Linux (Ubuntu 18.04+) or macOS
  • Python 3.9+

2. Installation & Conda Environment Setup

# Create conda environment
conda create -n chemhypermag python=3.9 -y
conda activate chemhypermag

# Install PyTorch and DGL (adjust CUDA version if necessary)
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install dgl-cu116 -f https://data.dgl.ai/wheels/repo.html

# Install core scientific dependencies
pip install scikit-learn pandas numpy scipy tqdm mendeleev matplotlib networkx

# Install chemistry libraries
pip install rdkit

Downstream Task Training

Dataset CSV Format

The input CSV files (e.g., datasets/admet.csv) should contain the following columns:

  • smiles: SMILES representation of the molecule
  • group: Train/validation/test split label (training, valid, test)
  • Task Columns (e.g., HIA, OB, BBB): Binary labels (0 or 1) or continuous values. Missing values should be left empty or filled with standard placeholders.

Example:

smiles,group,HIA,OB,BBB
CC(=O)Oc1ccccc1C(=O)O,training,1,0,1
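As a quick check that a CSV file matches this layout, the following sketch reads the columns with the standard library and groups rows by split. The helper name `load_splits` and the second sample row (an ethanol entry with a missing HIA label) are hypothetical and added only for illustration; the repository's actual loader is in `datasets/data_prepare.py` and may behave differently.

```python
import csv
import io
from collections import defaultdict

# First data row is the README example; the second row is a made-up
# entry that shows an empty (missing) task label.
SAMPLE = """smiles,group,HIA,OB,BBB
CC(=O)Oc1ccccc1C(=O)O,training,1,0,1
CCO,valid,,1,0
"""

def load_splits(handle, task_columns):
    """Group rows by split label; empty task cells become None (missing)."""
    splits = defaultdict(list)
    for row in csv.DictReader(handle):
        labels = {
            task: float(row[task]) if row[task].strip() else None
            for task in task_columns
        }
        splits[row["group"]].append((row["smiles"], labels))
    return dict(splits)

splits = load_splits(io.StringIO(SAMPLE), ["HIA", "OB", "BBB"])
```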

Preprocessing and Caching

First, preprocess the raw datasets to build and cache the molecular hypergraph structures:

python create_graph_data.py

This script saves the preprocessed graphs as binary files (admet.bin, tdc_admet_all.bin) and records splits in separate group CSVs.
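The abstract describes the cached structures as star-expansion hypergraphs. A minimal sketch of star expansion in plain Python is shown below: each hyperedge becomes an extra "hub" node connected to its member atoms. The function name `star_expand`, the edge-list output, and the benzene toy input are assumptions for illustration; how the repository actually stores hyperedges in DGL is not shown in this README.

```python
def star_expand(num_atoms, bonds, hyperedges):
    """Star-expand a molecular hypergraph.

    bonds:      pairwise (atom, atom) edges, kept unchanged
    hyperedges: iterable of atom-index sets (e.g. rings, fragments, scaffolds)
    Returns the bond list plus (atom, hub) membership edges, where hub
    indices start at num_atoms, one hub per hyperedge.
    """
    edges = [(u, v) for u, v in bonds]
    membership = []
    for offset, members in enumerate(hyperedges):
        hub = num_atoms + offset
        membership.extend((atom, hub) for atom in sorted(members))
    return edges, membership

# Toy example: a six-membered ring with a single ring hyperedge.
bonds = [(i, (i + 1) % 6) for i in range(6)]
edges, membership = star_expand(6, bonds, [{0, 1, 2, 3, 4, 5}])
```

Here the ring hyperedge becomes node 6, linked to atoms 0 through 5, so message passing can reach every ring atom through one hub.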

Training Usage

The training script train.py accepts the following arguments:

usage: train.py [-h] [--dataset {admet,tdc_admet_all}] [--bin_path BIN_PATH]
                [--group_path GROUP_PATH] [--select_task_list SELECT_TASK_LIST [SELECT_TASK_LIST ...]]
                [--primary_task_index PRIMARY_TASK_INDEX] [--primary_task_weight PRIMARY_TASK_WEIGHT]
                [--use_primary_centered_gate USE_PRIMARY_CENTERED_GATE] [--device DEVICE] [--lr LR]
                [--weight_decay WEIGHT_DECAY] [--num_epochs NUM_EPOCHS] [--patience PATIENCE]
                [--batch_size BATCH_SIZE] [--in_feats IN_FEATS] [--use_hypergraph USE_HYPERGRAPH]
                [--hyperedge_in_feats HYPEREDGE_IN_FEATS] [--hidden_feats HIDDEN_FEATS]
                [--molecular_embedding_dim MOLECULAR_EMBEDDING_DIM] [--output_feats OUTPUT_FEATS]
                [--dropout DROPOUT] [--classifier_hidden_feats CLASSIFIER_HIDDEN_FEATS]
                [--use_chebnet USE_CHEBNET] [--chebnet_K CHEBNET_K]
                [--use_dynamic_task_selection USE_DYNAMIC_TASK_SELECTION]
                [--similarity_threshold SIMILARITY_THRESHOLD] [--min_aux_tasks MIN_AUX_TASKS]
                [--use_magnetic USE_MAGNETIC] [--magnetic_q MAGNETIC_Q] [--teleport_tau TELEPORT_TAU]
                [--use_contrastive USE_CONTRASTIVE] [--contrastive_weight CONTRASTIVE_WEIGHT]
                [--contrastive_temperature CONTRASTIVE_TEMPERATURE] [--q_perturb_ratio Q_PERTURB_RATIO]
                [--feature_dropout_aug FEATURE_DROPOUT_AUG] [--distributed DISTRIBUTED]
                [--local_rank LOCAL_RANK]

Run Training on a Single GPU:

python train.py \
    --dataset admet \
    --primary_task_index 4 \
    --use_magnetic True \
    --use_contrastive True \
    --num_epochs 200 \
    --batch_size 128 \
    --lr 1e-3
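Boolean options such as `--use_magnetic` are passed with an explicit value (`True` or `False`). How `train.py` converts these strings is not shown here; the sketch below illustrates one common approach, assuming a hypothetical `str2bool` converter. It matters because plain `type=bool` in `argparse` treats any non-empty string, including `"False"`, as true. The defaults used here are also assumptions.

```python
import argparse

def str2bool(value):
    """Convert common textual booleans to bool; reject anything else."""
    text = str(value).strip().lower()
    if text in {"true", "1", "yes", "y"}:
        return True
    if text in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

parser = argparse.ArgumentParser()
# Flag names match the usage text above; the defaults are illustrative.
parser.add_argument("--use_magnetic", type=str2bool, default=True)
parser.add_argument("--use_contrastive", type=str2bool, default=True)

args = parser.parse_args(["--use_magnetic", "False", "--use_contrastive", "True"])
```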

Distributed Multi-GPU Training (DDP):

Launch training across multiple GPUs with PyTorch's torchrun launcher, which starts one process per GPU:

torchrun --nproc_per_node=8 train.py \
    --primary_task_index 4 \
    --distributed True \
    --use_primary_centered_gate True \
    --use_contrastive True \
    --num_epochs 200 \
    --batch_size 128
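When launched this way, `torchrun` exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` as environment variables to each worker process. Whether `train.py` reads these variables or relies on its `--local_rank` argument is not stated in this README, so the following is only a sketch of the environment-variable route, with a hypothetical helper name `read_dist_env`.

```python
import os

def read_dist_env(environ=os.environ):
    """Read the rank variables that torchrun exports to each worker.

    Falls back to single-process values when the variables are absent,
    so the same script also runs with a plain `python` invocation.
    """
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }

# Simulated environment for the fourth of eight workers on one node.
info = read_dist_env({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "8"})
```

The `local_rank` value is typically used to pick the GPU for the process, for example by building the device string `f"cuda:{info['local_rank']}"`.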

Or run the provided training script:

chmod +x train.sh
./train.sh

Runtime Benchmarking

Benchmark the preprocessing overhead, cached training runtime, and end-to-end training runtime of baseline 2D graphs vs. our functional-group hypergraphs:

python benchmark_runtime.py \
    --dataset admet \
    --cached_train_epochs 5 \
    --end_to_end_train_epochs 5

This generates LaTeX table files and a JSON report summarizing the computational overhead and training acceleration.


Visualization and Interpretability

Run the visualization suite to analyze phase flow distributions, functional group circulation, and task-specific subgraph attributions:

python experiments/visualization.py \
    --checkpoint_path checkpoints/admet_early_stop_XXX.pth \
    --bin_path ./datasets/admet.bin \
    --group_path ./datasets/admet_group.csv \
    --output_dir ./visualization_results

This generates:

  1. phase_distribution_histogram.pdf: Histogram of complex phases ($\Theta_e$) computed across the test set.
  2. circulation_Carboxyl.pdf, circulation_Hydroxyl.pdf: Inferred circulation flow vectors across selected functional groups.
  3. attribution_BBB.png: Map highlighting the subgraphs and hyperedges that gradient attribution identifies as critical to Blood-Brain Barrier (BBB) permeability prediction.
