torch-molecule logo


Deep learning for molecular discovery with a simple sklearn-style interface


torch-molecule is a package for molecular discovery with deep learning, featuring a user-friendly, sklearn-style interface. It includes model checkpoints for efficient deployment and benchmarking across a range of molecular tasks. The package is organized around three main components: Predictive Models, Generative Models, and Representation Models, making molecular AI models easy to implement and deploy.

scikit-learn vs torch-molecule comparison

See the List of Supported Models section for all available models.

Installation

  1. Create a Conda environment:

    conda create --name torch_molecule python=3.11.7
    conda activate torch_molecule
  2. Install the released version (0.1.2) from PyPI:

    pip install torch-molecule
  3. Install from source for the latest version:

    Clone the repository:

    git clone https://github.com/liugangcode/torch-molecule
    cd torch-molecule

    Install:

    pip install .
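
After either installation route, a quick import works as a smoke test (a minimal sketch; it only assumes the package imports cleanly in the active environment):

# Smoke test: if this import succeeds, the installation is usable.
import torch_molecule
print("torch-molecule imported successfully")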

Additional Packages

| Model | Required Packages |
|---|---|
| HFPretrainedMolecularEncoder | transformers |
| BFGNNMolecularPredictor | torch-scatter |
| GRINMolecularPredictor | torch-scatter |

For models that require torch-scatter, install it with a wheel matching your installed PyTorch and CUDA versions: pip install torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html, e.g.,

pip install torch-scatter -f https://data.pyg.org/whl/torch-2.7.1+cu128.html
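
The ${TORCH} and ${CUDA} placeholders must match your installed PyTorch build. A minimal sketch for looking them up (uses only standard PyTorch attributes):

import torch

# Values to substitute into the torch-scatter wheel URL,
# e.g. TORCH=2.7.1 and CUDA=cu128 (use "cpu" for CPU-only builds).
print(torch.__version__)   # e.g. "2.7.1+cu128"
print(torch.version.cuda)  # e.g. "12.8"; None for CPU-only builds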

For models that require transformers: pip install transformers

Usage

More examples can be found in the examples and tests folders.

torch-molecule supports applications across broad domains, from chemistry and biology to materials science. To get started, you can load prepared datasets from torch_molecule.datasets (available after v0.1.3):

| Dataset | Description | Function |
|---|---|---|
| qm9 | Quantum chemical properties (DFT level) | load_qm9 |
| chembl2k | Bioactive molecules with drug-like properties | load_chembl2k |
| broad6k | Bioactive molecules with drug-like properties | load_broad6k |
| toxcast | Toxicity of chemical compounds | load_toxcast |
| admet | Chemical absorption, distribution, metabolism, excretion, and toxicity | load_admet |
| gasperm | Six gas permeability properties for polymeric materials | load_gasperm |

from torch_molecule.datasets import load_qm9

# local_dir is the local path where the dataset will be saved
smiles_list, property_np_array = load_qm9(local_dir='torchmol_data')

# len(smiles_list): 133885
# Property array shape: (133885, 1)

# load_qm9 returns the target "gap" by default, but you can adjust it by passing new target_cols
target_cols = ['homo', 'lumo', 'gap']
smiles_list, property_np_array = load_qm9(local_dir='torchmol_data', target_cols=target_cols)
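
Before fitting a model, it is common to hold out a validation split. Below is a minimal sketch using numpy (purely illustrative; the fitting example in the next section uses a simpler sequential split):

import numpy as np

# Shuffle indices reproducibly and split 80/20 into training and validation sets.
rng = np.random.default_rng(0)
perm = rng.permutation(len(smiles_list))
split = int(0.8 * len(smiles_list))
train_idx, val_idx = perm[:split], perm[split:]

train_smiles = [smiles_list[i] for i in train_idx]
val_smiles = [smiles_list[i] for i in val_idx]
train_y, val_y = property_np_array[train_idx], property_np_array[val_idx]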

We welcome suggestions and contributions of new datasets!

Fit a Model

After preparing the dataset, we can fit a model much as we would with sklearn. In fact, the code is often simpler: torch-molecule takes SMILES strings directly, whereas sklearn would require feature engineering to convert molecules into vectors first:

from torch_molecule import GREAMolecularPredictor

split = int(0.8 * len(smiles_list))

grea = GREAMolecularPredictor(
    num_task=property_np_array.shape[1],  # number of prediction targets
    task_type="regression",
    evaluate_higher_better=False,
    verbose=True
)

# Fit with automatic hyperparameter tuning (10 trials); alternatively, call .fit() with default or manually set hyperparameters
grea.autofit(
    X_train=smiles_list[:split],
    y_train=property_np_array[:split],
    X_val=smiles_list[split:],
    y_val=property_np_array[split:],
    n_trials=10,
)
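
If you want to skip the hyperparameter search, you can call .fit() directly and then predict on the held-out molecules. The sketch below assumes .fit() takes the same data arguments as autofit (without n_trials) and that predict() returns predictions for the given SMILES; check the documentation for the exact signatures:

# Train with default or manually set hyperparameters instead of tuning.
grea.fit(
    X_train=smiles_list[:split],
    y_train=property_np_array[:split],
    X_val=smiles_list[split:],
    y_val=property_np_array[split:],
)

# Predict properties for the held-out molecules.
predictions = grea.predict(smiles_list[split:])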

Checkpoints

torch-molecule provides checkpoint utilities for saving trained models to, and loading them from, Hugging Face:

from torch_molecule import GREAMolecularPredictor

repo_id = "user/repo_id"  # replace with your own Hugging Face username and repo_id

# Save the trained model to Hugging Face
grea.save_to_hf(
    repo_id=repo_id,
    task_id="qm9_grea",
    commit_message="Upload qm9_grea",
    private=False
)

# Load a pretrained checkpoint from Hugging Face
model = GREAMolecularPredictor()
model.load_from_hf(repo_id=repo_id, local_cache="checkpoints/GREA_qm9.pt")  # local path where the checkpoint is cached

# Adjust model parameters and make predictions
model.set_params(verbose=False)
predictions = model.predict(smiles_list)

Or you can save the model to a local path:

grea.save_to_local("qm9_grea.pt")

new_model = GREAMolecularPredictor()
new_model.load_from_local("qm9_grea.pt")
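
The reloaded model behaves like the original, so it can be used for prediction right away (a minimal sketch reusing the calls shown in the Hugging Face example above):

new_model.set_params(verbose=False)
predictions = new_model.predict(smiles_list[split:])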

List of Supported Models

Predictive Models

| Model | Reference |
|---|---|
| GRIN | Learning Repetition-Invariant Representations for Polymer Informatics. May 2025 |
| BFGNN | Graph neural networks extrapolate out-of-distribution for shortest paths. March 2025 |
| SGIR | Semi-Supervised Graph Imbalanced Regression. KDD 2023 |
| GREA | Graph Rationalization with Environment-based Augmentations. KDD 2022 |
| DIR | Discovering Invariant Rationales for Graph Neural Networks. ICLR 2022 |
| SSR | SizeShiftReg: a Regularization Method for Improving Size-Generalization in Graph Neural Networks. NeurIPS 2022 |
| IRM | Invariant Risk Minimization. 2019 |
| RPGNN | Relational Pooling for Graph Representations. ICML 2019 |
| GNNs | Graph Convolutional Networks. ICLR 2017 and Graph Isomorphism Network. ICLR 2019 |
| Transformer (SMILES) | Transformer (Attention is All You Need. NeurIPS 2017) based on SMILES strings |
| LSTM (SMILES) | Long short-term memory (Neural Computation 1997) based on SMILES strings |

Generative Models

| Model | Reference |
|---|---|
| Graph DiT | Graph Diffusion Transformers for Multi-Conditional Molecular Generation. NeurIPS 2024 |
| DiGress | DiGress: Discrete Denoising Diffusion for Graph Generation. ICLR 2023 |
| GDSS | Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations. ICML 2022 |
| MolGPT | MolGPT: Molecular Generation Using a Transformer-Decoder Model. Journal of Chemical Information and Modeling 2021 |
| JTVAE | Junction Tree Variational Autoencoder for Molecular Graph Generation. ICML 2018 |
| GraphGA | A Graph-Based Genetic Algorithm and Its Application to the Multiobjective Evolution of Median Molecules. Journal of Chemical Information and Computer Sciences 2004 |
| LSTM (SMILES) | Long short-term memory (Neural Computation 1997) based on SMILES strings |

Representation Models

| Model | Reference |
|---|---|
| MoAMa | Motif-aware Attribute Masking for Molecular Graph Pre-training. LoG 2024 |
| GraphMAE | GraphMAE: Self-Supervised Masked Graph Autoencoders. KDD 2022 |
| AttrMasking | Strategies for Pre-training Graph Neural Networks. ICLR 2020 |
| ContextPred | Strategies for Pre-training Graph Neural Networks. ICLR 2020 |
| EdgePred | Strategies for Pre-training Graph Neural Networks. ICLR 2020 |
| InfoGraph | InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. ICLR 2020 |
| Supervised | Supervised pretraining |
| Pretrained | Hugging Face checkpoints listed below |

Supported pretrained checkpoints:

- GPT2-ZINC-87M: GPT-2 based model (87M parameters) pretrained on the ZINC dataset with ~480M SMILES strings.
- RoBERTa-ZINC-480M: RoBERTa based model (102M parameters) pretrained on the ZINC dataset with ~480M SMILES strings.
- UniKi/bert-base-smiles: BERT model pretrained on SMILES strings.
- ChemBERTa-zinc-base-v1: RoBERTa model pretrained on the ZINC dataset with ~100k SMILES strings.
- ChemBERTa series: available in multiple sizes and training objectives (MLM/MTR): ChemBERTa-5M-MLM, ChemBERTa-5M-MTR, ChemBERTa-10M-MLM, ChemBERTa-10M-MTR, ChemBERTa-77M-MLM, ChemBERTa-77M-MTR.
- ChemGPT series: GPT-Neo based models pretrained on the PubChem10M dataset with SELFIES strings: ChemGPT-1.2B, ChemGPT-4.7B, ChemGPT-19B.

Acknowledgements

The project template was adapted from https://github.com/lwaekfjlk/python-project-template. We thank the authors for their contribution to the open-source community.
