ICLR 2023
Authors: Shengchao Liu, Hongyu Guo, Jian Tang
[Project Page] [OpenReview] [ArXiv]
This repository provides the source code for the ICLR'23 paper Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching, with the following task:
- This project explores the geometric representation learning on molecules.
- We consider pure geometric information, i.e., the molecule conformation.
- For pretraining, we consider Molecule3D.
- For fine-tuning (downstream), we consider QM9, MD17, LBA & LEP.
conda create -n Geom3D python=3.7
conda activate Geom3D
conda install -y -c rdkit rdkit
conda install -y numpy networkx scikit-learn
conda install -y -c conda-forge -c pytorch pytorch=1.9.1
conda install -y -c pyg -c conda-forge pyg=2.0.2
pip install ogb==1.2.1
pip install sympy
pip install ase # for SchNet
pip install atom3d # for Atom3D
pip install cffi # for Atom3D
pip install biopython # for Atom3D
pip intall -e .
- For Molecule3D (pretraining) dataset, please run
examples/generate_Molecule3D.py
. The default path isdata/Molecule3D/Molecule3D_1000000
. - For QM9, it is automatically downloaded in pyg class. The default path is
data/molecule_datasets/qm9
. - For MD17, it is automatically downloaded in pyg class. The default path is
data/md17
. - For LBA, please use the following commands:
cd data
mkdir -p lba/raw
mkdir -p lba/processed
cd lba/raw
wget http://www.pdbbind.org.cn/download/PDBbind_v2020_refined.tar.gz
tar -xzvf PDBbind_v2020_refined.tar.gz
wget https://zenodo.org/record/4914718/files/LBA-split-by-sequence-identity-30.tar.gz
tar -xzvf LBA-split-by-sequence-identity-30.tar.gz
mv split-by-sequence-identity-30/indices ../processed/
mv split-by-sequence-identity-30/targets ../processed/
- For LEP, please use the following commands:
cd data
mkdir -p lep/raw
mkdir -p lep/processed
cd lep/raw
wget https://zenodo.org/record/4914734/files/LEP-raw.tar.gz
tar -xzvf LEP-raw.tar.gz
wget https://zenodo.org/record/4914734/files/LEP-split-by-protein.tar.gz
tar -xzvf LEP-split-by-protein.tar.gz
For pretraining, we provide implementations on eight pretraining baselines and our proposed GeoSSL-DDM under the examples
folder:
- Supervised pretraining in
pretrain_Supervised.py
. - Type prediction pretraining in
pretrain_ChargePrediction.py
. - Distance prediction pretraining in
pretrain_DistancePrediction.py
. - Angle prediction pretraining in
pretrain_TorsionAnglePreddiction.py
. - 3D InfoGraph pretraining in
pretrain_3DInfoGraph.py
. - GeoSSL pretraining framework in
pretrain_GeoSSL.py
.- GeoSSL-RR pretraining with argument
--GeoSSL_option=RR
. - GeoSSL-InfoNCE pretraining with argument
--GeoSSL_option=InfoNCE
. - GeoSSL-EBM-NCE pretraining with argument
--GeoSSL_option=EBM-NCE
. - GeoSSL-DDM pretraining (ours) with argument
--GeoSSL_option=DDM
.
- GeoSSL-RR pretraining with argument
The running scripts and corresponding hyper-parameters can be found in scripts/pretrain_baselines
and scripts/pretrain_GeoSSL_DDM
.
The downstream scripts can be found under the examples
folder:
finetune_qm9.py
finetune_md17.py
finetune_lba.py
finetune_lep.py
The running scripts and corresponding hyper-parameters can be found in scripts/finetune
. Note that as a fair comparison, we keep a fixed hyper-parameter set for each downstream task, and the only difference is the pretrained checkpoints.
We provide both the log files and checkpoints for GeoSSL-DDM here. The log files and checkpoints for other baselines will be released in the next version.
Feel free to cite this work if you find it useful to you!
@inproceedings{
liu2023molecular,
title={Molecular Geometry Pretraining with {SE}(3)-Invariant Denoising Distance Matching},
author={Shengchao Liu and Hongyu Guo and Jian Tang},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=CjTHVo1dvR}
}