Skip to content

Latest commit

 

History

History
45 lines (29 loc) · 2.6 KB

README.md

File metadata and controls

45 lines (29 loc) · 2.6 KB

Data-Centric Learning from Unlabeled Graphs with Diffusion Model

This is the code for DCT, a data-centric transfer learning framework with diffusion model on graphs. This data-centric approach avoids the inappropriate designs of hand-crafted self-supervised tasks and transfer knowledge by data augmentation. Please refer to our paper (accepted by NeurIPS 2023): https://arxiv.org/abs/2303.10108 for more details.

parameter centric versus data centric

Requirements

This code was developed and tested with Python 3.7.12, PyTorch 1.13.0, and PyG 2.1.0. All dependencies are specified in the requirements.txt file.

Usage

Learning from unlabeled graphs

We do not include the code to learn from the unlabeled graphs. The diffusion model could be any pre-trained diffusion based graph generator. We provide a well-trained model in the path: checkpoints/qm9_denoise.pth. If readers want to train the diffusion model from the scratch, we suggest following the codes from GDSS. Our code should be compatible with any continuous-state graph diffusion model trained on QM9 dataset. For the graph diffusion model trained on other datasets (e.g., ZINC-250K), modifications for the enbale_index variable in the convert_sparse_to_dense function from the utils.mol_utils.py are needed.

Generating task-specific labeled graphs

Implementation of data-centric approach on graph with diffusion model

Following is an example command to run experiments on molecules' and polymers' classification and regression datasets.

# OGBG-SIDER
python main.py --dataset ogbg-molsider

The dataset name can be any of ['plym-density', 'plym-oxygen','plym-melting', 'plym-glass', 'ogbg-mollipo', 'ogbg-molfreesolv', 'ogbg-molesol', 'ogbg-molhiv', 'ogbg-molbace', 'ogbg-molbbbp', 'ogbg-molclintox','ogbg-molsider','ogbg-moltox21','ogbg-moltoxcast']

We use n_jobs=22 by default to covnert the generated molecules to the PyG-style data objectives with 22 processes. Please adjust this hyper-parameter in accordance with the available CPU cores.

Citation

If you find this repository useful, please cite our paper:

@article{liu2023data,
  title={Data-Centric Learning from Unlabeled Graphs with Diffusion Model},
  author={Liu, Gang and Inae, Eric and Zhao, Tong and Xu, Jiaxin and Luo, Tengfei and Jiang, Meng},
  journal={arXiv preprint arXiv:2303.10108},
  year={2023}
}