Abstract: In this work, we introduce ChemBFN, a language model that handles chemistry tasks based on Bayesian flow networks working on discrete data. A new accuracy schedule is proposed to improve the sampling quality by significantly reducing the reconstruction loss. We show evidence that our method is appropriate for generating molecules with satisfactory diversity even when a smaller number of sampling steps is used. A classifier-free guidance method is adapted for conditional generation. It is also worth pointing out that, after generative training, our model can be fine-tuned on regression and classification tasks with state-of-the-art performance, which opens the gate to building all-in-one models in a single-module style.
- [31/07/2024] Paper is available on arxiv.org.
- [21/07/2024] Paper was submitted to arXiv.
You can find example scripts in the 📁example folder.
You can find pretrained models in the releases.
We provide a Python class `CSVData` to handle data stored in CSV or a similar format containing headers with the following tags:
- `smiles` or `safe` or `selfies` (mandatory): the entities under this tag should be molecule SMILES, SAFE or SELFIES strings. Multiple tags are acceptable.
- `value` (optional): entities under this tag should be molecular properties or classes. Multiple tags are acceptable and in this case you can tell `CSVData` which value(s) should be loaded by specifying `label_idx=[...]`. If a property is not defined, leave it empty and the entity will be automatically masked to `torch.inf` telling the model that this property is unknown.
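
The file layout described above can be illustrated with a small, self-contained sketch. It does not call `CSVData`; the `load_rows` helper, the sample molecules, and the numbers are hypothetical and only mimic the documented conventions: repeated `value` headers, selection through `label_idx`, and empty cells mapped to infinity. To stay within the standard library, `math.inf` stands in for `torch.inf`.

```python
import csv
import io
import math

# Hypothetical file content: one "smiles" column and two "value" columns.
# Empty cells mark properties that are unknown for that molecule.
CSV_TEXT = """smiles,value,value
CCO,0.5,1
c1ccccc1,,0
CC(=O)O,1.2,
"""

MOLECULE_TAGS = ("smiles", "safe", "selfies")


def load_rows(text, label_idx=None):
    """Illustrative stand-in for a loader; NOT the CSVData implementation."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip().lower() for h in next(reader)]
    mol_cols = [i for i, h in enumerate(header) if h in MOLECULE_TAGS]
    value_cols = [i for i, h in enumerate(header) if h == "value"]
    if not mol_cols:
        raise ValueError("at least one smiles/safe/selfies column is required")
    if label_idx is not None:
        # Keep only the requested "value" columns, in the requested order.
        value_cols = [value_cols[i] for i in label_idx]
    rows = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        molecules = [row[i] for i in mol_cols]
        # An empty cell becomes inf, mirroring the masking rule described above.
        values = [float(row[i]) if row[i].strip() else math.inf for i in value_cols]
        rows.append({"molecules": molecules, "values": values})
    return rows


all_rows = load_rows(CSV_TEXT)
second_label_only = load_rows(CSV_TEXT, label_idx=[1])
```

Here `all_rows[1]["values"]` holds `inf` in the first position because the first property of benzene is left empty, while `second_label_only` keeps only the second `value` column of every row.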
@misc{2024chembfn,
title={A Bayesian Flow Network Framework for Chemistry Tasks},
author={Nianze Tao and Minori Abe},
year={2024},
eprint={2407.20294},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.20294},
}