This project aims to create a protein structure language that unifies the modalities of protein sequence and structure. Here we provide the open-source code of FoldToken4 for research purposes. Researchers are welcome to use or contribute to this project! We know that the structure language is not yet perfect for all downstream tasks, and we need more feedback to improve it further.
Here's why we introduce structure language:
- Unification: If we can convert data of any modality to a sequence representation, we can use the unified transformer model to solve any problem in protein modeling.
- Simplification: Structure modeling generally requires complex and inefficient model designs. In comparison, the highly optimized transformer is better suited to scaling up.
conda env create -f environment.yml
export PYTHONPATH=project_path
CUDA_VISIBLE_DEVICES=0 python foldtoken/reconstruct.py --path_in ./N128 --path_out ./N128_pred --level 8
One can use this script to validate the reconstruction performance of FoldToken4. The model will encode the input PDBs in `path_in`, reconstruct them, and save the reconstructed structures to `path_out`. Users can specify `config` and `checkpoint` to select the appropriate model. The codebook size is
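To put a number on reconstruction quality, one option is to compare C-alpha geometry between an input file and its reconstruction. The sketch below is a minimal, standard-library-only illustration and is not the evaluation used by the authors: `read_ca_coords` and `drmsd` are hypothetical helper names, and the metric is a distance-matrix RMSD (dRMSD), chosen because it needs no superposition. It assumes both files list the same residues in the same order.

```python
import math


def read_ca_coords(pdb_lines):
    """Return [(x, y, z), ...] for C-alpha atoms, using fixed PDB columns."""
    coords = []
    for line in pdb_lines:
        # ATOM records only; atom name sits in columns 13-16 (0-based 12:16).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords


def drmsd(a, b):
    """Distance-matrix RMSD between two equally long coordinate lists.

    Compares all pairwise distances within `a` to those within `b`,
    so the result is unchanged by rotating or translating either set.
    """
    if len(a) != len(b):
        raise ValueError("coordinate lists must have the same length")
    n = len(a)
    if n < 2:
        return 0.0
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = math.dist(a[i], a[j]) - math.dist(b[i], b[j])
            total += diff * diff
            pairs += 1
    return math.sqrt(total / pairs)
```

For example, one could read a file from `path_in` and its counterpart in `path_out` (assuming the output keeps the input file name), pass both through `read_ca_coords`, and call `drmsd` on the results. Note that dRMSD cannot distinguish a structure from its mirror image, so it is a coarse check rather than a replacement for superposition-based metrics.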
Example structures: 8ybxR and 8vy8E.
8ybxR: [222, 120, 78, 191, 184, 3, 190, 182, 182, 4, 51, 254, 210, 252, 72, 188, 121, 86, 188, 236, 236, 237, 24, 195, 47, 248, 247, 192, 74, 79, 82, 27, 199, 167, 170, 70, 45, 32, 215, 14, 14, 254, 254, 59, 38, 166, 115, 98, 53, 1, 106, 79, 79, 79, 166, 240, 181, 162, 179, 96, 16, 69, 211, 112, 113, 197, 49, 56, 246, 122, 214, 119, 50, 252, 51, 51, 171, 151, 41, 185, 207, 216, 153, 243]
8vy8E: [13, 186, 190, 211, 51, 178, 252, 119, 50, 103, 112, 6, 185, 190, 228, 3, 81, 139, 139, 116, 127, 242, 242, 182, 251, 38, 195, 195, 244, 86, 225, 44, 250, 180, 227, 39, 57, 142, 237, 49, 251, 51, 190, 26, 88, 139, 218, 2, 239, 43, 43, 215, 124, 60, 205, 195, 98, 166, 1, 242, 127, 191, 102, 41, 240, 211, 54, 19, 219, 194, 113, 16, 179, 162]
export PYTHONPATH=project_path
CUDA_VISIBLE_DEVICES=0 python extract_vq_ids.py --path_in ./N128 --save_vqid_path ./N128_vqid.jsonl --level 8
CUDA_VISIBLE_DEVICES=0 python extract_vq_ids_jsonl.py --path_in ./pdb.jsonl --save_vqid_path ./N128_vqid.jsonl --level 8
One can use these scripts to extract VQ IDs from the PDBs in `path_in` and save them to `save_vqid_path`. Users can specify `config` and `checkpoint` to select the appropriate model. The codebook size is
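Once the IDs are saved, a quick sanity check is to load them and look at per-structure lengths and code usage. The sketch below assumes, without confirmation from the repository, that each line of the `.jsonl` file is a JSON object mapping a structure name to a list of integer IDs (for example `{"8ybxR": [222, 120, 78]}`); adjust the parsing if the real layout differs. `load_vq_ids` and `summarize` are hypothetical helper names.

```python
import json
from collections import Counter


def load_vq_ids(lines):
    """Parse JSON Lines into {name: [int, ...]}.

    ASSUMPTION: every non-empty line is a JSON object of the form
    {"<structure name>": [id, id, ...]}.
    """
    ids = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for name, tokens in record.items():
            ids[name] = [int(t) for t in tokens]
    return ids


def summarize(ids):
    """Per-structure token count, number of distinct codes, and largest ID."""
    summary = {}
    for name, tokens in ids.items():
        counts = Counter(tokens)
        summary[name] = {
            "length": len(tokens),
            "unique": len(counts),
            "max_id": max(tokens) if tokens else None,
        }
    return summary
```

A typical use would be `with open("./N128_vqid.jsonl") as f: print(summarize(load_vq_ids(f)))`. Checking `max_id` against the codebook size of the chosen level is a cheap way to catch a mismatched checkpoint or config.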
- Structure Generation (struct->struct)
  - Unconditional Generation
  - Inpainting & Scaffolding
  - Binder Design
- Inverse Folding (struct->seq)
- Protein Folding (seq->struct)
  - Single-chain Folding
  - MSA Folding
- Function Prediction (struct->func)
Dataset | Link | Samples | Comments |
---|---|---|---|
PDB | Download | 162,118 | Used for pretraining, Multi-chain data |
CATH4.3 | Download | 22,508 | Single-chain Data |
N128 | Download | 128 | Single-chain Data, for evaluation |
T116 | Download | 116 | Single-chain Data, for evaluation |
T493 | Download | 493 | Single-chain Data, for evaluation |
M1031 | Download | 1031 | Protein Complex Data, for evaluation |
Model | Link |
---|---|
FoldToken4 | ckpt |
Distributed under the Apache 2.0 License. See `LICENSE.txt` for more information.
Zhangyang Gao - gaozhangyang@westlake.edu.cn
If you are interested in our repository or our papers, please cite the following papers:
@article{gao2023vqpl,
title={Vqpl: Vector quantized protein language},
author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
journal={arXiv preprint arXiv:2310.04985},
year={2023}
}
@article{gao2024foldtoken,
title={Foldtoken: Learning protein language via vector quantization and beyond},
author={Gao, Zhangyang and Tan, Cheng and Wang, Jue and Huang, Yufei and Wu, Lirong and Li, Stan Z},
journal={arXiv preprint arXiv:2403.09673},
year={2024}
}
@article{gao2024foldtoken2,
title={FoldToken2: Learning compact, invariant and generative protein structure language},
author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
journal={bioRxiv},
pages={2024--06},
year={2024},
publisher={Cold Spring Harbor Laboratory}
}
@article{gao2024foldtoken3,
title={FoldToken3: Fold Structures Worth 256 Words or Less},
author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
journal={bioRxiv},
pages={2024--07},
year={2024},
publisher={Cold Spring Harbor Laboratory}
}
@article{gao2024foldtoken4,
title={FoldToken4: Consistent \& Hierarchical Fold Language},
author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
journal={bioRxiv},
pages={2024--08},
year={2024},
publisher={Cold Spring Harbor Laboratory}
}