GitHub - Westlake-AI/FoldToken_open

FoldToken: A generative protein structure language!

Table of Contents

About The Project
Getting Started
Usage
Downstream Tasks
Dataset
License
Contact
Citation

About The Project

This project aims to create a protein structure language that unifies the modalities of protein sequence and structure. Here we provide the open-source code of FoldToken4 for research purposes. Researchers are welcome to use or contribute to this project. The structure language is not yet perfect for all downstream tasks, and we need more feedback to improve it further.

Here's why we introduce structure language:

Unification: If we can convert data of any modality to a sequence representation, we can use a unified transformer model to solve any problem in protein modeling.
Simplification: Structure modeling generally requires complex and inefficient model design. In comparison, the highly optimized transformer is more suitable for scaling up.


Getting Started

conda env create -f environment.yml

Usage

Reconstruct Protein Structures

export PYTHONPATH=project_path

CUDA_VISIBLE_DEVICES=0 python foldtoken/reconstruct.py --path_in ./N128 --path_out ./N128_pred --level 8

One can use this script to validate the reconstruction performance of FoldToken4. The model encodes the input PDB files in path_in, reconstructs them, and saves the reconstructed structures to path_out. Users can specify the config and checkpoint to select the appropriate model. The codebook size is $2^{level}$, i.e., $2^{8} = 256$ in the example.


8ybxR	8vy8E

8ybxR: [222, 120, 78, 191, 184, 3, 190, 182, 182, 4, 51, 254, 210, 252, 72, 188, 121, 86, 188, 236, 236, 237, 24, 195, 47, 248, 247, 192, 74, 79, 82, 27, 199, 167, 170, 70, 45, 32, 215, 14, 14, 254, 254, 59, 38, 166, 115, 98, 53, 1, 106, 79, 79, 79, 166, 240, 181, 162, 179, 96, 16, 69, 211, 112, 113, 197, 49, 56, 246, 122, 214, 119, 50, 252, 51, 51, 171, 151, 41, 185, 207, 216, 153, 243]

8vy8E: [13, 186, 190, 211, 51, 178, 252, 119, 50, 103, 112, 6, 185, 190, 228, 3, 81, 139, 139, 116, 127, 242, 242, 182, 251, 38, 195, 195, 244, 86, 225, 44, 250, 180, 227, 39, 57, 142, 237, 49, 251, 51, 190, 26, 88, 139, 218, 2, 239, 43, 43, 215, 124, 60, 205, 195, 98, 166, 1, 242, 127, 191, 102, 41, 240, 211, 54, 19, 219, 194, 113, 16, 179, 162]
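
As a quick sanity check on the statement that the codebook size is $2^{level}$, the sketch below verifies that token ids fall inside the codebook range. It assumes the ids are zero-based indices into the codebook. The helper names `codebook_size` and `ids_in_range` are illustrative and are not part of the FoldToken codebase. The ids are the first ten tokens of 8ybxR shown above.

```python
def codebook_size(level: int) -> int:
    """Number of codes for a given level: 2 ** level."""
    if level < 0:
        raise ValueError("level must be non-negative")
    return 2 ** level


def ids_in_range(ids, level):
    """True if every id lies in [0, 2 ** level).

    Assumption: ids are zero-based indices into the codebook.
    """
    size = codebook_size(level)
    return all(0 <= i < size for i in ids)


# First ten ids of 8ybxR from the example output above.
ids_8ybxR = [222, 120, 78, 191, 184, 3, 190, 182, 182, 4]

print(codebook_size(8))            # 256
print(ids_in_range(ids_8ybxR, 8))  # True
print(ids_in_range(ids_8ybxR, 7))  # False, since 222 >= 128
```

A check like this can catch a mismatch between the `--level` used for tokenizing and the level assumed by downstream code.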

Batch tokenizing structures

export PYTHONPATH=project_path

CUDA_VISIBLE_DEVICES=0 python extract_vq_ids.py --path_in ./N128 --save_vqid_path ./N128_vqid.jsonl --level 8

CUDA_VISIBLE_DEVICES=0 python extract_vq_ids_jsonl.py --path_in ./pdb.jsonl --save_vqid_path ./N128_vqid.jsonl --level 8

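One can use the scripts above to extract VQ ids from the structures given by path_in (a directory of PDB files for extract_vq_ids.py, or a JSONL file for extract_vq_ids_jsonl.py) and save them to save_vqid_path. Users can specify the config and checkpoint to select the appropriate model. The codebook size is $2^{level}$, i.e., $2^{8} = 256$ in the example.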
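
For scripted runs, the batch tokenizing command can be assembled from Python. The following is a minimal sketch that uses only the script name, flags, and environment variables shown in the commands above. The helpers `build_tokenize_command` and `build_env` are hypothetical names introduced here, not part of the repository, and the actual run is left commented out because it requires the repository, its conda environment, and a checkpoint.

```python
import os
import shlex


def build_tokenize_command(path_in, save_vqid_path, level,
                           script="extract_vq_ids.py"):
    """Return the argument list for the batch tokenizing script.

    Only the flags shown in this README are used.
    """
    return [
        "python", script,
        "--path_in", str(path_in),
        "--save_vqid_path", str(save_vqid_path),
        "--level", str(level),
    ]


def build_env(project_path, gpu="0"):
    """Copy the current environment and set the two variables used above."""
    env = dict(os.environ)
    env["PYTHONPATH"] = str(project_path)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return env


cmd = build_tokenize_command("./N128", "./N128_vqid.jsonl", level=8)
print(shlex.join(cmd))

# To actually run it from the repository root:
# import subprocess
# subprocess.run(cmd, env=build_env("project_path"), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.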


Downstream Tasks

Dataset & Model

Dataset	Link	Samples	Comments
PDB	Download	162,118	Used for pretraining, Multi-chain data
CATH4.3	Download	22,508	Single-chain Data
N128	Download	128	Single-chain Data, for evaluation
T116	Download	493	Single-chain Data, for evaluation
T493	Download	493	Single-chain Data, for evaluation
M1031	Download	1,031	Protein Complex Data, for evaluation

Model	Link
FoldToken4	ckpt

License

Distributed under the Apache 2.0 License. See LICENSE.txt for more information.


Contact

Zhangyang Gao - gaozhangyang@westlake.edu.cn


Citation

If you are interested in our repository or our papers, please cite the following papers:

@article{gao2023vqpl,
  title={Vqpl: Vector quantized protein language},
  author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
  journal={arXiv preprint arXiv:2310.04985},
  year={2023}
}

@article{gao2024foldtoken,
  title={Foldtoken: Learning protein language via vector quantization and beyond},
  author={Gao, Zhangyang and Tan, Cheng and Wang, Jue and Huang, Yufei and Wu, Lirong and Li, Stan Z},
  journal={arXiv preprint arXiv:2403.09673},
  year={2024}
}
@article{gao2024foldtoken2,
  title={FoldToken2: Learning compact, invariant and generative protein structure language},
  author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
  journal={bioRxiv},
  pages={2024--06},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
@article{gao2024foldtoken3,
  title={FoldToken3: Fold Structures Worth 256 Words or Less},
  author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
  journal={bioRxiv},
  pages={2024--07},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
@article{gao2024foldtoken4,
  title={FoldToken4: Consistent \& Hierarchical Fold Language},
  author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
  journal={bioRxiv},
  pages={2024--08},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
