Skip to content

LtECoD/CoFlow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CoFlow

CoFlow is a discrete generative model for protein sequence and structure co-design, as described in our paper: Co-Design Protein Sequence and Structure in Discrete Space via Generative Flow

Dependencies

To run the source code, install the required dependencies:

# Install environment with dependencies.
conda env create -n coflow -f requirements.yaml

# Activate environment
conda activate coflow

CoFlow takes pre-trained structure VQ-VAE of ESM3 for tokenization. Therefore, running inference requires that you have access to the ESM3 weights. Make sure your huggingface account has access to "EvolutionaryScale/esm3-sm-open-v1". Then generate your huggingface access token (check the permissions to repositories) and run the following command to download the weights:

huggingface-cli login   
# input your huggingface access token
huggingface-cli download EvolutionaryScale/esm3-sm-open-v1

Downloaded files are typically located at "~/.cache/huggingface/hub/models--EvolutionaryScale--esm3-sm-open-v1".

Usage

Inference

Note: Download the trained model weights from here, and extract them to the checkpoint directory. As a result, the checkpoint will include four files:

checkpoint
├── config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
└── version

See the notebook example.ipynb for information on how to use CoFlow, which includes examples of both unconditional and conditional generation.

Train

To train the model, you will need to pre-process dataset. Just run the following script:

python source/preprocess.py

Several parameters need to be specified in the script, including:

  • fp_txt: Path to a text file containing PDB file paths
  • meta_fp: Path to store metadata
  • txt_out: Path to save processed data

You can also customize other parameters to control filtering granularity.

The processed dataset consists of two .txt files:

  1. A sequence file, where each line corresponds to a protein sequence.
  2. A structure token file, where each line represents a discrete protein structure, encoded using the VQVAE model in ESM3.

Example:

Sequence line:

908/MGYP003390323908 MKLIITLLLFVSLLPAYAAIMDGNCRDSQGSFRGEIIFREARHTQVVVGIRDRADYLNRGLAITFPRLELSGHKVVAQYSHPHYAGIGSEASRLEFDGALIRLTTLVRNAPNGSFNLSVSCLLDVPRDRQELGRLVREMNTH

Structure line:

908/MGYP003390323908 1035 3954 305 3961 3961 2082 588 3101 588 3109 2439 3227 1763 852 1364 943 3799 3617 3106 177 3705 1220 3892 2520 3683 2945 2886 1805 3013 1862 194 1167 1487 2670 1191 3857 2302 163 3975 2293 1582 3211 322 3737 2446 560 1534 1177 697 794 1179 3994 3023 2983 2816 3148 1033 1395 2556 1712 3949 189 2536 2194 1451 1619 3509 1011 3332 872 1272 3660 3904 2463 3677 1419 767 2269 1399 1179 741 3378 1404 1993 82 1786 1204 795 3052 2452 496 3889 3331 2861 634 1057 978 1186 2781 2989 189 3166 1809 1547 2832 1367 276 4084 1076 2769 800 1480 1862 3721 3538 3362 1785 2081 3556 2557 2259 2756 1713 2331 2780 594 1169 1412 2776 3961 588 778 668 588 2587 1695 2048 1414 2425 2080 2103 969

CoFlow is trained with two datasets:

License

The model and code are released under the Cambrian Open License

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published