
Structure-informed Language Models Are Protein Designers

This is an unofficial implementation of the arXiv paper Structure-informed Language Models Are Protein Designers by Zaixiang Zheng et al.

LMDesign

Summary of implementation:

We achieve performance similar to the base model implementations (ProtMPNN, ProtMPNN-CMLM). Moreover, our LM-Design implementation achieves surprising results on the CATH 4.2 and TS test datasets, as shown in the table below. Please also refer to "results/" to verify the experimental results.

(image: results table on the CATH 4.2 and TS test datasets)

Code implementation

To run LM-Design, clone this GitHub repo and create the conda environment from lmdesign.yaml.

Check the dependencies listed in the yaml file and install the environment with:

conda env create --file lmdesign.yaml 

All methods are lightweight models, so a single GPU (V100) is enough. Note, however, that featurization for both the structure and PLM models still depends on CPU computation.
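Before launching training, a quick sanity check can confirm whether a GPU is visible. This is a sketch only; it assumes the lmdesign conda environment is active and that PyTorch is the backend (as in the scripts below):

```shell
# Sketch: check that Python can see a CUDA device before training.
# Falls back to a hint if torch is not importable (env not active).
python -c "import torch; print('CUDA available:', torch.cuda.is_available())" \
  2>/dev/null || echo "torch not importable - activate the lmdesign env first"
```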

Code organization:

  • data/ - dataset download code and native PDB files.
  • helper_scripts/ - helper functions for future work, e.g., multi-chain parsing, fixing residues, adding AA bias, tying residues.
  • models/ - models for LM-Design, the structure adapter, ProteinMPNN, PiFold, ESM, and utilities.
  • results/ - experimental results for ProtMPNN, ProtMPNN-CMLM, LM-Design1, LM-Design2, and LM-Design3.
  • scripts/ - shell scripts to reproduce our models.
  • weights/ - weights for the base inverse folding model, e.g., ProteinMPNN.
  • mpnn_train.py - train and validate the ProtMPNN and ProtMPNN-CMLM models.
  • mpnn_test.py - test ProtMPNN and ProtMPNN-CMLM on CATH 4.2 and the extra test datasets TS50 and TS500.
  • lmdesign_train.py - train and validate the LM-Design variants on CATH 4.2.
  • lmdesign_test.py - test LM-Design models on CATH 4.2 and the extra test datasets TS50 and TS500.

Data preparation

The main datasets used by these models are CATH 4.2 and the TS datasets. To install them, go to the data/ directory and run:

./download_preprocessed_dataset.sh 

This automatically downloads the datasets into the data/ directory. Do this before any training or testing.
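As a quick sanity check after the download, you can verify that the files referenced by the test scripts in this README are in place. This is a sketch; the filenames are the ones used throughout the scripts below:

```shell
# Sketch: confirm the expected dataset files landed in data/.
DATA_DIR=./data
for f in chain_set.jsonl chain_set_splits.json test_split_L100.json test_split_sc.json; do
  if [ -f "$DATA_DIR/$f" ]; then echo "ok: $f"; else echo "missing: $f"; fi
done
[ -d "$DATA_DIR/ts" ] && echo "ok: ts/" || echo "missing: ts/"
```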


Trained model weight preparation

Trained weights for ProtMPNN, ProtMPNN-CMLM, LM-Design1 (ProtMPNN-CMLM), LM-Design2 (pretrained ProtMPNN-CMLM: fine-tune), LM-Design3 (pretrained ProtMPNN-CMLM: freeze), and ESM can be found at https://shorturl.at/bopqS. The ESM weights must be in place before training.
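A quick check that the downloaded weight files are in place can save a failed run. This is a sketch: the filenames match those referenced in the scripts below, but WEIGHT_DIR is an assumption; point it at wherever you saved the download.

```shell
# Sketch: check that the downloaded weight files exist.
# WEIGHT_DIR is an assumption - adjust it to your download location.
WEIGHT_DIR=./weights
for w in mpnn_trained_weight.pt mpnn_cmlm_trained_weight.pt lmdesign2_trained_weight.pt; do
  if [ -f "$WEIGHT_DIR/$w" ]; then echo "ok: $w"; else echo "missing: $w"; fi
done
```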


Reproduce ProtMPNN and ProtMPNN-CMLM models

To test the pretrained ProtMPNN and ProtMPNN-CMLM models, refer to "scripts/mpnn_test.sh" and set the mandatory data paths in the shell script:

## Data paths: if you ran download_preprocessed_dataset.sh in the data directory, nothing needs to change here. ##
cath_file='./data/chain_set.jsonl'
cath_splits='./data/chain_set_splits.json'
short_splits='./data/test_split_L100.json'
single_splits='./data/test_split_sc.json'
ts_dir='./data/ts/'

save_dir='./result/mpnn_reproduce/' ## results directory 
## If you can access the project folder, use this path.
saved_weight='/data/project/rw/LMDesign_weights/mpnn_trained_weight.pt'
## Otherwise, set the path where you downloaded the weight.
saved_weight='' 

# Test trained model on CATH4.2 and TS50 & 500
# #################################################
# ## set models - ProtMPNN
# #################################################
python mpnn_test.py --use_pretrained_weights $saved_weight --out_folder $save_dir --jsonl_path $cath_file --file_splits $cath_splits --test_short_path $short_splits --test_single_path $single_splits --test_ts_directory $ts_dir 

#################################################
## set models - ProtMPNN+CMLM
#################################################
save_dir='./result/mpnn_cmlm_reproduce/'
saved_weight='/data/project/rw/LMDesign_weights/mpnn_cmlm_trained_weight.pt' 

## This model was trained (MPNN+CMLM) with 4 decoder layers.
## If you trained with different num_encoder_layers / num_decoder_layers, use those values here.
python mpnn_test.py --use_pretrained_weights $saved_weight --out_folder $save_dir --jsonl_path $cath_file --file_splits $cath_splits --test_short_path $short_splits --test_single_path $single_splits --test_ts_directory $ts_dir --num_decoder_layers 4

After setting the paths above, run mpnn_test.sh. Note that the script runs ProtMPNN and ProtMPNN-CMLM in parallel, so both settings must be configured; if you only want one, modify the script for a single run.

To train the ProtMPNN model, refer to "scripts/mpnn.sh" and "mpnn_train.py".

## Define the datapath as mentioned above
out_folder='./results/reproduce_mpnn_test'
cathDataPath='./data/chain_set.jsonl'
splits='./data/chain_set_splits.json'
shortPath='./data/test_split_L100.json'
scPath='./data/test_split_sc.json'

python mpnn_train.py --model MPNN --epoch 100 --out_folder $out_folder --jsonl_path $cathDataPath --file_splits $splits --test_short_path $shortPath --test_single_path $scPath

After setting the paths above, run mpnn_train.sh.

To train the ProtMPNN-CMLM model, refer to "scripts/mpnn_cmlm.sh" and "mpnn_train.py".

# Training from the pretrained vanilla ProtMPNN model (v_48_020.pt), with 3 encoder / 3 decoder layers, same as the official model
# python mpnn_train.py --model MPNN_CMLM --epoch 100 --out_folder $out_folder --jsonl_path $cathDataPath --file_splits $splits --test_short_path $shortPath --test_single_path $scPath

## Our trained model has 4 decoder layers and is trained from scratch
python mpnn_train.py --model MPNN_CMLM --epoch 100 --out_folder $out_folder --jsonl_path $cathDataPath --file_splits $splits --test_short_path $shortPath --test_single_path $scPath --use_pretrained_weights "" --num_decoder_layers 4 

Run mpnn_cmlm.sh in scripts/.


Reproduce LM-Design models

LM-Design has three main modules: a structure encoder (e.g., ProtMPNN), a protein language model (e.g., ESM-1b), and a structure adapter.

LMDesign1 (ProtMPNN-CMLM)

To test the pretrained LM-Design1 model, refer to "scripts/lmdesign_test.sh" and "lmdesign_test.py", and set the mandatory data paths in the shell script:

## set data paths as above 
cath_file='./data/chain_set.jsonl'
cath_splits='./data/chain_set_splits.json'
short_splits='./data/test_split_L100.json'
single_splits='./data/test_split_sc.json'
ts_dir='./data/ts/'

save_dir='./results/lmdesign_exp1_reproduce' # save result path 
## If you can access the project folder, use this path.
saved_weight='/data/project/rw/LMDesign_weights/mpnn_trained_weight.pt'
## Otherwise, set the path where you downloaded the weight.
saved_weight='' 

# #################################################
# ## set models - LM-Design1 test on CATH 4.2 and TS
# #################################################
python lmdesign_test.py --use_pretrained_weights $saved_weight --out_folder $save_dir --embed_dim 1280 --num_heads 10 --structure_model MPNN  --structure_weight "" --jsonl_path $cath_file --file_splits $cath_splits --test_short_path $short_splits --test_single_path $single_splits --test_ts_directory $ts_dir --num_refinement 6

To train the LMDesign1 model, refer to "scripts/lmdesign.sh" and "lmdesign_train.py".

## Define dataset 
# out_folder='./results/lmdesign1_reproduce'
# cathDataPath='./data/chain_set.jsonl'
# splits='./data/chain_set_splits.json'
# shortPath='./data/test_split_L100.json'
# scPath='./data/test_split_sc.json'

# ####################################
# ## Single implementation 
# #####################################
# # Training LMDESIGN on CATH4.2 exp1 
# python lmdesign_train.py --epoch 100 --out_folder $out_folder --jsonl_path $cathDataPath --file_splits $splits --test_short_path $shortPath --test_single_path $scPath --embed_dim 1280 --num_heads 10 --structure_model MPNN --structure_trainable True --num_decoder_layers 4  --structure_weight ""

Run lmdesign.sh to train LM-Design1.

LMDesign2 (pretrained ProtMPNN-CMLM: fine-tune)

To test the pretrained LMDesign2 model, refer to "scripts/lmdesign_test.sh" and "lmdesign_test.py", and set the mandatory data paths in the shell script:

## set data paths as above 
cath_file='./data/chain_set.jsonl'
cath_splits='./data/chain_set_splits.json'
short_splits='./data/test_split_L100.json'
single_splits='./data/test_split_sc.json'
ts_dir='./data/ts/'

save_dir2='./results/lmdesign_exp2_reproduce'
## If you can access the project folder, use this path.
saved_weight2='/data/project/rw/LMDesign_weights/lmdesign2_trained_weight.pt'
## Otherwise, set the path where you downloaded the weight.
saved_weight2='' 

# #################################################
# ## set models - LM-Design2 test on CATH 4.2 and TS
# #################################################
python lmdesign_test.py --use_pretrained_weights $saved_weight2 --out_folder $save_dir2 --embed_dim 1280 --num_heads 10 --structure_model MPNN --structure_weight "" --num_decoder_layers 4 --jsonl_path $cath_file --file_splits $cath_splits --test_short_path $short_splits --test_single_path $single_splits --test_ts_directory $ts_dir --num_refinement 6

Our pretrained ProtMPNN-CMLM model uses --num_decoder_layers 4; this value depends on how the ProtMPNN-CMLM model was trained. Run the command above.

To train the LMDesign2 model, refer to "scripts/lmdesign.sh" and "lmdesign_train.py".

## Define dataset 
# out_folder='./results/lmdesign1_reproduce'
# cathDataPath='./data/chain_set.jsonl'
# splits='./data/chain_set_splits.json'
# shortPath='./data/test_split_L100.json'
# scPath='./data/test_split_sc.json'

# ####################################
# ## Single implementation 
# #####################################
# # Training LMDESIGN on CATH4.2 exp2
# python lmdesign_train.py --epoch 100 --out_folder $out_folder --jsonl_path $cathDataPath --file_splits $splits --test_short_path $shortPath --test_single_path $scPath --embed_dim 1280 --num_heads 10 --structure_model MPNN --structure_trainable True --structure_weight $pretrained_weights --num_decoder_layers 4

Run the command above. In this experiment, the structure encoder and structure adapter are trainable and the PLM is frozen.

LMDesign3 (pretrained ProtMPNN-CMLM: freeze)

To test the pretrained LMDesign3 model, refer to "scripts/lmdesign_test.sh" and "lmdesign_test.py", and set the mandatory data paths in the shell script:

## set data paths as above 
cath_file='./data/chain_set.jsonl'
cath_splits='./data/chain_set_splits.json'
short_splits='./data/test_split_L100.json'
single_splits='./data/test_split_sc.json'
ts_dir='./data/ts/'

save_dir3='/data/private/LMDESIGN/test/lmdesign_exp3_results'
## If you can access the project folder, use this path.
saved_weight3='/data/private/LMDESIGN/test/lmdesign_exp3_results'
## Otherwise, set the path where you downloaded the weight.
saved_weight3='' 

# #################################################
# ## set models - LM-Design3 test on CATH 4.2 and TS
# #################################################

python lmdesign_test.py --use_pretrained_weights $saved_weight3 --out_folder $save_dir3 --embed_dim 1280 --num_heads 10 --structure_model MPNN --structure_weight "" --num_decoder_layers 4 --jsonl_path $cath_file --file_splits $cath_splits --test_short_path $short_splits --test_single_path $single_splits --test_ts_directory $ts_dir --num_refinement 6

Our pretrained ProtMPNN-CMLM model uses --num_decoder_layers 4; this value depends on how the ProtMPNN-CMLM model was pretrained. Run the command above.

To train the LMDesign3 model, refer to "scripts/lmdesign.sh" and "lmdesign_train.py".

## Define dataset 
# out_folder='./results/lmdesign1_reproduce'
# cathDataPath='./data/chain_set.jsonl'
# splits='./data/chain_set_splits.json'
# shortPath='./data/test_split_L100.json'
# scPath='./data/test_split_sc.json'

# ####################################
# ## Single implementation 
# #####################################
# # Training LMDESIGN on CATH4.2 exp3
# python lmdesign_train.py --epoch 100 --out_folder $out_folder --jsonl_path $cathDataPath --file_splits $splits --test_short_path $shortPath --test_single_path $scPath --embed_dim 1280 --num_heads 10 --structure_model MPNN --structure_trainable False --structure_weight $pretrained_weights --num_decoder_layers 4

Run the command above. In this experiment, LMDesign3 trains only the structure adapter.

I strongly recommend parallel execution: just add "&" in the scripts. Because the trainable weights are relatively small, running process_train1 & process_train2 maximizes GPU utilization. For details, refer to "./scripts/lmdesign.sh".
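The pattern comes down to launching both runs in the background, recording their PIDs, and waiting for both. A minimal sketch (the `true` placeholders stand in for the actual python invocations from the scripts):

```shell
# Sketch of the parallel pattern: launch two runs in the background
# and wait for both to finish before exiting.
run_both() {
  "$1" &
  p1=$!
  "$2" &
  p2=$!
  wait "$p1" "$p2"
}

# Placeholders stand in for e.g. two lmdesign_train.py invocations.
run_both true true && echo "both runs finished"
```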

If you struggle to reproduce results with this code, please contact me via email (구일kthan at gmail.com). Any verification of or feedback on the code is welcome :)
