This repository holds inference and training code for ProteinDPO (Protein Direct Preference Optimization), a preference-optimized, structure-conditioned protein language model based on ESM-IF1. ProteinDPO is described in the paper “Aligning Protein Generative Models with Experimental Fitness via Direct Preference Optimization”.
- Clone this repository:
git clone https://github.com/evo-design/protein-dpo.git
- Navigate to the repository directory:
cd protein-dpo
- Use conda and pip to install required dependencies:
Use the environment.yml file provided in this repository to create and activate a Conda environment with all the necessary dependencies.
conda env create -f environment.yml
conda activate <environment_name>
Install the most recent esm package from its GitHub repository with pip.
pip install git+https://github.com/facebookresearch/esm.git
- Download Model Weights
Download the ProteinDPO model weights from the Zenodo repository and place them in the weights folder.
Download the vanilla ESM-IF1 model weights into the weights directory with the following commands:
cd weights/
wget https://dl.fbaipublicfiles.com/fair-esm/models/esm_if1_gvp4_t16_142M_UR50.pt
The sampling script is a lightly modified version of the sampling script from the ESM-IF1 GitHub repository. Note that stabilizing a protein backbone with ProteinDPO is not guaranteed to preserve its function, so we strongly recommend preserving functional or heavily conserved residues with the --fixed_pos argument.
- Run The Sampling Script
python sample.py --pdbfile <path_to_input_pdb> --weights_path <path_to_model_weights> [additional_arguments]
If no weights_path is provided, the script defaults to the vanilla model weights.
Additional arguments:
--temperature: sampling temperature; lower temperatures yield less diverse samples
--outpath: path for sampled sequence output
--num-samples: desired number of samples
--fixed_pos: positions to hold fixed during sampling (1-indexed: the first residue is 1, not 0)
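Because --fixed_pos counts residues from 1, it can be worth confirming that a sampled sequence really kept the residues you meant to protect. The following is a minimal sketch of such a check; the function name `changed_fixed_positions` is hypothetical and is not part of this repository, and it assumes the sampled sequence has the same length as the native one.

```python
def changed_fixed_positions(native_seq, sampled_seq, fixed_pos):
    """Return the 1-based positions in fixed_pos where the sample differs from the native sequence.

    Hypothetical helper for illustration; not part of sample.py.
    """
    if len(native_seq) != len(sampled_seq):
        raise ValueError("native and sampled sequences must have the same length")
    changed = []
    for pos in fixed_pos:
        if not 1 <= pos <= len(native_seq):
            raise ValueError(f"position {pos} is outside 1..{len(native_seq)}")
        # Position 1 is the first residue, so subtract 1 for Python indexing.
        if native_seq[pos - 1] != sampled_seq[pos - 1]:
            changed.append(pos)
    return changed


# Toy sequences: residue 3 was redesigned (allowed), residue 6 was meant to stay fixed.
print(changed_fixed_positions("MKTAYIAK", "MKSAYLAK", [1, 2, 6]))
```

An empty list means every protected residue was preserved in the sample.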
- Prepare your dataset as a CSV file with the following columns:
aa_seq: Amino acid sequence of the mutant variant
WT_Name: Path to the native PDB file
<feature>: Scalar label of the feature to optimize
wt_seq: Amino acid sequence of the native protein
mut_type: Mutation string (e.g. <native_aa><pos><mutant_aa>); separate simultaneous mutations with colons (e.g. <native_aa><pos><mutant_aa>:<native_aa><pos><mutant_aa>:...)
The file fireprot_homologue_free.csv is provided as an example. Note that to score this CSV, the PDB files must be downloaded and the 'WT_Name' column updated with their respective paths.
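Before scoring, it can help to check that each row's mut_type, wt_seq, and aa_seq agree with one another. The sketch below parses a mut_type string and applies it to a wild-type sequence. The helper names are hypothetical, and the code assumes that positions in mut_type are 1-indexed and that amino acids are single letters; verify both against your own data.

```python
def parse_mut_type(mut_type):
    """Split e.g. 'T3S:I6L' into [(native_aa, pos, mutant_aa), ...].

    Hypothetical helper; assumes single-letter amino acids and integer positions.
    """
    mutations = []
    for token in mut_type.split(":"):
        token = token.strip()
        native_aa, pos, mutant_aa = token[0], int(token[1:-1]), token[-1]
        mutations.append((native_aa, pos, mutant_aa))
    return mutations


def apply_mutations(wt_seq, mut_type):
    """Apply the mutations to wt_seq, assuming 1-indexed positions (an assumption)."""
    seq = list(wt_seq)
    for native_aa, pos, mutant_aa in parse_mut_type(mut_type):
        if not 1 <= pos <= len(seq):
            raise ValueError(f"position {pos} is outside 1..{len(seq)}")
        if seq[pos - 1] != native_aa:
            raise ValueError(
                f"expected {native_aa} at position {pos}, found {seq[pos - 1]}"
            )
        seq[pos - 1] = mutant_aa
    return "".join(seq)


# Toy row: the reconstructed sequence should equal the row's aa_seq value.
print(apply_mutations("MKTAYIAK", "T3S:I6L"))
```

A row is internally consistent when the reconstructed sequence matches its aa_seq entry; a ValueError usually points to an offset or numbering mismatch.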
- Run Scoring Script
python score.py --dataset_path <path_to_sequences_csv> --weights_path <path_to_model_weights> [additional_arguments]
Replace <path_to_model_weights> with the path to the trained ProteinDPO model or any ESM-IF1-compatible weights of your choice. If no weights_path is provided, the script defaults to the vanilla model weights.
Additional arguments:
--normalize: pass to normalize the likelihood with the wild-type sequence
--whole_seq: pass to use the likelihood of the entire sequence, not just the mutated residue(s)
--sum: pass to sum likelihoods instead of averaging them
--out_path: path for output csv
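To make the interaction of these flags concrete, here is a sketch of one plausible reading of them, operating on per-residue log-likelihoods. It is not the implementation in score.py: the function names are hypothetical, and treating normalization as subtraction of the wild-type score is an assumption.

```python
def aggregate_score(log_probs, mutated_pos, whole_seq=False, use_sum=False):
    """Combine per-residue log-likelihoods into one score.

    log_probs: one log-likelihood per residue.
    mutated_pos: 1-based positions of the mutated residues (assumed indexing).
    whole_seq / use_sum mirror the --whole_seq and --sum flags (assumed semantics).
    """
    if whole_seq:
        selected = list(log_probs)
    else:
        selected = [log_probs[p - 1] for p in mutated_pos]
    if not selected:
        raise ValueError("no residues selected for scoring")
    total = sum(selected)
    return total if use_sum else total / len(selected)


def normalized_score(mut_log_probs, wt_log_probs, mutated_pos, **kwargs):
    """Assumed reading of --normalize: mutant score minus wild-type score."""
    mutant = aggregate_score(mut_log_probs, mutated_pos, **kwargs)
    wild_type = aggregate_score(wt_log_probs, mutated_pos, **kwargs)
    return mutant - wild_type


# Toy per-residue log-likelihoods for a four-residue protein mutated at positions 2 and 4.
mutant_lp = [-1.0, -2.0, -0.5, -4.0]
wild_type_lp = [-1.0, -1.0, -0.5, -2.0]
print(aggregate_score(mutant_lp, [2, 4]))
print(normalized_score(mutant_lp, wild_type_lp, [2, 4]))
```

Under this reading, averaging keeps scores comparable across variants with different numbers of mutations, while summing favors variants with fewer scored residues.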
- Analyze Results
The CSV written to the path given by the --out_path argument contains the specified model likelihood for each sequence.
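One common way to analyze such an output is to check how well the model likelihoods rank the experimental labels. The sketch below computes a Spearman rank correlation using only the standard library. It is an illustrative suggestion, not a step the repository prescribes, and the function names are hypothetical. The column names of the output CSV are not specified here, so read them from your own file before extracting the two lists.

```python
def average_ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Toy data: model likelihoods and experimental labels for four variants.
likelihoods = [-3.2, -1.1, -2.5, -0.4]
labels = [0.1, 0.9, 0.3, 0.7]
print(spearman(likelihoods, labels))
```

To use it on real output, load the CSV with `csv.DictReader`, convert the likelihood column and your <feature> column to floats, and pass the two lists to `spearman`.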
Please cite the following preprint when referencing ProteinDPO.
@article {widatalla2024aligning,
author = {Widatalla, Talal and Rafailov, Rafael and Hie, Brian},
title = {Aligning protein generative models with experimental fitness via Direct Preference Optimization},
year = {2024},
doi = {10.1101/2024.05.20.595026},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/05/21/2024.05.20.595026},
journal = {bioRxiv}
}
This project is licensed under the MIT License - see the LICENSE file for details.