Fine-Tuned Protein Language Models for Targeted Antibody Sequence Generation
peleke-1 is a suite of antibody language models fine-tuned to generate antibody sequences that specifically target a given antigen sequence. By leveraging protein language models and general-purpose large language models, each peleke-1 model aims to streamline in silico antibody design.
To load a model:

```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = 'silicobio/peleke-phi-4'

# Load the LoRA adapter config and the custom tokenizer
config = PeftConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Load the base model, resize its embeddings to account for the added
# special tokens, then attach the fine-tuned LoRA weights
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, model_name).cuda()
```

Currently, the supported models are:

- peleke-phi-4, based on Microsoft's Phi-4 model.
- peleke-llama-3.1-8b-instruct, based on Meta's Llama 3.1 8B Instruct model.
- peleke-mistral-7b-instruct-v0.2, based on Mistral's 7B Instruct v0.2 model.
The full collection of peleke-1 models can be found on Hugging Face: silicobio/peleke-1. This collection also includes "merged" versions of the models (base model + LoRA weights and custom tokenizer) and GGUF versions (for easy local inference using tools like Ollama).
You can also fine-tune your own peleke-1-like model by following the fine-tuning scripts under scripts/.
The peleke-1 suite of models expects the amino acid sequence of an antigen protein as input. Epitope residues should be enclosed in `<epi>` and `</epi>` tokens.
If you prefer to mark epitope residues with square brackets `[ ]`, which are easier to type, the following function converts them to the expected format:
```python
import re

def format_prompt(antigen_sequence):
    # Replace each bracketed residue, e.g. [K], with <epi>K</epi> tags
    epitope_seq = re.sub(r'\[([A-Z])\]', r'<epi>\1</epi>', antigen_sequence)
    formatted_str = f"Antigen: {epitope_seq}<|im_end|>\nAntibody:"
    return formatted_str
```

For example, `AAM[K][R]HGL[D][N][Y]RG` gets formatted as `AAM<epi>K</epi><epi>R</epi>HGL<epi>D</epi><epi>N</epi><epi>Y</epi>RG`, using `<epi>` and `</epi>` as special tags.
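At inference time, a peleke-1 model completes the prompt after `Antibody:`. Since the training data delimits the heavy and light chains with `|` (see the dataset columns below), a generated completion can be split back into its two chains. The helper below is a minimal sketch under that assumption; `parse_antibody` is a hypothetical name, not part of the released package:

```python
def parse_antibody(completion: str) -> tuple[str, str]:
    """Split a generated 'heavy|light' completion into the two chain sequences."""
    # Drop anything after an end-of-text marker the model may emit
    # (<|im_end|> here, matching the Phi-style prompt template above)
    text = completion.strip().split("<|im_end|>")[0].strip()
    heavy, _, light = text.partition("|")
    return heavy.strip(), light.strip()

heavy, light = parse_antibody("QLQLQESGPG|EIVLTQSPGT<|im_end|>")
print(heavy, light)  # QLQLQESGPG EIVLTQSPGT
```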
The training dataset consists of paired antigen and antibody sequences, where the antigen is the target for which the antibody is generated. It was curated from SAbDab. Using PandaProt, epitope residues were highlighted in the antigen sequences using `[ ]`, which helped tune the model to generate antibody sequences that fold and bind to specific epitopes on the desired antigen. Note that multi-chain antigen IDs and sequences are delimited by `|` in the `antigen_ids` and `antigen_seqs` columns, and the heavy and light chain antibody sequences are delimited by `|` in the `antibody_seqs` column. We also provide the Fv portions of the antibody chains, which (for length consistency) were used to tune the models.
| Column Name | Description | Example |
|---|---|---|
| pdb_id | The PDB ID on Protein Data Bank | 8xa4 |
| h_chain_id | The chain ID of the antibody's heavy chain | C |
| l_chain_id | The chain ID of the antibody's light chain | D |
| antigen_ids | A \|-delimited list of chain IDs of the antigen chain(s) | A\|B |
| h_chain_seq | The heavy chain amino acid sequence | QLQLQESGPG… |
| l_chain_seq | The light chain amino acid sequence | EIVLTQSPGT… |
| antigen_seqs | The antigen sequence(s), \|-delimited | SCNGL...\|SCNGL… |
| antibody_seqs | The heavy and light chain sequences, \|-delimited | QLQLQ...\|EIVLT… |
| h_chain_fv_seq | The heavy chain sequence, trimmed to the Fv portion | QLQLQESGPG… |
| l_chain_fv_seq | The light chain sequence, trimmed to the Fv portion | EIVLTQSPGT… |
| antibody_fv_seqs | The heavy and light chain Fv sequences, \|-delimited | QLQLQ...\|EIVLT… |
| highlighted_epitope_seqs | The antigen sequence(s) with epitope residues encased in [ ] | ...WLI[D][Y]V[E][D][T]WGS… |
| epitope_residues | The list of epitope residues, \|-delimited, in "chain: AA #" format | A:ARG 176\|A:ASP 146\|A:ASP 150… |
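As a quick illustration of how these `|`-delimited columns can be unpacked in plain Python (the values below are the illustrative examples from the table, not real records):

```python
# Example column values in the shapes described above (illustrative only)
antigen_ids = "A|B"
antibody_fv_seqs = "QLQLQESGPG|EIVLTQSPGT"
epitope_residues = "A:ARG 176|A:ASP 146|A:ASP 150"

# Multi-chain fields split on the | delimiter
chains = antigen_ids.split("|")                    # ['A', 'B']
heavy_fv, light_fv = antibody_fv_seqs.split("|")   # heavy chain, light chain

# Each epitope residue entry is in "chain:RES number" format
parsed = []
for item in epitope_residues.split("|"):
    chain, residue = item.split(":")
    name, number = residue.split()
    parsed.append((chain, name, int(number)))      # e.g. ('A', 'ARG', 176)
```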
- See the prepared training dataset:
- In this repo: data/sabdab/sabdab_training_dataset.csv
- On Hugging Face: silicobio/peleke_antibody-antigen_sabdab
- Our data preparation scripts:
- Get sequences from PDB structures: data/sabdab/01_get_structure_seqs.ipynb
- Detect contacts and highlight epitopes: data/sabdab/02b_pandaprot_parallel.ipynb
- Generate the training dataset: data/sabdab/03_generate_dataset.ipynb
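For intuition, the contact-detection step flags antigen residues that sit close to the antibody in the solved structure. The sketch below is a simplified stand-in for that idea, not the PandaProt implementation; the 4.5 Å cutoff and all function names here are assumptions:

```python
import math

CONTACT_CUTOFF = 4.5  # Å; a common heavy-atom contact threshold (assumption)

def min_distance(res_atoms, other_atoms):
    """Minimum Euclidean distance between two sets of 3D atom coordinates."""
    return min(math.dist(a, b) for a in res_atoms for b in other_atoms)

def epitope_residue_indices(antigen_residues, antibody_atoms, cutoff=CONTACT_CUTOFF):
    """Indices of antigen residues with any atom within `cutoff` of the antibody."""
    return [
        i for i, atoms in enumerate(antigen_residues)
        if min_distance(atoms, antibody_atoms) <= cutoff
    ]

# Toy coordinates: residue 1 is within the cutoff of the antibody, residue 0 is not
antigen = [[(0.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
antibody = [(12.0, 0.0, 0.0)]
print(epitope_residue_indices(antigen, antibody))  # [1]
```

The flagged indices map back to residues in the antigen sequence, which are then wrapped in `[ ]` to produce the `highlighted_epitope_seqs` column.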
This work was presented at the 7th Molecular Machine Learning Conference (MoML @ MIT 2025) on October 22, 2025. Here's our poster:
If you use the peleke-1 models, tuning code, or the antibody-antigen dataset, we'd love for you to cite us.
Nicholas Santolla, Trey Pridgen, Prbhuv Nigam, and Colby T. Ford (2025). "peleke-1: A Suite of Protein Language Models Fine-Tuned for Targeted Antibody Sequence Generation." bioRxiv. https://doi.org/10.1101/2025.10.16.682644
```bibtex
@article{peleke-1,
  author = {Santolla, Nicholas and Pridgen, Trey and Nigam, Prbhuv and Ford, Colby T.},
  title = {peleke-1: A Suite of Protein Language Models Fine-Tuned for Targeted Antibody Sequence Generation},
  elocation-id = {2025.10.16.682644},
  year = {2025},
  doi = {10.1101/2025.10.16.682644},
  publisher = {Cold Spring Harbor Laboratory},
  eprint = {https://www.biorxiv.org/content/early/2025/10/16/2025.10.16.682644.full.pdf},
  journal = {bioRxiv}
}
```