
peleke-1 🦋

Fine-Tuned Protein Language Models for Targeted Antibody Sequence Generation.

Nicholas Santolla, Trey Pridgen, Prbhuv Nigam, and Colby T. Ford

Silico Biosciences

Preprint

About

peleke-1 is a suite of antibody language models fine-tuned to generate antibody sequences that specifically target given antigen sequences. By leveraging protein language models and general-purpose large language models, each peleke-1 model aims to streamline the process of in silico antibody design.

Generate Antibody Sequences

```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = 'silicobio/peleke-phi-4'
config = PeftConfig.from_pretrained(model_name)

# The tokenizer includes the custom <epi>/</epi> special tokens.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Load the base model, resize its embeddings to match the extended
# vocabulary, then attach the LoRA adapter weights.
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, model_name).cuda()

# Generate an antibody for a tagged antigen prompt (settings illustrative):
prompt = "Antigen: AAM<epi>K</epi><epi>R</epi>HGL<epi>D</epi><epi>N</epi><epi>Y</epi>RG<|im_end|>\nAntibody:"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```

The currently supported models can be found in the full peleke-1 collection on Hugging Face: silicobio/peleke-1. This collection also includes "merged" versions of the models (base model + LoRA weights and custom tokenizer) and GGUF versions (for easy local inferencing using tools like Ollama).

You can also fine-tune your own peleke-1-style model by following the fine-tuning scripts under scripts/.

Tokenization

The peleke-1 suite of models expects the amino acid sequence of an antigen protein as input, with epitope residues enclosed in <epi> and </epi> tokens. If you prefer the simpler square-bracket notation [ ], the following function converts it:

```python
import re

def format_prompt(antigen_sequence):
    # Rewrite [X] epitope markers as <epi>X</epi> special tokens.
    epitope_seq = re.sub(r'\[([A-Z])\]', r'<epi>\1</epi>', antigen_sequence)
    formatted_str = f"Antigen: {epitope_seq}<|im_end|>\nAntibody:"
    return formatted_str
```

For example, AAM[K][R]HGL[D][N][Y]RG will get formatted as AAM<epi>K</epi><epi>R</epi>HGL<epi>D</epi><epi>N</epi><epi>Y</epi>RG, using <epi> and </epi> as special tags.
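Going the other way, a small hypothetical helper (not part of the repository) can produce the bracketed form from a list of epitope positions before calling format_prompt:

```python
def bracket_epitopes(sequence, positions):
    """Wrap the 0-indexed epitope positions in [ ] (hypothetical helper)."""
    epitope_positions = set(positions)
    return ''.join(
        f'[{aa}]' if i in epitope_positions else aa
        for i, aa in enumerate(sequence)
    )

print(bracket_epitopes("AAMKRHGLDNYRG", [3, 4, 8, 9, 10]))
# AAM[K][R]HGL[D][N][Y]RG
```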

Training Dataset

The training dataset consists of paired antigen and antibody sequences, where the antigen is the target for which the antibody is generated. It was curated from SAbDab. Using PandaProt, epitope residues were highlighted in the antigen sequences using [ ], which helped tune the models to generate antibody sequences that fold and bind to specific epitopes on the desired antigen. Note that multi-chain antigen sequences are delimited by | in the antigen_seqs column, and the heavy and light chain antibody sequences are delimited by | in the antibody_seqs column. We also provide the Fv portions of the antibody chains, which (for length consistency) were used to tune the models.
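To illustrate the delimiting convention, the |-delimited columns split directly (the row values below are hypothetical stand-ins, not actual dataset entries):

```python
# Hypothetical example row illustrating the |-delimited convention;
# real values come from the SAbDab-derived dataset described above.
row = {
    "antigen_ids": "A|B",
    "antibody_fv_seqs": "QLQLQESGPG|EIVLTQSPGT",
}

antigen_chain_ids = row["antigen_ids"].split("|")        # one ID per antigen chain
heavy_fv, light_fv = row["antibody_fv_seqs"].split("|")  # heavy chain, then light chain

print(antigen_chain_ids)  # ['A', 'B']
print(heavy_fv)           # QLQLQESGPG
```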

| Column Name | Description | Example |
|---|---|---|
| `pdb_id` | The PDB ID on the Protein Data Bank | `8xa4` |
| `h_chain_id` | The chain ID of the antibody's heavy chain | `C` |
| `l_chain_id` | The chain ID of the antibody's light chain | `D` |
| `antigen_ids` | A \|-delimited list of chain IDs of the antigen chain(s) | `A\|B` |
| `h_chain_seq` | The heavy chain amino acid sequence | `QLQLQESGPG…` |
| `l_chain_seq` | The light chain amino acid sequence | `EIVLTQSPGT…` |
| `antigen_seqs` | The antigen sequence(s), \|-delimited | `SCNGL...\|SCNGL…` |
| `antibody_seqs` | The heavy and light chain sequences, \|-delimited | `QLQLQ...\|EIVLT…` |
| `h_chain_fv_seq` | The heavy chain sequence, trimmed to the Fv portion | `QLQLQESGPG…` |
| `l_chain_fv_seq` | The light chain sequence, trimmed to the Fv portion | `EIVLTQSPGT…` |
| `antibody_fv_seqs` | The heavy and light chain Fv sequences, \|-delimited | `QLQLQ...\|EIVLT…` |
| `highlighted_epitope_seqs` | The antigen sequence(s) with epitope residues encased in [ ] | `...WLI[D][Y]V[E][D][T]WGS…` |
| `epitope_residues` | The list of epitope residues, \|-delimited in a "chain: AA #" format | `A:ARG 176\|A:ASP 146\|A:ASP 150…` |
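For instance, the `epitope_residues` format can be parsed into (chain, residue name, residue number) tuples with a small hypothetical helper (not part of the repository):

```python
def parse_epitope_residues(field):
    """Parse a |-delimited 'chain: AA #' field, e.g. 'A:ARG 176|A:ASP 146'."""
    residues = []
    for entry in field.split("|"):
        chain, rest = entry.split(":")
        aa, number = rest.split()
        residues.append((chain.strip(), aa, int(number)))
    return residues

print(parse_epitope_residues("A:ARG 176|A:ASP 146|A:ASP 150"))
# [('A', 'ARG', 176), ('A', 'ASP', 146), ('A', 'ASP', 150)]
```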

Poster

This work was presented at the 7th Molecular Machine Learning Conference (MoML @ MIT 2025) on October 22, 2025. Here's our poster:

Citation

If you use the peleke-1 models, tuning code, or the antibody-antigen dataset, we'd love for you to cite us.

Santolla, Nicholas and Pridgen, Trey and Nigam, Prbhuv and Ford, Colby T. (2025). “peleke-1: A Suite of Protein Language Models Fine-Tuned for Targeted Antibody Sequence Generation.” bioRxiv. https://doi.org/10.1101/2025.10.16.682644
```bibtex
@article{peleke-1,
	author = {Santolla, Nicholas and Pridgen, Trey and Nigam, Prbhuv and Ford, Colby T.},
	title = {peleke-1: A Suite of Protein Language Models Fine-Tuned for Targeted Antibody Sequence Generation},
	elocation-id = {2025.10.16.682644},
	year = {2025},
	doi = {10.1101/2025.10.16.682644},
	publisher = {Cold Spring Harbor Laboratory},
	eprint = {https://www.biorxiv.org/content/early/2025/10/16/2025.10.16.682644.full.pdf},
	journal = {bioRxiv}
}
```