DELIGHT (Data-Enriched Label-Informed Generation of Homologous sequences using Transformers) is a slightly convoluted acronym that condenses the content of the paper:
"Data augmentation enables label-specific generation of homologous protein sequences".
This repository provides:
- The code to reproduce the full pipeline described in the paper.
- Jupyter notebooks to regenerate the figures used in the publication.
To use this code, you’ll need to install several dependencies:
- transformers by 🤗 HuggingFace
- Utilities from adabmDCA
- The general-purpose rbms package for training RBMs
- Our custom annaDCA package for label-informed RBM generation
We recommend using a conda environment with Python ≥ 3.11:
conda create -n delight python=3.12
conda activate delight
python -m pip install -r requirements.txt
Warning
Due to a temporary incompatibility between the rbms package and torch==2.6, rbms must be installed manually. First, clone the package from its GitHub repository. Move to a base directory and run:
git clone https://github.com/DsysDML/rbms.git
cd rbms
Then open the file pyproject.toml and, under the dependencies field, change "torch>=2.0.0, <=2.5.0" to "torch>=2.0.0, <=2.6.0".
After that, install the package manually inside the conda environment you just created:
python3 -m pip install .
The pipeline requires two input files:
- A training file with annotated sequences
- A query file with sequences to be annotated
The training file (CSV format) must include the following columns:
- header: sequence identifiers
- sequence: full-length sequences
- sequence_align: aligned versions of the sequences
- label: functional or structural annotations
Custom column names can be provided via CLI arguments.
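Before launching the pipeline, it can be useful to check that the training file exposes the expected columns. The following is a minimal sketch, not part of the repository: the helper `missing_columns` is a hypothetical name, and the default column names are the ones listed above.

```python
import csv

# Default column names listed above; pass other names if your file uses custom ones.
REQUIRED_COLUMNS = ("header", "sequence", "sequence_align", "label")


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        present = set(reader.fieldnames or [])
    return [name for name in required if name not in present]
```

An empty list means every required column was found. If you rely on custom column names, pass them through the `required` argument.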
Query file can be either:
- A FASTA file
- A CSV with at least header and sequence columns
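Both query formats carry the same information. As an illustration only (the pipeline accepts FASTA directly, so this conversion is optional), the sketch below turns a FASTA file into a CSV with header and sequence columns. The function names `read_fasta` and `fasta_to_csv` are hypothetical and do not come from the repository.

```python
import csv


def read_fasta(fasta_path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                # Sequences may span several lines; collect them all.
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_csv(fasta_path, csv_path):
    """Write a CSV with `header` and `sequence` columns; return the record count."""
    count = 0
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["header", "sequence"])
        for header, sequence in read_fasta(fasta_path):
            writer.writerow([header, sequence])
            count += 1
    return count
```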
To embed the query sequences using a protein Language Model (pLM) and predict their annotations, run:
python3 ./src/pLM_encoding.py \
--train <training_file> \
--query <query_file> \
--flag <flag_name> \
--zero-shot \
--bf16
- <flag_name> is a string added to output files for traceability.
- --zero-shot uses the foundation model without fine-tuning.
To see all available options:
python3 ./src/pLM_encoding.py -h
The column names can be customized with:
- --column_headers: defaults to header
- --column_sequences: defaults to sequence
- --column_labels: defaults to label
The script produces the following output files:
- .npz file with train embeddings
- .npz file with query embeddings, predicted labels, and confidence scores
- .csv file with query sequences and predicted labels
Once predictions are available, you can train a label-aware RBM model:
annadca train \
-d <data_file.csv> \
-o <output_dir> \
--column_names <column_headers> \
--column_sequences <column_sequences_align> \
--column_labels <column_labels> \
--nepochs 30000 \
--nchains 5000
- data_file.csv is the output from the previous embedding step.
- The model will be saved in <output_dir>.
Note
<column_sequences_align> must contain aligned sequences. Full-length sequences are not accepted.
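Sequences taken from the same multiple sequence alignment all have the same length, so a quick sanity check is to count the distinct lengths found in the aligned column. The sketch below is an assumption-laden helper, not part of annaDCA: `aligned_length_counts` is a hypothetical name, and the default column name `sequence_align` is the one used in the training file description. Passing this check is necessary but not sufficient for the sequences to be correctly aligned.

```python
import csv
from collections import Counter


def aligned_length_counts(csv_path, column="sequence_align"):
    """Count how many rows have each aligned-sequence length."""
    with open(csv_path, newline="") as handle:
        return Counter(len(row[column]) for row in csv.DictReader(handle))
```

If the returned counter holds more than one length, the column likely contains full-length (unaligned) sequences or rows from different alignments.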
Note
For better performance, we recommend merging the CSV file containing predicted labels with the original training file used in pLM_encoding.py.
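One possible way to do the merge is sketched below, using only the standard library. It assumes that both CSV files use the same column names for the same quantities. The function name `merge_csv_files` is hypothetical and not part of the repository. Columns missing from one of the files are left empty, so rows without aligned sequences still need to be completed before RBM training.

```python
import csv


def merge_csv_files(train_path, predicted_path, merged_path):
    """Concatenate the rows of two CSV files, keeping the union of their columns."""
    rows, fieldnames = [], []
    for path in (train_path, predicted_path):
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            # Preserve column order: training columns first, then any new ones.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)
    with open(merged_path, "w", newline="") as out:
        # restval fills columns that a row does not provide with an empty string.
        writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```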
To generate new sequences based on specific annotations using the trained RBM model, refer to:
./notebooks/Conditioned_generation.ipynb
To replicate the results from the paper:
- Create the necessary directories and download the datasets:
cd DELIGHT
mkdir experiments && cd experiments
wget https://zenodo.org/records/15979182/files/datasets.zip
unzip datasets.zip && rm datasets.zip
mkdir models && cd ..
- Run the script to train the models and compute embeddings:
chmod +x ./bash/reproduce_paper_results.sh
./bash/reproduce_paper_results.sh
Warning
This step is computationally intensive and may take a while.
- Section II-B (Data augmentation): use ./notebooks/Classification.ipynb
- Section II-C (Label-specific generation): use ./notebooks/Conditioned_generation.ipynb and ./notebooks/False_positives_analysis.ipynb
- Additional visualizations: use ./notebooks/Additional_figures.ipynb
@misc{rosset2025dataaugmentationenableslabelspecific,
title={Data augmentation enables label-specific generation of homologous protein sequences},
author={Lorenzo Rosset and Martin Weigt and Francesco Zamponi},
year={2025},
eprint={2507.15651},
archivePrefix={arXiv},
primaryClass={q-bio.QM},
url={https://arxiv.org/abs/2507.15651},
}