This repository contains code and datasets related to my Bachelor’s Thesis, completed as part of the Bachelor’s Degree in Bioinformatics (BDBI) at ESCI-UPF in Barcelona, conducted at Nostrum Biodiscovery and awarded Best Bachelor's Thesis of the 2022/2023 Course 🏅.
🔗 The full thesis can be accessed here.
This project focuses on protein sequence generation using deep learning models. Initially, two architectures were compared:
- pGAN: A self-attention-based GAN variant.
- pLM: A Transformer-based protein language model pre-trained on evolutionary data.
Then, a pLM was pre-trained on a subset of the UniRef50 dataset using a masked language modeling (MLM) task. Fine-tuning on specific protein families, such as bacterial MDH, demonstrated that the model generates sequences that align with natural protein principles. We also explored adding conditioning tags to guide the generation process based on enzymatic reactions, although this provided minimal improvements.
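To make the masked language modeling objective concrete, the sketch below shows how residues in a protein sequence can be corrupted for this task; the mask token, example sequence, and 15% masking rate are illustrative assumptions, not the repository's exact implementation.

```python
# Minimal illustration of masked language modeling (MLM) on a protein sequence;
# the mask token and 15% masking rate are assumptions for this sketch.
import random

MASK_TOKEN = "<mask>"

def mask_sequence(seq, mask_prob=0.15):
    """Randomly hide ~15% of residues and record the ground truth to predict."""
    tokens = list(seq)
    targets = {}
    for i, residue in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = residue       # residue the model must recover
            tokens[i] = MASK_TOKEN     # corrupted input fed to the model
    return tokens, targets

corrupted, targets = mask_sequence("MKVLAAGITGLAGSDELV")
print(corrupted)   # input with masked positions
print(targets)     # positions and residues the model is trained to predict
```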
Figure: General scheme of the transfer learning-based pretraining and fine-tuning approach. This approach leverages knowledge and representations learned from a large dataset to improve performance on a specific task with a smaller, task-specific dataset. Initially, a neural network is trained by randomly initializing its weights and optimizing them to minimize task-related errors. Once training is satisfactory, the network weights are saved. To train a new network for a different task and dataset, the saved weights from the previous network are then used as the initial values instead of a random initialization. In this scheme, the first network is the pre-trained network and the second network undergoes fine-tuning.
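A minimal PyTorch sketch of this scheme, assuming a tiny stand-in model rather than the repository's architecture: the weights saved after pretraining simply replace the random initialization of the fine-tuning run.

```python
# Pretraining -> fine-tuning via saved weights; the model below is a stand-in,
# not the repository's protein language model.
import torch
import torch.nn as nn

def build_model():
    # Placeholder architecture (29 tokens, matching the default vocabulary size).
    return nn.Sequential(nn.Embedding(29, 64), nn.Linear(64, 29))

# 1) Pretraining: weights start from a random initialization, are optimized on
#    the large dataset (e.g. the UniRef50 subset), and are then saved.
pretrained = build_model()
# ... pretraining loop would go here ...
torch.save(pretrained.state_dict(), "pretrained_weights.pt")

# 2) Fine-tuning: a new model loads the saved weights instead of a random
#    initialization, then keeps training on the smaller, family-specific
#    dataset (e.g. bacterial MDH sequences).
finetuned = build_model()
finetuned.load_state_dict(torch.load("pretrained_weights.pt"))
```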
- Install the required dependencies from `requirements.yaml`: `conda env create -f requirements.yaml`
- `configs/`: Configuration templates for both pGAN and pLM.
- `data/`: Sample datasets (in `.tar.gz` format) used in the experiments.
- `src/`: Source code for all the models, training, and inference scripts.
- Example datasets are available in the `data/` directory.
- You can use custom datasets for training, but they must be in TSV/CSV format and compressed into `.tar.gz`. Be sure to update the config file to specify the correct column index for the protein sequences (a packaging sketch follows below).
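As an illustration, a custom dataset could be packaged like this; the file name and column name are assumptions for this sketch, not requirements beyond the TSV/CSV plus `.tar.gz` format described above.

```python
# Sketch of packaging a custom dataset as a compressed TSV for training;
# the file name and column name used here are illustrative only.
import tarfile
import pandas as pd

sequences = ["MKVLAAGITGLAG", "MSTNPKPQRKTKR"]  # toy protein sequences
pd.DataFrame({"sequence": sequences}).to_csv("my_dataset.tsv", sep="\t", index=False)

# Compress the TSV into a .tar.gz archive.
with tarfile.open("my_dataset.tar.gz", "w:gz") as tar:
    tar.add("my_dataset.tsv")

# Point the config file at my_dataset.tar.gz and set the column index of the
# sequence column (0 in this example).
```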
- Customize the pLM configuration in `configs/plm_config.yaml` (select the pretraining or fine-tuning task, and adjust the dataset path, batch size, devices, etc.); a sketch for inspecting the configuration programmatically is shown below.
- Log into your Weights & Biases account to track training runs: `wandb login <your-API-key>`
- To pretrain or fine-tune the model, run `python src/plm/main.py`.

Note: In SLURM environments, prepend `srun` to the Python command to ensure proper parallel execution.
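If you prefer to double-check the configuration from Python before launching a run, a small sketch follows; the key names queried here are hypothetical placeholders, and the authoritative ones are whatever is defined in `configs/plm_config.yaml`.

```python
# Quick sanity check of the pLM configuration before launching a run; the key
# names queried here are hypothetical examples, not the repository's schema.
import yaml

with open("configs/plm_config.yaml") as fh:
    config = yaml.safe_load(fh)

# Print everything, then spot-check the fields typically edited by hand
# (task, dataset path, batch size, devices).
print(yaml.safe_dump(config, sort_keys=False))
for key in ("task", "dataset_path", "batch_size", "devices"):  # hypothetical keys
    print(key, "->", config.get(key, "<not present under this name>"))
```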
To generate protein sequences after training, run:

`python src/plm/inference.py -n 100 -checkpoint data/checkpoints/model_weights.ckpt`

Additional options include:

- `-k`: Beam size for the beam search algorithm.
- `-num_tokens`: Number of tokens used in training (default is 29; this may change if conditioning tags are used).
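For example, a run combining these options (the values are illustrative) might look like:

```bash
# Generate 100 sequences with a beam size of 3 from a trained checkpoint;
# the paths and values are illustrative.
python src/plm/inference.py -n 100 -k 3 -num_tokens 29 \
    -checkpoint data/checkpoints/model_weights.ckpt
```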
- Customize the pGAN configuration in `configs/pgan_config.yaml` (adjust the dataset path, batch size, etc.).
- To train the GAN model, run `python src/pgan/main.py` (a conceptual sketch of the adversarial training step is shown below).

Note: The pGAN does not support multi-device training or libraries such as DeepSpeed or Wandb. Training logs are saved locally, and loss curves are output as images.
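For context, the sketch below shows the generic adversarial training step that a sequence GAN such as pGAN builds on. It is not the repository's implementation (which adds self-attention layers), and every layer size here is a placeholder.

```python
# Conceptual sketch of one adversarial training step for a sequence GAN;
# NOT the repository's pGAN code, just the generator/discriminator loop it
# builds on. All sizes are placeholders.
import torch
import torch.nn as nn

noise_dim, seq_len, vocab = 64, 128, 21
G = nn.Sequential(nn.Linear(noise_dim, seq_len * vocab))        # generator stub
D = nn.Sequential(nn.Linear(seq_len * vocab, 1), nn.Sigmoid())  # discriminator stub
bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

real = torch.rand(8, seq_len * vocab)   # stand-in for a batch of encoded real sequences
noise = torch.randn(8, noise_dim)

# Discriminator step: push real sequences toward label 1, generated ones toward 0.
fake = G(noise).detach()
loss_d = bce(D(real), torch.ones(8, 1)) + bce(D(fake), torch.zeros(8, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: try to make the discriminator label generated sequences as real.
loss_g = bce(D(G(noise)), torch.ones(8, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```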
The work initiated in this project was extended at Nostrum Biodiscovery and contributed to the publication of "Efficient and Accurate Sequence Generation with Small-Scale Protein Language Models." This paper introduces a Small-Scale Protein Language Model (SS-pLM), which significantly reduces computational requirements by using fewer parameters and a smaller dataset, while achieving performance comparable to larger models in protein sequence generation. Explore the full publication here 📘!