Generating Novel Proteins with Biological Constraints Using Deep Conditional Language Modeling

This repository contains the code and datasets from my Bachelor’s Thesis, completed as part of the Bachelor’s Degree in Bioinformatics (BDBI) at ESCI-UPF in Barcelona, carried out at Nostrum Biodiscovery, and awarded Best Bachelor's Thesis of the 2022/2023 course 🏅.

🔗 The full thesis can be accessed here.

🌟 Project Overview

This project focuses on protein sequence generation using deep learning models. Initially, two architectures were compared:

  • pGAN: A self-attention-based GAN variant.
  • pLM: A Transformer-based protein language model pre-trained on evolutionary data.

Then, a pLM was pre-trained on a subset of the UniRef50 dataset using a masked language modeling (MLM) objective. Fine-tuning on specific protein families, such as bacterial malate dehydrogenase (MDH), showed that the model generates sequences consistent with the properties of natural proteins. We also explored adding conditioning tags to guide generation toward specific enzymatic reactions, although this provided only minimal improvements.
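As a rough illustration of the MLM objective, the sketch below randomly hides a fraction of residues in a sequence and keeps the hidden residues as prediction targets. The mask rate, mask token, and tokenization are illustrative assumptions, not the repository's actual implementation.

    # Minimal MLM masking sketch for a protein sequence (illustrative only; the
    # actual tokenizer and masking scheme in this repository may differ).
    import random

    def mask_sequence(seq, mask_rate=0.15, mask_token="<mask>"):
        """Hide ~mask_rate of residues; the model is trained to recover them."""
        inputs, targets = [], []
        for residue in seq:
            if random.random() < mask_rate:
                inputs.append(mask_token)   # what the model sees
                targets.append(residue)     # what the model must predict
            else:
                inputs.append(residue)
                targets.append(None)        # position not scored by the loss
        return inputs, targets

    inputs, targets = mask_sequence("MKVLTAAGREILDSK")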

General scheme of the transfer learning-based pretraining and fine-tuning approach. Transfer learning leverages the knowledge and representations learned from a large dataset to improve performance on a specific task with a smaller, task-specific dataset. First, a network is trained from randomly initialized weights, which are optimized to minimize the task error; once training is satisfactory, the weights are saved. To train a network for a new task and dataset, the saved weights are used as the initial values instead of random initialization: the first network is the pre-trained model, and the second one undergoes fine-tuning.
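A minimal PyTorch sketch of this pretraining-to-fine-tuning hand-off is shown below; the model, file names, and sizes are illustrative assumptions rather than the repository's actual code.

    import torch
    import torch.nn as nn

    # Stand-in for the pLM encoder (illustrative architecture and sizes).
    model = nn.TransformerEncoderLayer(d_model=128, nhead=4)

    # Pretraining phase: train on the large dataset (e.g. UniRef50), then save the weights.
    torch.save(model.state_dict(), "pretrained_plm.pt")

    # Fine-tuning phase: initialize a new run from the saved weights instead of
    # random initialization, then continue training on the smaller family-specific set.
    finetuned = nn.TransformerEncoderLayer(d_model=128, nhead=4)
    finetuned.load_state_dict(torch.load("pretrained_plm.pt"))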

⚙️ Setup

🛠 Requirements

  • Install the required dependencies from requirements.yaml:
    conda env create -f requirements.yaml

🗂 Repository Structure

  • configs/: Configuration templates for both pGAN and pLM.
  • data/: Sample datasets (in .tar.gz format) used in the experiments.
  • src/: Source code for all the models, training, and inference scripts.

📊 Datasets

  • Example datasets are available in the data/ directory.
  • You can use custom datasets for training, but they must be in TSV/CSV format and compressed into .tar.gz. Be sure to update the config file to specify the correct column index for the protein sequences (a packaging sketch is shown below).
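As referenced above, a custom dataset can be packaged along these lines; the file names and the single "sequence" column are hypothetical and should be adapted to your data and config.

    # Package a custom dataset as a .tar.gz the training scripts can point to
    # (illustrative file names and column layout).
    import tarfile
    import pandas as pd

    df = pd.DataFrame({"sequence": ["MKVLTAAGREILDSK", "MADEEKLPPGWEKRM"]})
    df.to_csv("my_family.csv", index=False)

    with tarfile.open("data/my_family.tar.gz", "w:gz") as tar:
        tar.add("my_family.csv")
    # Then point the dataset path in the config to data/my_family.tar.gz and set
    # the column index of the sequence column.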

🚀 Usage

1. Pretrain and Fine-tune pLM

  1. Customize the pLM configuration in configs/plm_config.yaml (select pretraining/fine-tuning task, adjust dataset path, batch size, devices, etc.).
  2. Log into your Weights and Biases account to track training runs:
    wandb login <your-API-key>
  3. To pretrain or fine-tune the model:
    python src/plm/main.py

Note: For SLURM environments, add srun before the Python command to ensure proper parallel execution.

2. Generate Sequences with pLM

To generate protein sequences after training:

python src/plm/inference.py -n 100 -checkpoint data/checkpoints/model_weights.ckpt

Additional options include:

  • -k: Beam size for the beam search decoding algorithm (a generic sketch is shown below).
  • -num_tokens: Number of tokens used in training (default is 29; this may change if conditioning tags are used).
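For reference, beam search keeps the k highest-scoring partial sequences at each decoding step. The sketch below is a generic illustration with a toy scoring function, not the repository's decoder.

    # Generic beam-search sketch (a toy scoring function stands in for the pLM's
    # next-token log-probabilities).
    VOCAB = list("ACDEFGHIKLMNPQRSTVWY")

    def toy_log_prob(prefix, token):
        return -((hash(prefix + token) % 100) + 1) / 100.0

    def beam_search(k=3, length=10):
        beams = [("", 0.0)]                     # (sequence so far, cumulative log-prob)
        for _ in range(length):
            candidates = [
                (seq + tok, score + toy_log_prob(seq, tok))
                for seq, score in beams
                for tok in VOCAB
            ]
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        return beams[0][0]                      # highest-scoring sequence

    print(beam_search(k=3, length=10))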

3. Train pGAN

  1. Customize the pGAN configuration in configs/pgan_config.yaml (adjust dataset path, batch size, etc.).

  2. To train the GAN model:

    python src/pgan/main.py

Note: pGAN training does not support multi-device execution or integrations such as DeepSpeed or Weights & Biases. Training logs are saved locally, and loss curves are output as images.
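For context, a single adversarial training step of a sequence GAN generally looks like the sketch below; the architectures, shapes, and hyperparameters are placeholders and do not reflect the actual pGAN implementation.

    import torch
    import torch.nn as nn

    vocab, seq_len, latent, batch = 25, 128, 64, 8
    generator = nn.Sequential(nn.Linear(latent, seq_len * vocab), nn.Tanh())
    discriminator = nn.Linear(seq_len * vocab, 1)
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    real = torch.rand(batch, seq_len * vocab)   # stand-in for encoded real sequences
    noise = torch.randn(batch, latent)

    # Discriminator step: score real sequences high and generated ones low.
    fake = generator(noise).detach()
    d_loss = bce(discriminator(real), torch.ones(batch, 1)) + \
             bce(discriminator(fake), torch.zeros(batch, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: update the generator to fool the discriminator.
    g_loss = bce(discriminator(generator(noise)), torch.ones(batch, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()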

📖 Extended Work and Publication

The work initiated in this project was extended at Nostrum Biodiscovery and contributed to the publication of "Efficient and Accurate Sequence Generation with Small-Scale Protein Language Models." This paper introduces a Small-Scale Protein Language Model (SS-pLM), which significantly reduces computational requirements by using fewer parameters and a smaller dataset, while achieving performance comparable to larger models in protein sequence generation. Explore the full publication here 📘!


Acknowledgments

ESCI-UPF · Nostrum Biodiscovery
