Protein residues contact map prediction usign ESM2 embeddings

This Project is using convolutional model to predict contact map of protein residues using ESM2 embeddings and similar proteins structural information (inter residue distances and orientations).

Example Results

Model architecture

Model gets 2 tensors as an input

pairwise embeddings created using ESM2 model. In our case I am using esm2_t6_8M_UR50D model, because it is the smalles one. it takes as an input protein sequence in string format and gives a tensor as an output with shape (L + 2, D) where L is the length of the sequence and D is the size of the embeddings, we are removing first and last embeddings, because they are embeddings of start token and end token. After that we are creating a pairwise embeddings tensor with shape (L, L, 2 * D), where in the (i, j) coordinates we have embeddings of i-th and j-th residues
as second input model gets tensor which contains structural information of similar protein sequences. for each similar sequence we are creating a tensor with shape (L, L, 3) where as a first channel it contains distance information between residues, second channel is theta (polar angle) and third channel is phi (azimuthal orientation). Then we are concatenating That type of information for k number of similar protein sequences and at the end we have tensor of shape (L, L, 3 * k)

Then we are passing each input by their convolutional backbones, pairwise embeddings we are downsampling to have at the end tensor of shape (L, L, 128) and structural tensor we are upsampling to have tensor of shape (L, L, 64)

After we are concatenating them and passing through the fusion part of the model which is again a convolutional model. At the end we have matrix of shape (L, L)

Training

Preprocessing

Creating a BLAST database from train dataset which I will use later to find similar protein sequences.
Computing distances between residues Cα atoms and computing contact map (Two residues are considered in contact if the distance between their Cα atoms is below 8 Å (angstroms)). This will be our ground truth
Computing ESM2 embeddings and creating pairwise embeddings from input protein sequence.
Using BLAST database to find top k similar sequences (remove current input sequence if it is in the database).
Align those sequences with query sequence and for each similar sequence get the distance and orientation information
Normalize inputs

Training Information

Because contact matrix is symmetric and its diagonal is always ones, I decided to add a head to the model which will help him to predict only upper triangle matrix without diagonal, and then we can create our tensor from that upper triangle matrix. It makes problem easier and helped model to train faster and better.

As a loss I used weighted sum of Dice and binary cross entropy losses. Also I gave a weight to the positive class of binary cross entropy loss because of class imbalance in our data (Contact maps are very sparse).

AdamW is used as an optimizer and as a scheduler I used OneCycleLR with cosine annealing strategy.

F1 score and Accuracy are used as metrics

You can find training example in train.ipynb notebook

Limitations

In the code I did some simplifactions and added limitations to be able to experiment with limited compute access (I was working only with one Nvidia 3050Ti and sometime in google colab notebook)

I have added variable max_sequence_lenght which gets only first 100 residues from the sequence
I am working only with first chain of the protein from PDB files
I am using smallest ESM2 architecture
I trained only on a small subset of the dataset to be sure that model is learning

TODOs

There are a lot of improvements that can be done on the project :)

Add batched inference (currently we are unable to do batched inference due to variabel length of the input sequences)
Experiment with different model architectures
Train on longer sequences and on a bigger dataset
Make dataset module faster, because there are currently a lot of work is done in runtime and it can be preprocessed differently to make training faster
Experiment with learning rate and scheduler to find best choise
Fix warning from the BLAST

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
dataset		dataset
images		images
model		model
.gitignore		.gitignore
README.md		README.md
lightning_module.py		lightning_module.py
requirements.txt		requirements.txt
train.ipynb		train.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Protein residues contact map prediction usign ESM2 embeddings

Example Results

Model architecture

Training

Preprocessing

Training Information

Limitations

TODOs

About

Uh oh!

Releases

Packages

Uh oh!

Languages

DavitGrigoryan132/ContactMapPredictor

Folders and files

Latest commit

History

Repository files navigation

Protein residues contact map prediction usign ESM2 embeddings

Example Results

Model architecture

Training

Preprocessing

Training Information

Limitations

TODOs

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages