This project uses neural networks to predict protein secondary structure and compares several RNN architectures to determine which performs best.
This project was written by Florian, Jun Kai and Min Jie.
The dataset for this project is sourced from Alfrandom - Protein Secondary Structure (Kaggle).
Model | Accuracy | Description | Link |
---|---|---|---|
GRU | 89% | Selected for its efficiency and ability to capture dependencies in sequential data. Faster to train than LSTMs while still providing strong performance. | View Notebook |
LSTM | 89% | Effective in handling long-range dependencies and mitigating the vanishing gradient problem. Suited for complex sequence prediction tasks. | View Notebook |
BI-GRU | 91% | Improves context capture by considering sequences in both forward and backward directions. Enhances performance in tasks like text classification. | View Notebook |
BI-LSTM | 91% | Processes data in both directions, leveraging information from past and future states. Ideal for nuanced understanding in tasks like language modeling and sentiment analysis. | View Notebook |
Comparison done in Google Colab.
Note: All models were run using the same hyperparameters and on the same hardware (L4 in Colab).
The main dataset lists peptide sequences and their corresponding secondary structures. It is a tabular transformation of data downloaded from RCSB PDB on 2018-06-06. If you download the file at a later date, the number of sequences in it will probably have increased.
Column | Description |
---|---|
pdb_id | The ID used to locate the peptide's entry on RCSB PDB. |
chain_code | Code indicating a specific chain when a protein consists of multiple peptides (chains). |
seq | Amino acid sequence of the peptide. |
sst8 | Eight-state (Q8) secondary structure classification. |
sst3 | Three-state (Q3) secondary structure classification. |
len | Length of the peptide sequence. |
has_nonstd_aa | Indicates whether the peptide contains nonstandard amino acids (B, O, U, X, or Z). |
- Sub-selection: Only sequences containing standard amino acids and with length between 100 and 300 (exclusive) are selected.
- Kmers: The amino acid sequences (seq) are split into k-mers of length 3 using the seq2kmers function. This divides each sequence into overlapping substrings, helping the model capture local sequence patterns.
- Input Tokenization: The k-mers are then tokenized for embedding.
- Target tokenization: The target secondary structure labels are one-hot encoded to prepare them for classification.
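The selection and k-mer steps above can be sketched as follows. `seq2kmers` is the function named in the notebook, but this stride-1 sliding-window implementation and the pandas column names (taken from the dataset table) are assumptions, not the notebook's exact code:

```python
import pandas as pd

def seq2kmers(seq, k=3):
    """Split a sequence into overlapping k-mers, e.g. "MKVLA" -> ["MKV", "KVL", "VLA"]."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def select_sequences(df):
    """Keep only standard-amino-acid sequences with 100 < length < 300."""
    mask = (~df["has_nonstd_aa"]) & (df["len"] > 100) & (df["len"] < 300)
    return df[mask]
```

The overlapping windows mean each k-mer shares two residues with its neighbours, which is what lets the model pick up local sequence patterns around every position.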
Component | Description | Input Dimensions | Output Dimensions |
---|---|---|---|
Embedding Layer | Maps each tokenized k-mer to a 128-dimensional vector. `n_words` is the vocabulary size from tokenization. | (MAX_AMINO_ACID_LEN,) | (MAX_AMINO_ACID_LEN, 128) |
{MODEL} Layers | Two {MODEL} layers, each with 128 units, process sequence data and capture dependencies. Both layers return sequences, enabling per-residue secondary structure predictions. | (MAX_AMINO_ACID_LEN, 128) | (MAX_AMINO_ACID_LEN, 128) |
Dropout Layers | Dropout of 0.2 after each {MODEL} layer to reduce overfitting. | (MAX_AMINO_ACID_LEN, 128) | (MAX_AMINO_ACID_LEN, 128) |
Dense Output Layer | A softmax layer outputs probabilities across `n_ssts` classes (three-state or eight-state classification). | (MAX_AMINO_ACID_LEN, 128) | (MAX_AMINO_ACID_LEN, n_ssts) |
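A minimal Keras sketch of this architecture, following the table above. The constant values (`MAX_AMINO_ACID_LEN`, `N_WORDS`, `N_SSTS`) are placeholders, not the notebook's actual numbers:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed values -- the real ones come from the preprocessing step.
MAX_AMINO_ACID_LEN = 300   # padded sequence length
N_WORDS = 8000             # k-mer vocabulary size from the tokenizer
N_SSTS = 3                 # 3 for Q3, 8 for Q8 classification

def build_model(rnn=layers.GRU):
    """Embedding -> two recurrent layers with dropout -> per-residue softmax."""
    inp = layers.Input(shape=(MAX_AMINO_ACID_LEN,))
    x = layers.Embedding(input_dim=N_WORDS, output_dim=128)(inp)
    for _ in range(2):
        # return_sequences=True keeps one output vector per residue,
        # so the model can classify every position in the sequence.
        x = rnn(128, return_sequences=True)(x)
        x = layers.Dropout(0.2)(x)
    out = layers.Dense(N_SSTS, activation="softmax")(x)
    return models.Model(inp, out)
```

Passing `rnn=layers.LSTM` gives the LSTM variant; for the bidirectional models, each recurrent layer would additionally be wrapped in `layers.Bidirectional`, which doubles that layer's output width.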
- Alfrandom: For providing the dataset through Alfrandom - Protein Secondary Structure.
Thank you!