This repository contains the datasets, scripts, and analyses for the Laboratory of Bioinformatics II course project, focusing on the prediction of secretory signal peptides.

sofianatale/LB2_project_Group_5

 
 

LAB2 Project Group 5 – Prediction of Secretory Signal Peptides

Table of Contents

  • Introduction – Signal Peptide Prediction
  • Data Collection
  • Data Preparation
  • Data Analysis
  • Von Heijne Method
  • Evaluation of Classifiers
  • Feature extraction
  • SVM for SP detection
  • Training Final Models and Analysis of Benchmarking Results


About

This repository contains the scripts, notebooks, and documentation for the Laboratory of Bioinformatics II project.
The project addresses the prediction of secretory signal peptides in eukaryotic proteins, a crucial step for protein function prediction and subcellular localization.


Abstract

Predicting secretory signal peptides (SPs) is fundamental for understanding protein localization and function.
Traditional experimental methods are accurate but time-consuming, motivating the adoption of computational approaches.

This project compares three main strategies for SP prediction:

  1. Motif-based statistical approach (von Heijne, 1986)
  2. Support Vector Machine (SVM)-based classification (scikit-learn)
  3. Deep Learning approach (ESM-2 + MLP)

The aim is to evaluate their performance using curated datasets from UniProtKB, applying cross-validation and benchmarking on a blind test set.


Project Work Plan

| Task | Description |
| --- | --- |
| Retrieve datasets | Collect relevant protein datasets from UniProtKB |
| Preprocess datasets | Prepare data for cross-validation and benchmarking |
| Analyze statistics | Compute and visualize dataset statistics |
| Feature extraction | Extract relevant features for classification |
| von Heijne's algorithm | Implement cleavage site prediction method |
| SVM classifier | Train and test Support Vector Machine model |
| Evaluation | Assess methods with cross-validation and blind test set |
| Deep Learning approach | Implement ESM-2 embeddings + MLP classifier with Optuna optimization |
| Reporting | Discuss and interpret results |
| Manuscript | Prepare manuscript in scientific article format |

Installation Guide

Python 3.8+ is required, along with the following main libraries: numpy, pandas, scikit-learn, matplotlib, seaborn.

To set up the environment and run the project, follow these steps:

  1. Clone the repository
     git clone https://github.com/Martinaa1408/LB2_project_Group_5.git
     cd LB2_project_Group_5
  2. Create and activate a virtual environment
     python -m venv env
     source env/bin/activate        # On Windows use: env\Scripts\activate
  3. Install the required libraries
     pip install numpy pandas scikit-learn matplotlib seaborn
  4. Run the main scripts

Repository Structure


Results Summary

This section summarizes the core data, feature extraction, and predictive performance of the three implemented models: Von Heijne (rule-based), SVM (RBF kernel), and Deep Learning (ESM-2 + MLP).

Data Collection & Curation

| Phase | Description | SP⁺ | SP⁻ | Total | Notes |
| --- | --- | --- | --- | --- | --- |
| Raw UniProtKB | Manually reviewed S. cerevisiae proteins with signal peptide annotation | 2,932 | 20,615 | 23,547 | Experimental annotations only |
| After MMseqs2 clustering (30% identity) | Non-redundant representative sequences | 1,093 | 8,934 | 10,027 | Removes homolog redundancy |
| Final dataset split | 80% training + 20% independent benchmark | 874 (train) / 219 (bench) | 7,147 (train) / 1,787 (bench) | 8,021 (train) / 2,006 (bench) | Balanced and taxonomically representative |
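
The 80/20 split in the last row can be sketched with scikit-learn's `train_test_split`, stratifying on the class label so that both partitions keep the SP⁺/SP⁻ ratio. This is a minimal illustration, not the project's actual procedure: the labels are synthetic placeholders built from the post-clustering counts, and the seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels matching the class counts after clustering
# (assumption: 1 = SP+, 0 = SP-). Real sequence IDs would replace the indices.
n_pos, n_neg = 1093, 8934
labels = np.array([1] * n_pos + [0] * n_neg)
indices = np.arange(labels.size)

# Stratified 80/20 split; random_state is an arbitrary illustrative seed.
train_idx, bench_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42
)

train_pos = int(labels[train_idx].sum())
bench_pos = int(labels[bench_idx].sum())
print(f"train: {train_pos} SP+ / {train_idx.size - train_pos} SP-")
print(f"bench: {bench_pos} SP+ / {bench_idx.size - bench_pos} SP-")
```

Stratification matters here because SP⁺ sequences are only about 11% of the data; an unstratified split could shift that ratio noticeably in the smaller benchmark partition.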

Feature Extraction Summary

| Feature Category | Description | Example Features | Count |
| --- | --- | --- | --- |
| Amino acid composition | Residue frequencies in N-terminal region (–30 to +2 aa) | comp_L, comp_A, comp_V | 19 |
| Hydrophobicity | Kyte–Doolittle mean & max | hydro_mean, hydro_max | 2 |
| Charge distribution | Mean charge, max charge | charge_mean, charge_max | 2 |
| Secondary structure | α-helix propensity (Chou–Fasman scale) | alpha_mean, alpha_max | 2 |
| Transmembrane propensity | Mean & max TM index | trans_mean, trans_max | 2 |
| Residue size/volume | Mean and maximum residue volume | size_mean, size_max | 2 |
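
As a rough illustration of how the composition and hydrophobicity features could be computed, the sketch below derives `comp_*`, `hydro_mean`, and `hydro_max` from the N-terminal region using the Kyte–Doolittle scale. The function name, the region length (`n_term`), and the sliding-window width (`window`) are assumptions made for this example, and it computes all 20 composition values for simplicity, whereas the table lists 19.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def extract_features(seq, n_term=30, window=5):
    """Composition and hydrophobicity features for the N-terminal region.

    n_term and window are illustrative assumptions, not the project's values.
    """
    region = seq[:n_term].upper()
    if not region:
        raise ValueError("sequence must not be empty")

    # Relative frequency of each standard residue in the region.
    counts = Counter(region)
    features = {f"comp_{aa}": counts.get(aa, 0) / len(region) for aa in KD}

    # Sliding-window average of the hydropathy scale along the region;
    # non-standard residues contribute 0.0.
    values = [KD.get(aa, 0.0) for aa in region]
    width = min(window, len(values))
    profile = [
        sum(values[i:i + width]) / width
        for i in range(len(values) - width + 1)
    ]
    features["hydro_mean"] = sum(profile) / len(profile)
    features["hydro_max"] = max(profile)
    return features


# Hypothetical N-terminal sequence, used only to exercise the function.
example = extract_features("MKLLLLAAVLLSAQAADKPTTEESK")
print(round(example["comp_L"], 3), round(example["hydro_max"], 2))
```

The other property-based features (charge, α-helix propensity, TM index, residue volume) would follow the same pattern with a different per-residue scale.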

Total features extracted: 29

Features selected for SVM (RF importance): 15
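
A minimal sketch of importance-based selection with a random forest follows. It assumes impurity-based importances from scikit-learn's `RandomForestClassifier`; the synthetic data, the number of trees, and the seed are placeholders, not the project's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 29-feature matrix (assumption: the real matrix
# would come from the feature extraction step). Class weights mimic the
# roughly 1:8 SP+/SP- ratio.
X, y = make_classification(
    n_samples=2000, n_features=29, n_informative=10,
    weights=[0.89, 0.11], random_state=0,
)

# Fit a random forest and rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]

# Keep the 15 highest-ranked columns.
top_k = 15
selected = np.sort(ranking[:top_k])
X_selected = X[:, selected]
print("selected feature indices:", selected.tolist())
print("reduced matrix shape:", X_selected.shape)
```

To avoid leaking benchmark information, the ranking would be computed on training data only and the same column indices then applied to the benchmark set.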


Model Training and Optimization

| Step | Method | Details / Parameters | Output |
| --- | --- | --- | --- |
| Model 1 | Von Heijne (rule-based) | Position-Specific Weight Matrix (PSWM), optimized threshold by MCC | Cleavage-site scoring function |
| Model 2 | SVM (RBF kernel) | C = 10, γ = 'scale', kernel = RBF; Stratified 5-fold CV | Trained classifier on 15 features |
| Model 3 | Deep Learning (ESM-2 + MLP) | hidden_size1 = 35, hidden_size2 = 35, hidden_size3 = 34; dropout = 0.11525485528721369; learning rate = 0.0001552002336009008; Optuna optimization | Trained neural network on ESM-2 embeddings |
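
The SVM configuration in the table (RBF kernel, C = 10, γ = 'scale', stratified 5-fold CV) might be set up along these lines. The data are synthetic, and the `StandardScaler` step is an assumption of this sketch; the table does not say whether features were scaled.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 15-feature data standing in for the selected features.
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=8,
    weights=[0.89, 0.11], random_state=0,
)

# Scaling inside the pipeline is refit on each training fold, so validation
# folds never influence the scaling statistics (assumed preprocessing step).
model = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=10, gamma="scale"),
)

# Stratified 5-fold cross-validation scored with MCC.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mcc_scores = cross_val_score(
    model, X, y, cv=cv, scoring=make_scorer(matthews_corrcoef)
)
print(f"MCC: {mcc_scores.mean():.3f} ± {mcc_scores.std():.3f}")
```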

Quantitative Performance

Internal Evaluation (Training / Validation Set)

| Metric | Von Heijne | SVM (RBF) | Deep Learning |
| --- | --- | --- | --- |
| Accuracy | 0.939 ± 0.002 | 0.927 | 0.995 |
| Precision | 0.708 ± 0.017 | 0.620 | 0.987 |
| Recall (TPR) | 0.756 ± 0.032 | 0.857 | 0.970 |
| F1-score | 0.728 ± 0.011 | 0.719 | 0.981 |
| MCC | 0.697 ± 0.013 | 0.691 | 0.978 |
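
The Von Heijne model relies on a score threshold optimized by MCC. One simple way to pick such a threshold on validation data is an exhaustive scan over the observed scores, sketched below. The function name and the synthetic scores are assumptions for illustration; the project's own search procedure may differ.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef


def best_threshold_by_mcc(scores, labels):
    """Return the score threshold that maximizes MCC, and that MCC.

    A sequence is predicted SP+ when its score is >= the threshold.
    """
    best_t, best_mcc = None, -1.0
    for t in np.unique(scores):
        predictions = (scores >= t).astype(int)
        mcc = matthews_corrcoef(labels, predictions)
        if mcc > best_mcc:
            best_t, best_mcc = float(t), float(mcc)
    return best_t, best_mcc


# Synthetic validation scores: positives tend to score higher (assumption).
rng = np.random.default_rng(0)
labels = np.array([1] * 100 + [0] * 800)
scores = np.concatenate([rng.normal(8, 2, 100), rng.normal(2, 2, 800)])
threshold, mcc = best_threshold_by_mcc(scores, labels)
print(f"threshold={threshold:.2f}, MCC={mcc:.3f}")
```

In a cross-validation setting the threshold would be chosen on a validation fold and then applied unchanged to the held-out fold, which gives per-fold metrics from which a mean ± standard deviation can be reported.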

External Evaluation (Independent Benchmark)

| Metric | Von Heijne | SVM (RBF) | Deep Learning |
| --- | --- | --- | --- |
| Accuracy | 0.930 | 0.922 | 0.990 |
| Precision | 0.665 | 0.594 | 0.971 |
| Recall (TPR) | 0.726 | 0.895 | 0.936 |
| F1-score | 0.694 | 0.714 | 0.953 |
| MCC | 0.656 | 0.690 | 0.948 |

Key Observations

| Aspect | Von Heijne | SVM | Deep Learning |
| --- | --- | --- | --- |
| False Positives (FP) | Hydrophobic TM helices (Metazoa bias) | Strongly reduced; fewer TM-related misclassifications | Minimal (6 FP); balanced error distribution |
| False Negatives (FN) | Short or polar SPs (<18 aa) | Borderline SPs with weak α-helix signals | Minimal (14 FN); robust to sequence variations |
| Motif capture | Conserved [A,V]XA cleavage motif | Broader tolerance to sequence variability | Automatic feature learning; no manual motif definition |
| SP mean length | 22.4 aa | 21.9 aa | No length bias detected |
| Interpretability | High (biological motifs visible) | Moderate (feature-dependent) | Lower (black-box) but superior performance |
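
The reported metrics can be related to the error counts with a short worked calculation. The sketch below assumes that the 6 FP and 14 FN listed for the deep learning model refer to the independent benchmark (219 SP⁺ and 1,787 SP⁻ sequences); the helper name `binary_metrics` is introduced here for illustration.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Assumption: the FP/FN counts for the deep learning model were measured on
# the independent benchmark set (219 SP+, 1,787 SP-).
n_pos, n_neg = 219, 1787
fp, fn = 6, 14
metrics = binary_metrics(tp=n_pos - fn, fp=fp, fn=fn, tn=n_neg - fp)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under that assumption the computed values agree with the Deep Learning column of the benchmark table to within rounding in the third decimal place.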

Final Summary Table

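| Dataset | Model | Accuracy | F1-score | MCC | Best For |
| --- | --- | --- | --- | --- | --- |
| Training / Validation | Von Heijne | 0.939 | 0.728 | 0.697 | Baseline biological interpretability |
| Training / Validation | SVM (RBF) | 0.927 | 0.719 | 0.691 | Pattern learning and discrimination |
| Training / Validation | Deep Learning | 0.995 | 0.981 | 0.978 | Maximum predictive performance |
| Benchmark (Independent) | Von Heijne | 0.930 | 0.694 | 0.656 | Motif-based baseline |
| Benchmark (Independent) | SVM (RBF) | 0.921 | 0.714 | 0.690 | Robust generalization |
| Benchmark (Independent) | Deep Learning | 0.990 | 0.953 | 0.948 | State-of-the-art classification |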

Conclusion:

The MLP leveraging ESM-2 embeddings outperforms both the SVM and rule-based models on all metrics, capturing canonical and atypical signal peptides with near-perfect accuracy and robust generalization.

The Von Heijne PSWM remains biologically interpretable and complements the MLP by providing motif-level insight into cleavage-site conservation.
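
To make the PSWM idea concrete, the sketch below builds a log-odds weight matrix from aligned cleavage-site windows and scores a sequence by its best-matching window. It is a simplified illustration, not the project's implementation: the toy windows are 5 residues long (von Heijne's matrix spans positions -13 to +2), the background is uniform rather than taken from SwissProt frequencies, and the function names, pseudocount, and search limit are assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def build_pswm(windows, background=None, pseudocount=1.0):
    """Log-odds weight matrix from equal-length cleavage-site windows."""
    length = len(windows[0])
    # Start from the pseudocount so unseen residues get a finite score.
    counts = np.full((length, len(AMINO_ACIDS)), pseudocount)
    for window in windows:
        for pos, aa in enumerate(window):
            counts[pos, AA_INDEX[aa]] += 1
    freqs = counts / counts.sum(axis=1, keepdims=True)
    if background is None:
        # Uniform background is a simplifying assumption of this sketch.
        background = np.full(len(AMINO_ACIDS), 1.0 / len(AMINO_ACIDS))
    return np.log2(freqs / background)


def window_score(window, pswm):
    """Sum of position-specific weights; non-standard residues add nothing."""
    return sum(
        pswm[pos, AA_INDEX[aa]]
        for pos, aa in enumerate(window)
        if aa in AA_INDEX
    )


def best_window_score(seq, pswm, search_limit=90):
    """Highest PSWM score over all windows in the N-terminal region."""
    width = pswm.shape[0]
    region = seq[:search_limit]
    scores = [
        window_score(region[start:start + width], pswm)
        for start in range(len(region) - width + 1)
    ]
    return max(scores) if scores else float("-inf")


# Toy windows ending in an A-X-A-like pattern (hypothetical data).
toy_windows = ["LLAVA", "LAASA", "VLAQA", "LLALA"]
pswm = build_pswm(toy_windows)
sp_like = best_window_score("MKLLLLAVAQDE", pswm)
non_sp = best_window_score("MDEKDEKDEKDE", pswm)
print(sp_like > non_sp)
```

A sequence is then classified as SP⁺ when its best window score reaches a chosen threshold, and the position of the best window gives the predicted cleavage site.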


Authors

This project has been developed by the following group members:


License

This project is released under the GPL-3.0 License.


Acknowledgements

This project is part of the Laboratory of Bioinformatics II course (University of Bologna, 2025). We would like to thank Professors Castrense Savojardo and Matteo Manfredi for their guidance, feedback and continuous support throughout the project.


References & Tools

Software stack

  • MMseqs2: clustering and redundancy reduction
  • Python 3: data preprocessing and analysis
  • Biopython: sequence handling, FASTA/TSV parsing, and biological data processing
  • scikit-learn (sklearn): machine learning framework (SVM, evaluation metrics, preprocessing)
  • NumPy: numerical computation and matrix operations
  • Seaborn: statistical data visualization
  • ProtScale (ExPASy): computation of physicochemical property scales (e.g. hydrophobicity)
  • AAindex: curated database of numerical indices describing the physicochemical and biochemical properties of amino acids
  • SwissProt statistics: summary of protein counts, taxonomy coverage, and annotation status in UniProtKB/SwissProt releases
  • WebLogo generator: tool for visualizing sequence motifs and residue conservation (used for cleavage site motif analysis)
  • PyTorch: deep learning framework for neural network modeling
  • Bash utils: quick FASTA/TSV operations
  • Jupyter / Google Colab: environment for interactive workflows
  • conda environment tools: package and environment management

Key references

  • UniProt Consortium (2023). UniProt: the Universal Protein Knowledgebase. Nucleic Acids Research.
  • von Heijne G. (1986). A new method for predicting signal sequence cleavage sites. Nucleic Acids Research.
  • Cortes C. & Vapnik V. (1995). Support-Vector Networks. Machine Learning, 20(3): 273–297.
  • Kyte J. & Doolittle R.F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol.
  • Chou P.Y. & Fasman G.D. (1978). Prediction of protein conformation. Biochemistry.
  • Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science.
  • scikit-learn SVM documentation
  • MLP-tensorflow
