- About
- Abstract
- Project Work Plan
- Installation Guide
- Repository Structure
- Results Summary
- Authors
- License
- Acknowledgements
- References & Tools
This repository contains the scripts, notebooks, and documentation for the Laboratory of Bioinformatics II project.
The project addresses the prediction of secretory signal peptides in eukaryotic proteins, a crucial step for protein function prediction and subcellular localization.
Predicting secretory signal peptides (SPs) is fundamental for understanding protein localization and function.
Traditional experimental methods are accurate but time-consuming, motivating the adoption of computational approaches.
This project compares three main strategies for SP prediction:
- Motif-based statistical approach (von Heijne, 1986)
- Support Vector Machine (SVM)-based classification (scikit-learn)
- Deep Learning approach (ESM-2 + MLP)
The aim is to evaluate their performance using curated datasets from UniProtKB, applying cross-validation and benchmarking on a blind test set.
| Task | Description |
|---|---|
| Retrieve datasets | Collect relevant protein datasets from UniProtKB |
| Preprocess datasets | Prepare data for cross-validation and benchmarking |
| Analyze statistics | Compute and visualize dataset statistics |
| Feature extraction | Extract relevant features for classification |
| von Heijne's algorithm | Implement cleavage site prediction method |
| SVM classifier | Train and test Support Vector Machine model |
| Evaluation | Assess methods with cross-validation and blind test set |
| Deep Learning approach | Implement ESM-2 embeddings + MLP classifier with Optuna optimization |
| Reporting | Discuss and interpret results |
| Manuscript | Prepare manuscript in scientific article format |
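The dataset-retrieval task in the plan above can be scripted against the UniProtKB REST API. The snippet below is a minimal sketch only: the query string, returned fields, and output file are illustrative assumptions, not the exact query used in `Data_Collection.ipynb`.

```python
import requests

# Hypothetical query: reviewed eukaryotic entries carrying a signal-peptide feature.
# The actual query used in Data_Collection.ipynb may differ.
URL = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "(reviewed:true) AND (taxonomy_id:2759) AND (ft_signal:*)",
    "fields": "accession,organism_name,length,ft_signal,sequence",
    "format": "tsv",
    "size": 500,  # page size; further pages are linked via the 'Link' response header
}

response = requests.get(URL, params=params, timeout=60)
response.raise_for_status()

with open("uniprot_signal_peptides.tsv", "w") as handle:
    handle.write(response.text)
```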
Python 3.8+ is required, along with the following main libraries:
numpy, pandas, scikit-learn, matplotlib, seaborn.
To set up the environment and run the project, follow these steps:

- Clone the repository

```bash
git clone https://github.com/Martinaa1408/LB2_project_Group_5.git
cd LB2_project_Group_5
```

- Create and activate a virtual environment

```bash
python -m venv env
source env/bin/activate  # On Windows use: env\Scripts\activate
```

- Install the required libraries

```bash
pip install numpy pandas scikit-learn matplotlib seaborn
```

- Run the main scripts and notebooks (see Repository Structure below)
- `Data_Collection/`
  Retrieval of raw datasets from UniProtKB.
  → Main notebook: `Data_Collection.ipynb`
- `Data_Preparation/`
  Redundancy reduction with MMseqs2 and generation of train/benchmark sets with CV folds (a minimal split/fold sketch follows this list).
  → Key scripts: `clustering.sh`, `get_tsv.py`, `get_sets.py`
- `Data_Analysis/`
  Exploratory analysis of the datasets (length distributions, amino acid composition, taxonomy, cleavage motifs).
  → Main notebook: `DataAnalysis.ipynb`
- `von_Heijne/`
  Implementation of the von Heijne (1986) statistical method for SP cleavage-site prediction.
  → Main notebook: `vonHeijne.ipynb`
- `Features_extraction/`
  Feature extraction from N-terminal regions (AA composition + physicochemical scales) and export of the feature matrix.
  → Main notebook: `Features_extraction_SVM.ipynb`
- `SVM_method/`
  Implementation of the Support Vector Machine (SVM) classifier with feature selection and hyperparameter optimization.
  → Main notebook: `SVM_Method.ipynb`
- `Evaluation_and_Comparison/`
  Comparative analysis of the SVM and von Heijne models on independent test sets, with performance metric visualization.
  → Main notebook: `Evaluation_and_Comparisons.ipynb`
- `Deep_Learning/`
  Multi-Layer Perceptron (MLP) classifier using transfer learning from pre-extracted ESM-2 protein embeddings, optimized with Optuna.
  → Main notebook: `DL_analysis.ipynb`
- `Supplementary_materials/`
  Additional resources, figures, and intermediate files supporting the main notebooks and report.
  → PDF documents: `Supplementary_materials_LB2_Group5.pdf`, `Manual_LB2_Group5.pdf`
- `Report/`
  PDF version of the final project report.
  → PDF document: `report_LB2_Group5.pdf`
- `README.md`
  General project overview and workflow description.
- `LICENSE`
  Open-source license (GPL-3.0).
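As referenced in the Data_Preparation entry above, the 80/20 train/benchmark split and the cross-validation folds can be generated along these lines. This is a minimal sketch under assumed column and file names; the actual logic lives in `get_sets.py`.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical input: one row per representative sequence after MMseqs2 clustering,
# with a binary 'label' column (1 = signal peptide, 0 = negative).
df = pd.read_csv("representatives.tsv", sep="\t")

# 80% training / 20% independent benchmark, stratified by class.
train_df, bench_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# Assign a cross-validation fold index (0-4) to every training sequence.
train_df = train_df.reset_index(drop=True)
train_df["fold"] = -1
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, val_idx) in enumerate(skf.split(train_df, train_df["label"])):
    train_df.loc[val_idx, "fold"] = fold

train_df.to_csv("training_set.tsv", sep="\t", index=False)
bench_df.to_csv("benchmark_set.tsv", sep="\t", index=False)
```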
This section summarizes the core data, feature extraction, and predictive performance of the three implemented models: Von Heijne (rule-based), SVM (RBF kernel), and Deep Learning (ESM-2 + MLP).
| Phase | Description | SP⁺ | SP⁻ | Total | Notes |
|---|---|---|---|---|---|
| Raw UniProtKB | Manually reviewed eukaryotic proteins with signal peptide annotation | 2,932 | 20,615 | 23,547 | Experimental annotations only |
| After MMseqs2 clustering (30% ID) | Non-redundant representative sequences | 1,093 | 8,934 | 10,027 | Removes homolog redundancy |
| Final dataset split | 80% training + 20% independent benchmark | 874 (train) / 219 (bench) | 7,147 (train) / 1,787 (bench) | 8,021 (train) / 2,006 (bench) | Balanced and taxonomically representative |
| Feature Category | Description | Example Features | Count |
|---|---|---|---|
| Amino acid composition | Residue frequencies in N-terminal region (–30 to +2 aa) | comp_L, comp_A, comp_V | 19 |
| Hydrophobicity | Kyte–Doolittle mean & max | hydro_mean, hydro_max | 2 |
| Charge distribution | Mean charge, max charge | charge_mean, charge_max | 2 |
| Secondary structure | α-helix propensity (Chou–Fasman scale) | alpha_mean, alpha_max | 2 |
| Transmembrane propensity | Mean & max TM index | trans_mean, trans_max | 2 |
| Residue size/volume | Mean and maximum residue volume | size_mean, size_max | 2 |
Total features extracted: 29
Features selected for SVM (RF importance): 15
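To make the feature table above concrete, the sketch below computes the composition and hydrophobicity features for an N-terminal window in plain Python. The window length, scale handling, and feature names are assumptions; the remaining categories (charge, helix propensity, TM index, residue volume) follow the same mean/max pattern with their respective scales, and the exact implementation is in `Features_extraction_SVM.ipynb`.

```python
import numpy as np

# Kyte-Doolittle hydropathy scale (published values).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}
AMINO_ACIDS = sorted(KD)

def nterm_features(sequence: str, n_term: int = 32, window: int = 7) -> dict:
    """Composition and hydrophobicity features for the first `n_term` residues."""
    region = sequence[:n_term]
    feats = {f"comp_{aa}": region.count(aa) / len(region) for aa in AMINO_ACIDS}
    kd_values = [KD[aa] for aa in region if aa in KD]
    feats["hydro_mean"] = float(np.mean(kd_values))
    # Max of a sliding-window average approximates the hydrophobic h-region signal.
    means = [np.mean(kd_values[i:i + window])
             for i in range(len(kd_values) - window + 1)]
    feats["hydro_max"] = float(max(means)) if means else feats["hydro_mean"]
    return feats

print(nterm_features("MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGE"))
```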
| Step | Method | Details / Parameters | Output |
|---|---|---|---|
| Model 1 | Von Heijne (rule-based) | Position-Specific Weight Matrix (PSWM), threshold optimized by MCC | Cleavage-site scoring function |
| Model 2 | SVM (RBF kernel) | C = 10, γ = 'scale', kernel = RBF; stratified 5-fold CV | Trained classifier on 15 features |
| Model 3 | Deep Learning (ESM-2 + MLP) | Hidden sizes 35/35/34, dropout ≈ 0.115, learning rate ≈ 1.55 × 10⁻⁴; Optuna optimization | Trained neural network on ESM-2 embeddings |
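For Model 1, the PSWM logic can be sketched as follows. The uniform background frequency, pseudocount, and window handling are simplifying assumptions (the original von Heijne method aligns a fixed window around the cleavage site, and the project optimizes the decision threshold by MCC); the actual implementation is in `vonHeijne.ipynb`.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def build_pswm(windows, background=0.05, pseudocount=1.0):
    """Log-odds position-specific weight matrix from aligned cleavage-site windows."""
    length = len(windows[0])
    counts = np.full((20, length), pseudocount)
    for window in windows:
        for pos, aa in enumerate(window):
            counts[AA_INDEX[aa], pos] += 1
    freqs = counts / counts.sum(axis=0, keepdims=True)
    return np.log2(freqs / background)

def best_window_score(pswm, sequence):
    """Slide the PSWM along the N-terminus and return (best position, best score)."""
    length = pswm.shape[1]
    scores = [sum(pswm[AA_INDEX[aa], pos]
                  for pos, aa in enumerate(sequence[i:i + length]) if aa in AA_INDEX)
              for i in range(len(sequence) - length + 1)]
    return int(np.argmax(scores)), float(max(scores))
```

A sequence is then predicted as signal-peptide-positive when its best window score exceeds the MCC-optimized threshold.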
| Metric | Von Heijne | SVM (RBF) | Deep Learning |
|---|---|---|---|
| Accuracy | 0.939 ± 0.002 | 0.927 | 0.995 |
| Precision | 0.708 ± 0.017 | 0.620 | 0.987 |
| Recall (TPR) | 0.756 ± 0.032 | 0.857 | 0.970 |
| F1-score | 0.728 ± 0.011 | 0.719 | 0.981 |
| MCC | 0.697 ± 0.013 | 0.691 | 0.978 |
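The cross-validated SVM figures above could in principle be reproduced with a pipeline like the one below. The feature matrix `X`, label vector `y` (NumPy arrays), and the standardization step are assumptions; `C = 10`, the RBF kernel, `gamma='scale'`, and the stratified 5-fold scheme match the parameters reported above.

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cross_validate_svm(X, y, n_splits=5, seed=42):
    """Stratified k-fold CV of the RBF SVM; returns mean/std of MCC and F1."""
    model = make_pipeline(StandardScaler(), SVC(C=10, kernel="rbf", gamma="scale"))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mcc, f1 = [], []
    for train_idx, val_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        mcc.append(matthews_corrcoef(y[val_idx], pred))
        f1.append(f1_score(y[val_idx], pred))
    return (np.mean(mcc), np.std(mcc)), (np.mean(f1), np.std(f1))
```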
| Metric | Von Heijne | SVM (RBF) | Deep Learning |
|---|---|---|---|
| Accuracy | 0.930 | 0.922 | 0.990 |
| Precision | 0.665 | 0.594 | 0.971 |
| Recall (TPR) | 0.726 | 0.895 | 0.936 |
| F1-score | 0.694 | 0.714 | 0.953 |
| MCC | 0.656 | 0.690 | 0.948 |
| Aspect | Von Heijne | SVM | Deep Learning |
|---|---|---|---|
| False Positives (FP) | Hydrophobic TM helices (Metazoa bias) | Strongly reduced; fewer TM-related misclassifications | Minimal (6 FP); balanced error distribution |
| False Negatives (FN) | Short or polar SPs (<18 aa) | Borderline SPs with weak α-helix signals | Minimal (14 FN); robust to sequence variations |
| Motif capture | Conserved [A,V]XA cleavage motif | Broader tolerance to sequence variability | Automatic feature learning; no manual motif definition |
| SP mean length | 22.4 aa | 21.9 aa | No length bias detected |
| Interpretability | High (biological motifs visible) | Moderate (feature-dependent) | Lower (black-box) but superior performance |
| Dataset | Model | Accuracy | F1-score | MCC | Best For |
|---|---|---|---|---|---|
| Training / Validation | Von Heijne | 0.939 | 0.728 | 0.697 | Baseline biological interpretability |
| | SVM (RBF) | 0.927 | 0.719 | 0.691 | Pattern learning and discrimination |
| | Deep Learning | 0.995 | 0.981 | 0.978 | Maximum predictive performance |
| Benchmark (Independent) | Von Heijne | 0.930 | 0.694 | 0.656 | Motif-based baseline |
| | SVM (RBF) | 0.921 | 0.714 | 0.690 | Robust generalization |
| | Deep Learning | 0.990 | 0.953 | 0.948 | State-of-the-art classification |
The MLP leveraging ESM-2 embeddings outperforms both the SVM and rule-based models on all metrics, capturing canonical and atypical signal peptides with near-perfect accuracy and robust generalization.
The Von Heijne PSWM remains biologically interpretable and complements the MLP by providing motif-level insight into cleavage-site conservation.
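As a reference for Model 3, the sketch below shows a PyTorch MLP with the Optuna-selected hidden sizes, dropout, and learning rate reported above. The 320-dimensional input (mean-pooled embeddings from the smallest ESM-2 checkpoint), the Adam optimizer, and the binary cross-entropy loss are assumptions; the full training loop is in `DL_analysis.ipynb`.

```python
import torch
import torch.nn as nn

class SignalPeptideMLP(nn.Module):
    """Three-hidden-layer MLP over pre-extracted ESM-2 embeddings (binary output)."""
    def __init__(self, embed_dim=320, hidden=(35, 35, 34), dropout=0.115):
        super().__init__()
        layers, in_dim = [], embed_dim
        for size in hidden:
            layers += [nn.Linear(in_dim, size), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = size
        layers.append(nn.Linear(in_dim, 1))  # single logit for BCEWithLogitsLoss
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = SignalPeptideMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1.55e-4)
criterion = nn.BCEWithLogitsLoss()

# One illustrative training step on random tensors standing in for real embeddings.
embeddings = torch.randn(16, 320)
labels = torch.randint(0, 2, (16,)).float()
optimizer.zero_grad()
loss = criterion(model(embeddings), labels)
loss.backward()
optimizer.step()
```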
This project has been developed by the following group members:
- Martina Castellucci – @Martinaa1408
- Alessia Corica – @alessia-corica
- Anna Rossi – @AnnaRossi01
- Sofia Natale – @sofianatale
This project is released under the GPL-3.0 License.
This project is part of the Laboratory of Bioinformatics II course (University of Bologna, 2025). We would like to thank Professors Castrense Savojardo and Matteo Manfredi for their guidance, feedback and continuous support throughout the project.
- MMseqs2 — clustering and redundancy reduction
- Python 3 — data preprocessing and analysis
- Biopython — sequence handling, FASTA/TSV parsing, and biological data processing
- scikit-learn (sklearn) — machine learning framework (SVM, evaluation metrics, preprocessing)
- NumPy — numerical computation and matrix operations
- Seaborn — statistical data visualization
- ProtScale (ExPASy) — computation of physicochemical property scales (e.g. hydrophobicity)
- AAindex — curated database of numerical indices describing the physicochemical and biochemical properties of amino acids
- SwissProt statistics — summary of protein counts, taxonomy coverage, and annotation status in UniProtKB/SwissProt releases.
- WebLogo generator — tool for visualizing sequence motifs and residue conservation (used for cleavage site motif analysis).
- PyTorch — deep learning framework for neural network modeling.
- Bash utils — quick FASTA/TSV operations
- Jupyter / Google Colab — environment for interactive workflows
- conda environment tools — package and environment management
- UniProt Consortium (2023). UniProt: the Universal Protein Knowledgebase. Nucleic Acids Research.
- von Heijne G. (1986). A new method for predicting signal sequence cleavage sites. Nucleic Acids Research.
- Cortes C. & Vapnik V. (1995). Support-Vector Networks. Machine Learning, 20(3): 273–297.
- Kyte J. & Doolittle R.F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol.
- Chou P.Y. & Fasman G.D. (1978). Prediction of protein conformation. Biochemistry.
- Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science.
- scikit-learn SVM documentation
- MLP-tensorflow