- About
- Abstract
- Project Work Plan
- Installation Guide
- Repository Structure
- Results Summary
- Authors
- License
- Acknowledgements
- References & Tools
This repository contains the scripts, notebooks, and documentation for the Laboratory of Bioinformatics II project.
The project addresses the prediction of secretory signal peptides in eukaryotic proteins, a crucial step for protein function prediction and subcellular localization.
Predicting secretory signal peptides (SPs) is fundamental for understanding protein localization and function.
Traditional experimental methods are accurate but time-consuming, motivating the adoption of computational approaches.
This project compares three main strategies for SP prediction:
- Motif-based statistical approach (von Heijne, 1986)
- Support Vector Machine (SVM)-based classification (scikit-learn)
- Deep Learning approach (ESM-2 + MLP)
The aim is to evaluate their performance using curated datasets from UniProtKB, applying cross-validation and benchmarking on a blind test set.
| Task | Description |
|---|---|
| Retrieve datasets | Collect relevant protein datasets from UniProtKB |
| Preprocess datasets | Prepare data for cross-validation and benchmarking |
| Analyze statistics | Compute and visualize dataset statistics |
| Feature extraction | Extract relevant features for classification |
| von Heijne's algorithm | Implement cleavage site prediction method |
| SVM classifier | Train and test Support Vector Machine model |
| Evaluation | Assess methods with cross-validation and blind test set |
| Deep Learning approach | Implement ESM-2 embeddings + MLP classifier with Optuna optimization |
| Reporting | Discuss and interpret results |
| Manuscript | Prepare manuscript in scientific article format |
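The dataset-retrieval task in the plan above can be scripted against the UniProtKB REST API. The snippet below is a minimal sketch only: the query string, returned fields, and output file are illustrative assumptions, not the exact query used in `Data_Collection.ipynb`.

```python
import requests

# Hypothetical query: reviewed eukaryotic entries carrying a signal-peptide feature.
# The actual query used in Data_Collection.ipynb may differ.
URL = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "(reviewed:true) AND (taxonomy_id:2759) AND (ft_signal:*)",
    "fields": "accession,organism_name,length,ft_signal,sequence",
    "format": "tsv",
    "size": 500,  # page size; further pages are linked via the 'Link' response header
}

response = requests.get(URL, params=params, timeout=60)
response.raise_for_status()

with open("uniprot_signal_peptides.tsv", "w") as handle:
    handle.write(response.text)
```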
Python 3.8+ is required, along with the following main libraries:
numpy, pandas, scikit-learn, matplotlib, seaborn.
To set up the environment and run the project, follow these steps:

- Clone the repository

```bash
git clone https://github.com/Martinaa1408/LB2_project_Group_5.git
cd LB2_project_Group_5
```

- Create and activate a virtual environment

```bash
python -m venv env
source env/bin/activate  # On Windows use: env\Scripts\activate
```

- Install the required libraries

```bash
pip install numpy pandas scikit-learn matplotlib seaborn
```

- Run the main scripts and notebooks (see Repository Structure below)
- `Data_Collection/`
  Retrieval of raw datasets from UniProtKB.
  → Main notebook: `Data_Collection.ipynb`
- `Data_Preparation/`
  Redundancy reduction with MMseqs2 and generation of train/benchmark sets with CV folds (a minimal split/fold sketch follows this list).
  → Key scripts: `clustering.sh`, `get_tsv.py`, `get_sets.py`
- `Data_Analysis/`
  Exploratory analysis of the datasets (length distributions, amino acid composition, taxonomy, cleavage motifs).
  → Main notebook: `DataAnalysis.ipynb`
- `von_Heijne/`
  Implementation of the von Heijne (1986) statistical method for SP cleavage-site prediction.
  → Main notebook: `vonHeijne.ipynb`
- `Features_extraction/`
  Feature extraction from N-terminal regions (AA composition + physicochemical scales) and export of the feature matrix.
  → Main notebook: `Features_extraction_SVM.ipynb`
- `SVM_method/`
  Implementation of the Support Vector Machine (SVM) classifier with feature selection and hyperparameter optimization.
  → Main notebook: `SVM_Method.ipynb`
- `Evaluation_and_Comparison/`
  Comparative analysis of the SVM and von Heijne models on independent test sets, with performance metric visualization.
  → Main notebook: `Evaluation_and_Comparisons.ipynb`
- `Deep_Learning/`
  Multi-Layer Perceptron (MLP) classifier using transfer learning from pre-extracted ESM-2 protein embeddings, optimized with Optuna.
  → Main notebook: `DL_analysis.ipynb`
- `Supplementary_materials/`
  Additional resources, figures, and intermediate files supporting the main notebooks and report.
  → PDF documents: `Supplementary_materials_LB2_Group5.pdf`, `Manual_LB2_Group5.pdf`
- `Report/`
  PDF version of the final project report.
  → PDF document: `report_LB2_Group5.pdf`
- `README.md`
  General project overview and workflow description.
- `LICENSE`
  Open-source license (GPL-3.0).
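As referenced in the Data_Preparation entry above, the 80/20 train/benchmark split and the cross-validation folds can be generated along these lines. This is a minimal sketch under assumed column and file names; the actual logic lives in `get_sets.py`.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical input: one row per representative sequence after MMseqs2 clustering,
# with a binary 'label' column (1 = signal peptide, 0 = negative).
df = pd.read_csv("representatives.tsv", sep="\t")

# 80% training / 20% independent benchmark, stratified by class.
train_df, bench_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# Assign a cross-validation fold index (0-4) to every training sequence.
train_df = train_df.reset_index(drop=True)
train_df["fold"] = -1
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, val_idx) in enumerate(skf.split(train_df, train_df["label"])):
    train_df.loc[val_idx, "fold"] = fold

train_df.to_csv("training_set.tsv", sep="\t", index=False)
bench_df.to_csv("benchmark_set.tsv", sep="\t", index=False)
```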
This section summarizes the core data, feature extraction, and predictive performance of the three implemented models: Von Heijne (rule-based), SVM (RBF kernel), and Deep Learning (ESM-2 + MLP).
| Phase | Description | SP⁺ | SP⁻ | Total | Notes |
|---|---|---|---|---|---|
| Raw UniProtKB | Manually reviewed eukaryotic proteins with signal peptide annotation | 2,932 | 20,615 | 23,547 | Experimental annotations only |
| After MMseqs2 clustering (30% ID) | Non-redundant representative sequences | 1,093 | 8,934 | 10,027 | Removes homolog redundancy |
| Final dataset split | 80% training + 20% independent benchmark | 874 (train) / 219 (bench) | 7,147 (train) / 1,787 (bench) | 8,021 (train) / 2,006 (bench) | Balanced and taxonomically representative |
| Feature Category | Description | Example Features | Count |
|---|---|---|---|
| Amino acid composition | Residue frequencies in N-terminal region (–30 to +2 aa) | comp_L, comp_A, comp_V | 19 |
| Hydrophobicity | Kyte–Doolittle mean & max | hydro_mean, hydro_max | 2 |
| Charge distribution | Mean charge, max charge | charge_mean, charge_max | 2 |
| Secondary structure | α-helix propensity (Chou–Fasman scale) | alpha_mean, alpha_max | 2 |
| Transmembrane propensity | Mean & max TM index | trans_mean, trans_max | 2 |
| Residue size/volume | Mean and maximum residue volume | size_mean, size_max | 2 |
Total features extracted: 29
Features selected for SVM (RF importance): 15
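To make the feature table above concrete, the sketch below computes the composition and hydrophobicity features for an N-terminal window in plain Python. The window length, scale handling, and feature names are assumptions; the remaining categories (charge, helix propensity, TM index, residue volume) follow the same mean/max pattern with their respective scales, and the exact implementation is in `Features_extraction_SVM.ipynb`.

```python
import numpy as np

# Kyte-Doolittle hydropathy scale (published values).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}
AMINO_ACIDS = sorted(KD)

def nterm_features(sequence: str, n_term: int = 32, window: int = 7) -> dict:
    """Composition and hydrophobicity features for the first `n_term` residues."""
    region = sequence[:n_term]
    feats = {f"comp_{aa}": region.count(aa) / len(region) for aa in AMINO_ACIDS}
    kd_values = [KD[aa] for aa in region if aa in KD]
    feats["hydro_mean"] = float(np.mean(kd_values))
    # Max of a sliding-window average approximates the hydrophobic h-region signal.
    means = [np.mean(kd_values[i:i + window])
             for i in range(len(kd_values) - window + 1)]
    feats["hydro_max"] = float(max(means)) if means else feats["hydro_mean"]
    return feats

print(nterm_features("MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGE"))
```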
| Step | Method | Details / Parameters | Output |
|---|---|---|---|
| Model 1 | Von Heijne (rule-based) | Position-Specific Weight Matrix (PSWM), threshold optimized by MCC | Cleavage-site scoring function |
| Model 2 | SVM (RBF kernel) | C = 10, γ = 'scale', kernel = RBF; stratified 5-fold CV | Trained classifier on 15 features |
| Model 3 | Deep Learning (ESM-2 + MLP) | Hidden sizes 35/35/34, dropout ≈ 0.115, learning rate ≈ 1.55 × 10⁻⁴; Optuna optimization | Trained neural network on ESM-2 embeddings |
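For Model 1, the PSWM logic can be sketched as follows. The uniform background frequency, pseudocount, and window handling are simplifying assumptions (the original von Heijne method aligns a fixed window around the cleavage site, and the project optimizes the decision threshold by MCC); the actual implementation is in `vonHeijne.ipynb`.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def build_pswm(windows, background=0.05, pseudocount=1.0):
    """Log-odds position-specific weight matrix from aligned cleavage-site windows."""
    length = len(windows[0])
    counts = np.full((20, length), pseudocount)
    for window in windows:
        for pos, aa in enumerate(window):
            counts[AA_INDEX[aa], pos] += 1
    freqs = counts / counts.sum(axis=0, keepdims=True)
    return np.log2(freqs / background)

def best_window_score(pswm, sequence):
    """Slide the PSWM along the N-terminus and return (best position, best score)."""
    length = pswm.shape[1]
    scores = [sum(pswm[AA_INDEX[aa], pos]
                  for pos, aa in enumerate(sequence[i:i + length]) if aa in AA_INDEX)
              for i in range(len(sequence) - length + 1)]
    return int(np.argmax(scores)), float(max(scores))
```

A sequence is then predicted as signal-peptide-positive when its best window score exceeds the MCC-optimized threshold.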
| Metric | Von Heijne | SVM (RBF) | Deep Learning |
|---|---|---|---|
| Accuracy | 0.939 ± 0.002 | 0.927 | 0.995 |
| Precision | 0.708 ± 0.017 | 0.620 | 0.987 |
| Recall (TPR) | 0.756 ± 0.032 | 0.857 | 0.970 |
| F1-score | 0.728 ± 0.011 | 0.719 | 0.981 |
| MCC | 0.697 ± 0.013 | 0.691 | 0.978 |
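The cross-validated SVM figures above could in principle be reproduced with a pipeline like the one below. The feature matrix `X`, label vector `y` (NumPy arrays), and the standardization step are assumptions; `C = 10`, the RBF kernel, `gamma='scale'`, and the stratified 5-fold scheme match the parameters reported above.

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cross_validate_svm(X, y, n_splits=5, seed=42):
    """Stratified k-fold CV of the RBF SVM; returns mean/std of MCC and F1."""
    model = make_pipeline(StandardScaler(), SVC(C=10, kernel="rbf", gamma="scale"))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mcc, f1 = [], []
    for train_idx, val_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        mcc.append(matthews_corrcoef(y[val_idx], pred))
        f1.append(f1_score(y[val_idx], pred))
    return (np.mean(mcc), np.std(mcc)), (np.mean(f1), np.std(f1))
```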
| Metric | Von Heijne | SVM (RBF) | Deep Learning |
|---|---|---|---|
| Accuracy | 0.930 | 0.922 | 0.990 |
| Precision | 0.665 | 0.594 | 0.971 |
| Recall (TPR) | 0.726 | 0.895 | 0.936 |
| F1-score | 0.694 | 0.714 | 0.953 |
| MCC | 0.656 | 0.690 | 0.948 |
| Aspect | Von Heijne | SVM | Deep Learning |
|---|---|---|---|
| False Positives (FP) | Hydrophobic TM helices (Metazoa bias) | Strongly reduced; fewer TM-related misclassifications | Minimal (6 FP); balanced error distribution |
| False Negatives (FN) | Short or polar SPs (<18 aa) | Borderline SPs with weak α-helix signals | Minimal (14 FN); robust to sequence variations |
| Motif capture | Conserved [A,V]XA cleavage motif | Broader tolerance to sequence variability | Automatic feature learning; no manual motif definition |
| SP mean length | 22.4 aa | 21.9 aa | No length bias detected |
| Interpretability | High (biological motifs visible) | Moderate (feature-dependent) | Lower (black-box) but superior performance |
| Dataset | Model | Accuracy | F1-score | MCC | Best For |
|---|---|---|---|---|---|
| Training / Validation | Von Heijne | 0.939 | 0.728 | 0.697 | Baseline biological interpretability |
| | SVM (RBF) | 0.927 | 0.719 | 0.691 | Pattern learning and discrimination |
| | Deep Learning | 0.995 | 0.981 | 0.978 | Maximum predictive performance |
| Benchmark (Independent) | Von Heijne | 0.930 | 0.694 | 0.656 | Motif-based baseline |
| | SVM (RBF) | 0.921 | 0.714 | 0.690 | Robust generalization |
| | Deep Learning | 0.990 | 0.953 | 0.948 | State-of-the-art classification |
The MLP leveraging ESM-2 embeddings outperforms both the SVM and rule-based models on all metrics, capturing canonical and atypical signal peptides with near-perfect accuracy and robust generalization.
The Von Heijne PSWM remains biologically interpretable and complements the MLP by providing motif-level insight into cleavage-site conservation.
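As a reference for Model 3, the sketch below shows a PyTorch MLP with the Optuna-selected hidden sizes, dropout, and learning rate reported above. The 320-dimensional input (mean-pooled embeddings from the smallest ESM-2 checkpoint), the Adam optimizer, and the binary cross-entropy loss are assumptions; the full training loop is in `DL_analysis.ipynb`.

```python
import torch
import torch.nn as nn

class SignalPeptideMLP(nn.Module):
    """Three-hidden-layer MLP over pre-extracted ESM-2 embeddings (binary output)."""
    def __init__(self, embed_dim=320, hidden=(35, 35, 34), dropout=0.115):
        super().__init__()
        layers, in_dim = [], embed_dim
        for size in hidden:
            layers += [nn.Linear(in_dim, size), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = size
        layers.append(nn.Linear(in_dim, 1))  # single logit for BCEWithLogitsLoss
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = SignalPeptideMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1.55e-4)
criterion = nn.BCEWithLogitsLoss()

# One illustrative training step on random tensors standing in for real embeddings.
embeddings = torch.randn(16, 320)
labels = torch.randint(0, 2, (16,)).float()
optimizer.zero_grad()
loss = criterion(model(embeddings), labels)
loss.backward()
optimizer.step()
```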
This project has been developed by the following group members:
- Martina Castellucci – @Martinaa1408
- Alessia Corica – @alessia-corica
- Anna Rossi – @AnnaRossi01
- Sofia Natale – @sofianatale
This project is released under the GPL-3.0 License.
This project is part of the Laboratory of Bioinformatics II course (University of Bologna, 2025). We would like to thank Professors Castrense Savojardo and Matteo Manfredi for their guidance, feedback and continuous support throughout the project.
- MMseqs2 — clustering and redundancy reduction
- Python 3 — data preprocessing and analysis
- Biopython — sequence handling, FASTA/TSV parsing, and biological data processing
- scikit-learn (sklearn) — machine learning framework (SVM, evaluation metrics, preprocessing)
- NumPy — numerical computation and matrix operations
- Seaborn — statistical data visualization
- ProtScale (ExPASy) — computation of physicochemical property scales (e.g. hydrophobicity)
- AAindex — curated database of numerical indices describing the physicochemical and biochemical properties of amino acids
- SwissProt statistics — summary of protein counts, taxonomy coverage, and annotation status in UniProtKB/SwissProt releases.
- WebLogo generator — tool for visualizing sequence motifs and residue conservation (used for cleavage site motif analysis).
- PyTorch — deep learning framework for neural network modeling.
- Bash utils — quick FASTA/TSV operations
- Jupyter / Google Colab — environment for interactive workflows
- conda environment tools — package and environment management
- UniProt Consortium (2023). UniProt: the Universal Protein Knowledgebase. Nucleic Acids Research.
- von Heijne G. (1986). A new method for predicting signal sequence cleavage sites. Nucleic Acids Research.
- Cortes C. & Vapnik V. (1995). Support-Vector Networks. Machine Learning, 20(3): 273–297.
- Kyte J. & Doolittle R.F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol.
- Chou P.Y. & Fasman G.D. (1978). Prediction of protein conformation. Biochemistry.
- Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science.
- scikit-learn SVM documentation
- MLP-tensorflow