Classical machine learning techniques applied to the problem of protein function prediction based solely on amino acid sequences. The task is formulated as a supervised multi-class classification problem, using functional annotations derived from the UniProt and Gene Ontology databases.

athos-innocenti/ProteinFunctionPrediction

Machine Learning Techniques for Protein Function Prediction

This project investigates the application of classical machine learning techniques to the problem of protein function prediction based solely on amino acid sequences. The task is formulated as a supervised multi-class classification problem, using functional annotations derived from the UniProt and Gene Ontology (GO) databases.

Project Overview

Proteins play a fundamental role in biological systems, and understanding their molecular functions is a central challenge in bioinformatics. While experimental annotation is accurate, it is costly and time-consuming, motivating the use of computational approaches.

In this project, different sequence encoding strategies and machine learning models are compared to assess how feature representation impacts predictive performance. Particular attention is given to handling high-dimensional biological data and class imbalance, both of which are common in large-scale protein datasets.

Dataset

  • Source: UniProt
  • Number of entries: ~570,000 proteins
  • Fields used:
    • Sequence: amino acid sequence of the protein
    • Gene Ontology IDs: molecular function annotations

Functional Classes

Protein functions are derived from Gene Ontology (GO) Molecular Function terms. Since proteins are typically annotated with multiple fine-grained GO terms, a preprocessing step projects these annotations onto a reduced set of 10 biologically meaningful macro-classes using a GO-Slim mapping and the GO Directed Acyclic Graph (DAG).

Each protein is assigned exactly one functional label, so the task reduces to single-label, multi-class classification.
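The projection onto macro-classes can be pictured as an upward walk through the GO DAG until a GO-Slim term is reached. The sketch below is only an illustration under stated assumptions: the term IDs, the `PARENTS` edges, the `SLIM` mapping, and the class names are hypothetical toy data, not the project's actual mapping. The README does not say how a single label is chosen when a protein reaches several macro-classes, so the tie-break in `single_label` is a placeholder.

```python
from collections import deque

# Hypothetical toy DAG: child term -> list of parent terms (is_a edges).
PARENTS = {
    "GO:A2": ["GO:A1"],
    "GO:A1": ["GO:A"],
    "GO:B1": ["GO:B"],
    "GO:AB": ["GO:A1", "GO:B1"],  # a term with two parents, as a DAG allows
    "GO:A": ["GO:ROOT"],
    "GO:B": ["GO:ROOT"],
}

# Hypothetical GO-Slim terms and the macro-class each one stands for.
SLIM = {"GO:A": "class_a", "GO:B": "class_b"}


def ancestors(term):
    """Return the term itself plus every ancestor reachable via parent edges."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def macro_classes(go_terms):
    """Collect every macro-class reachable from any of a protein's GO terms."""
    hits = set()
    for term in go_terms:
        for anc in ancestors(term):
            if anc in SLIM:
                hits.add(SLIM[anc])
    return hits


def single_label(go_terms):
    """Assumed tie-break: alphabetically first macro-class, or None if none is reached."""
    hits = macro_classes(go_terms)
    return min(hits) if hits else None
```

In this toy graph, a protein annotated only with `GO:A2` maps to `class_a`, while one annotated with `GO:AB` reaches both macro-classes and needs the tie-break.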

Feature Representation

Several encoding strategies are explored:

1. Label Encoding (Baseline)

  • Each protein sequence is mapped to a unique integer identifier.
  • Serves as a computational baseline.
  • Does not preserve biological or sequential information.
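A minimal sketch of this baseline, assuming scikit-learn's `LabelEncoder` is used to assign the integer identifiers (the README does not name the exact tool). The sequences are made-up examples.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up example sequences; the first and last are identical.
sequences = ["MKTAYIAK", "MKTAYIAR", "GAVLIMKT", "MKTAYIAK"]

# Each distinct sequence string receives one integer identifier.
encoder = LabelEncoder()
ids = encoder.fit_transform(sequences)

# Classifiers expect a 2-D feature matrix, so the IDs become a single column.
X_baseline = ids.reshape(-1, 1)
```

Identical sequences share an ID, but two sequences that differ by a single residue receive unrelated integers, which is why this encoding carries no biological or sequential information.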

2. k-mer Encoding

  • Protein sequences are decomposed into overlapping substrings of length k (here k = 3, i.e. 3-mers).
  • Each sequence is represented as a sparse frequency vector.
  • L1 normalization is applied to account for variable sequence lengths.
  • Captures local sequence patterns relevant to protein function.
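The k-mer encoding described above can be sketched as follows. This assumes scikit-learn's `CountVectorizer` with character n-grams as the k-mer counter, which is one common way to do it and not necessarily the project's implementation; the sequences are made-up examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Made-up example sequences of different lengths.
sequences = ["MKTAYIAKQR", "MKTAYIAK", "GAVLIMKTGAVL"]

# Character 3-grams are exactly the overlapping 3-mers of each sequence.
# lowercase=False keeps the amino acid letters as written.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False)
counts = vectorizer.fit_transform(sequences)  # sparse matrix of raw 3-mer counts

# L1 normalization turns counts into relative frequencies, so each row sums to 1
# regardless of sequence length.
X_kmer = normalize(counts, norm="l1", axis=1)
```

With the 20 standard amino acids there are at most 20^3 = 8,000 distinct 3-mers, so the matrix stays sparse: a sequence of length n contributes only n - 2 overlapping 3-mers.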

3. Feature Engineering on k-mers

Additional biologically motivated features are introduced:

  • TF-IDF on k-mers: emphasizes discriminative sequence motifs.
  • Amino Acid Composition (AAC): global frequency of the 20 standard amino acids.
  • Dipeptide Composition (DPC): frequency of each of the 400 ordered pairs of adjacent amino acids, capturing short-range order.

These features are concatenated into a high-dimensional sparse representation.
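One possible way to build and concatenate these three feature blocks is sketched below. It assumes scikit-learn vectorizers for all three and `scipy.sparse.hstack` for the concatenation; the fixed vocabularies, variable names, and example sequences are illustrative choices, not taken from the project code.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
DIPEPTIDES = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]  # 400 ordered pairs

# Made-up example sequences containing only standard residues.
sequences = ["MKTAYIAKQR", "MKTAYIAK", "GAVLIMKTGAVL"]

# TF-IDF over 3-mers: down-weights 3-mers that occur in many sequences.
tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False)
X_tfidf = tfidf.fit_transform(sequences)

# AAC: relative frequency of each amino acid (fixed 20-column vocabulary).
aac = CountVectorizer(analyzer="char", ngram_range=(1, 1), lowercase=False,
                      vocabulary=list(AMINO_ACIDS))
X_aac = normalize(aac.fit_transform(sequences), norm="l1")

# DPC: relative frequency of each adjacent pair (fixed 400-column vocabulary).
dpc = CountVectorizer(analyzer="char", ngram_range=(2, 2), lowercase=False,
                      vocabulary=DIPEPTIDES)
X_dpc = normalize(dpc.fit_transform(sequences), norm="l1")

# Column-wise concatenation keeps everything sparse.
X = hstack([X_tfidf, X_aac, X_dpc]).tocsr()
```

Fixing the AAC and DPC vocabularies gives those blocks the same 20 and 400 columns on any dataset, whereas the TF-IDF block's width depends on which 3-mers appear in the training sequences.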

Machine Learning Models

The following classifiers are evaluated:

  • Decision Tree
  • Random Forest
  • Extra Trees
  • XGBoost
  • Linear Support Vector Machine (SVM)

Model Selection and Evaluation

  • Hyperparameters are optimized using GridSearchCV.
  • Stratified K-Fold Cross-Validation is employed to preserve class distributions.
  • Due to strong class imbalance, macro-averaged precision, recall, and F1-score are used as primary evaluation metrics.
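The evaluation protocol above might look like the following sketch. The data are synthetic (generated with `make_classification`, with deliberately imbalanced classes), and the `C` grid, the number of folds, and `class_weight="balanced"` are assumptions made for illustration; the README does not list the project's actual grids or settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic, imbalanced 3-class data standing in for the protein features.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)

# Stratified folds keep the class proportions roughly equal in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Macro-averaged F1 weights every class equally, so minority classes count
# as much as the majority class when hyperparameters are ranked.
search = GridSearchCV(
    estimator=LinearSVC(class_weight="balanced", max_iter=5000),  # assumed settings
    param_grid={"C": [0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="f1_macro",
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring with plain accuracy instead would let a model that mostly predicts the majority class look strong, which is the failure mode macro-averaging is meant to expose.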

Results Summary

  • Simple label encoding yields limited performance, confirming that raw categorical representations are insufficient.
  • k-mer–based representations substantially improve classification performance over the label-encoding baseline.
  • Feature engineering (TF-IDF, AAC, DPC) further enhances performance by combining local and global sequence information.
  • XGBoost and Linear SVM achieve the best overall results when paired with enriched feature spaces.
  • Bagging-style tree ensembles (Random Forest, Extra Trees) show stable performance but are less effective at exploiting very high-dimensional sparse features.

Key Takeaways

  • Feature representation is more critical than model complexity for protein function prediction.
  • k-mer–based encodings provide a strong and biologically meaningful foundation.
  • Boosting and margin-based classifiers are particularly well-suited for sparse biological data.
  • Proper handling of class imbalance is essential for reliable evaluation.

Technologies Used

  • Python
  • scikit-learn
  • XGBoost
  • NumPy / SciPy
  • pandas
  • Gene Ontology tools

Author

This project was developed as part of an academic study on machine learning applications in bioinformatics.
