Machine Learning Techniques for Protein Function Prediction

This project investigates the application of classical machine learning techniques to the problem of protein function prediction based solely on amino acid sequences. The task is formulated as a supervised multi-class classification problem, using functional annotations derived from the UniProt and Gene Ontology (GO) databases.

Project Overview

Proteins play a fundamental role in biological systems, and understanding their molecular functions is a central challenge in bioinformatics. While experimental annotation is accurate, it is costly and time-consuming, motivating the use of computational approaches.

In this project, different sequence encoding strategies and machine learning models are compared to assess how feature representation impacts predictive performance. Particular attention is given to handling high-dimensional biological data and class imbalance, both of which are common in large-scale protein datasets.

Dataset

- Source: UniProt
- Number of entries: ~570,000 proteins
- Fields used:
  - Sequence: amino acid sequence of the protein
  - Gene Ontology IDs: molecular function annotations

Functional Classes

Protein functions are derived from Gene Ontology (GO) Molecular Function terms. Since proteins are typically annotated with multiple fine-grained GO terms, a preprocessing step projects these annotations onto a reduced set of 10 biologically meaningful macro-classes using a GO-Slim mapping and the GO Directed Acyclic Graph (DAG).

Each protein is assigned exactly one functional label, so the task is a single-label, multi-class classification problem.
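
The projection onto macro-classes can be sketched as a walk up the DAG from each annotated term until a slim term is reached. The snippet below is a minimal illustration, not the project's code: the toy `PARENTS` map, the term names, the two-class `SLIM` set, and the majority-vote rule with alphabetical tie-breaking are all assumptions, since the README does not state how a single label is chosen when annotations disagree.

```python
from collections import Counter

# Toy child -> parents map standing in for the GO DAG (hypothetical term names).
PARENTS = {
    "protein_kinase_activity": ["kinase_activity"],
    "kinase_activity": ["transferase_activity"],
    "transferase_activity": ["catalytic_activity"],
    "atp_binding": ["nucleotide_binding"],
    "nucleotide_binding": ["binding"],
    "catalytic_activity": ["molecular_function"],
    "binding": ["molecular_function"],
    "molecular_function": [],
}

# Hypothetical slim: the macro-classes we project onto.
SLIM = {"catalytic_activity", "binding"}


def slim_ancestors(term, parents=PARENTS, slim=SLIM):
    """Return every slim term reachable from `term` by following parent links."""
    found, stack, seen = set(), [term], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in slim:
            found.add(node)
            continue  # stop climbing once a slim term is reached
        stack.extend(parents.get(node, []))
    return found


def macro_label(annotations):
    """Pick one macro-class per protein: the most frequent slim ancestor.

    Ties are broken alphabetically (an assumption of this sketch).
    Returns None when no annotation maps to a slim term.
    """
    votes = Counter()
    for term in annotations:
        votes.update(slim_ancestors(term))
    if not votes:
        return None
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)
```

For example, `macro_label(["protein_kinase_activity", "kinase_activity", "atp_binding"])` resolves to `"catalytic_activity"` because two of the three annotations climb to that slim term.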

Feature Representation

Several encoding strategies are explored:

1. Label Encoding (Baseline)

- Each protein sequence is mapped to a unique integer identifier.
- Serves as a computational baseline.
- Does not preserve biological or sequential information.
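
A minimal sketch of this baseline, assuming identifiers are handed out in order of first appearance (the helper name `label_encode` is hypothetical):

```python
def label_encode(sequences):
    """Map each distinct sequence to an integer in order of first appearance."""
    ids = {}
    # setdefault stores len(ids) only when the sequence has not been seen yet.
    return [ids.setdefault(seq, len(ids)) for seq in sequences]


# Two sequences that differ in a single residue receive unrelated integers.
codes = label_encode(["MKVLA", "MKVLG", "MKVLA"])
print(codes)  # [0, 1, 0]
```

The integers carry no notion of sequence similarity, which is why this encoding is only useful as a lower bound.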

2. k-mer Encoding

- Protein sequences are decomposed into overlapping substrings of length k (here k = 3, i.e. 3-mers).
- Each sequence is represented as a sparse frequency vector (20^3 = 8,000 possible 3-mers over the 20 standard amino acids).
- L1 normalization is applied so that vectors from sequences of different lengths are comparable.
- Captures local sequence patterns relevant to protein function.
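
The encoding above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions: the function name `kmer_frequencies` is hypothetical, and a plain dict stands in for the sparse vector the project presumably builds with SciPy.

```python
from collections import Counter


def kmer_frequencies(sequence, k=3):
    """Return L1-normalized counts of overlapping k-mers as a sparse dict."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k: no k-mers to count
        return {}
    # Dividing by the total count makes the values sum to 1 (L1 norm).
    return {kmer: n / total for kmer, n in counts.items()}


features = kmer_frequencies("MKVMKV")
print(features)  # {'MKV': 0.5, 'KVM': 0.25, 'VMK': 0.25}
```

Only k-mers that actually occur are stored, so a sequence of length n yields at most n - k + 1 non-zero entries regardless of the size of the full k-mer vocabulary.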

3. Feature Engineering on k-mers

Additional biologically motivated features are introduced:

- TF-IDF on k-mers: down-weights k-mers that occur in most sequences, emphasizing discriminative sequence motifs.
- Amino Acid Composition (AAC): global frequency of the 20 standard amino acids.
- Dipeptide Composition (DPC): frequency of all 400 adjacent amino acid pairs, capturing short-range order.

These features are concatenated into a high-dimensional sparse representation.
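
As a rough sketch of the two composition features and their concatenation, the snippet below computes AAC and DPC with plain Python lists. The function names are hypothetical, TF-IDF weighting is left out, and the choice to ignore pairs containing non-standard residues is an assumption of this example rather than something the README specifies.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs
DIPEPTIDE_INDEX = {d: i for i, d in enumerate(DIPEPTIDES)}


def aac(sequence):
    """Amino Acid Composition: fraction of each standard residue (20 values)."""
    n = len(sequence)
    return [sequence.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dpc(sequence):
    """Dipeptide Composition: fraction of each adjacent pair (400 values)."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    vector = [0.0] * len(DIPEPTIDES)
    for pair in pairs:
        if pair in DIPEPTIDE_INDEX:  # skip pairs with non-standard residues
            vector[DIPEPTIDE_INDEX[pair]] += 1 / len(pairs)
    return vector


def global_features(sequence):
    """Concatenate AAC and DPC into one 420-dimensional vector."""
    return aac(sequence) + dpc(sequence)
```

In a full pipeline these 420 dense values would be stacked next to the much larger sparse k-mer block, for instance with `scipy.sparse.hstack`.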

Machine Learning Models

The following classifiers are evaluated:

- Decision Tree
- Random Forest
- Extra Trees
- XGBoost
- Linear Support Vector Machine (SVM)

Model Selection and Evaluation

- Hyperparameters are optimized using GridSearchCV.
- Stratified K-Fold Cross-Validation is employed to preserve class distributions.
- Due to strong class imbalance, macro-averaged precision, recall, and F1-score are used as primary evaluation metrics.
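
The three points above fit together roughly as follows in scikit-learn. This is a sketch under stated assumptions, not the project's configuration: the data is synthetic, the `C` grid and the number of folds are illustrative, and `class_weight="balanced"` is one possible way to counter imbalance that the README does not confirm.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic, imbalanced 3-class data standing in for the protein features.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

# Stratified folds keep the class proportions roughly equal in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=LinearSVC(class_weight="balanced", max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1.0]},  # illustrative grid (assumption)
    scoring="f1_macro",  # each class contributes equally, whatever its size
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring with `f1_macro` means the selected hyperparameters are those that work best on average across classes, so a model that ignores the rare classes cannot win on the strength of the majority class alone.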

Results Summary

- Simple label encoding yields limited performance, confirming that raw categorical representations are insufficient.
- k-mer–based representations substantially improve classification performance over the label-encoding baseline.
- Feature engineering (TF-IDF, AAC, DPC) further enhances performance by combining local and global sequence information.
- XGBoost and Linear SVM achieve the best overall results when paired with enriched feature spaces.
- Bagging-style tree ensembles (Random Forest, Extra Trees) show stable performance but are less effective at exploiting very high-dimensional sparse features.

Key Takeaways

- Feature representation is more critical than model complexity for protein function prediction.
- k-mer–based encodings provide a strong and biologically meaningful foundation.
- Boosting and margin-based classifiers are particularly well-suited for sparse biological data.
- Proper handling of class imbalance is essential for reliable evaluation.

Technologies Used

- Python
- scikit-learn
- XGBoost
- NumPy / SciPy
- pandas
- Gene Ontology tools

Author

This project was developed as part of an academic study on machine learning applications in bioinformatics.

athos-innocenti/ProteinFunctionPrediction
