This project investigates the application of classical machine learning techniques to the problem of protein function prediction based solely on amino acid sequences. The task is formulated as a supervised multi-class classification problem, using functional annotations derived from the UniProt and Gene Ontology (GO) databases.
Proteins play a fundamental role in biological systems, and understanding their molecular functions is a central challenge in bioinformatics. While experimental annotation is accurate, it is costly and time-consuming, motivating the use of computational approaches.
In this project, different sequence encoding strategies and machine learning models are compared to assess how feature representation impacts predictive performance. Particular attention is given to handling high-dimensional biological data and class imbalance, both of which are common in large-scale protein datasets.
Dataset:
- Source: UniProt
- Number of entries: ~570,000 proteins
- Fields used:
  - Sequence: amino acid sequence of the protein
  - Gene Ontology IDs: molecular function annotations
Protein functions are derived from Gene Ontology (GO) Molecular Function terms. Since proteins are typically annotated with multiple fine-grained GO terms, a preprocessing step projects these annotations onto a reduced set of 10 biologically meaningful macro-classes using a GO-Slim mapping and the GO Directed Acyclic Graph (DAG).
Each protein is assigned exactly one functional label, which makes the dataset suitable for single-label, multi-class supervised learning.
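The projection onto macro-classes can be illustrated with a minimal sketch. It assumes a toy child-to-parent graph with made-up GO identifiers, a hypothetical slim mapping, and a majority-vote rule with alphabetical tie-breaking for picking the single label. None of these details are specified by the project itself.

```python
from collections import Counter, deque

# Toy child -> parents edges standing in for the GO DAG (hypothetical IDs).
PARENTS = {
    "GO:A2": ["GO:A1"],
    "GO:A1": ["GO:A"],
    "GO:B1": ["GO:B"],
    "GO:A": ["GO:ROOT"],
    "GO:B": ["GO:ROOT"],
    "GO:ROOT": [],
}

# Hypothetical GO-Slim terms and the macro-class each one represents.
SLIM_TO_CLASS = {"GO:A": "binding", "GO:B": "catalytic activity"}


def macro_classes(term):
    """Walk up the DAG from `term` and collect the macro-classes of the slim terms reached."""
    found = set()
    queue = deque([term])
    seen = {term}
    while queue:
        node = queue.popleft()
        if node in SLIM_TO_CLASS:
            found.add(SLIM_TO_CLASS[node])
            continue  # stop climbing once a slim term is reached
        for parent in PARENTS.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found


def assign_label(go_terms):
    """Pick one macro-class per protein by majority vote (assumed rule); None if nothing maps."""
    votes = Counter()
    for term in go_terms:
        votes.update(macro_classes(term))
    if not votes:
        return None
    # Highest vote count first; ties are broken alphabetically (assumption).
    return min(votes, key=lambda cls: (-votes[cls], cls))


print(assign_label(["GO:A2", "GO:A1", "GO:B1"]))
```

Proteins whose annotations reach no slim term get `None` here; whether such proteins are dropped or assigned to a catch-all class is a design decision this sketch leaves open.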
Several encoding strategies are explored:
Label encoding:
- Each protein sequence is mapped to a unique integer identifier.
- Serves as a computational baseline.
- Does not preserve biological or sequential information.
k-mer frequency encoding:
- Protein sequences are decomposed into overlapping substrings of length k (here k = 3, i.e. 3-mers).
- Each sequence is represented as a sparse frequency vector.
- L1 normalization is applied to account for variable sequence lengths.
- Captures local sequence patterns relevant to protein function.
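The k-mer encoding described above can be sketched with scikit-learn. This is one possible implementation, not necessarily the one used in the project; the example sequences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Made-up sequences of different lengths (illustration only).
sequences = ["MKTAYIAKQR", "MKTAYGG", "GGGGGG"]

# Character-level 3-grams are exactly the overlapping 3-mers of each sequence.
# lowercase=False keeps the amino acid letters as they are.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False)
counts = vectorizer.fit_transform(sequences)  # sparse matrix of raw 3-mer counts

# L1 normalization turns counts into frequencies, so each row sums to 1
# regardless of sequence length.
kmer_freqs = normalize(counts, norm="l1", axis=1)

print(kmer_freqs.shape)
```

With the 20 standard amino acids, the 3-mer vocabulary can grow to at most 20^3 = 8,000 columns, which is why a sparse matrix is the natural container.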
Additional biologically motivated features are introduced:
- TF-IDF on k-mers: emphasizes discriminative sequence motifs.
- Amino Acid Composition (AAC): global frequency of the 20 standard amino acids.
- Dipeptide Composition (DPC): frequency of all amino acid pairs, capturing short-range order.
These features are concatenated into a high-dimensional sparse representation.
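A minimal sketch of how the three feature blocks could be built and concatenated is shown below. The helper `composition` and the example sequences are hypothetical, and the TF-IDF settings are scikit-learn defaults rather than values taken from the project.

```python
from itertools import product

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def composition(seq, tokens, n):
    """Relative frequency of each token among the overlapping length-n windows of seq."""
    index = {tok: i for i, tok in enumerate(tokens)}
    vec = np.zeros(len(tokens))
    windows = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    for window in windows:
        if window in index:  # windows with non-standard residues are skipped
            vec[index[window]] += 1
    return vec / max(len(windows), 1)


# Made-up sequences (illustration only).
sequences = ["MKTAYIAKQR", "MKTAYGG", "GGGGGG"]

# TF-IDF over 3-mers down-weights k-mers that occur in most sequences.
tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False)
X_tfidf = tfidf.fit_transform(sequences)

# AAC (20 columns) and DPC (400 columns) as sparse blocks.
X_aac = csr_matrix(np.array([composition(s, AMINO_ACIDS, 1) for s in sequences]))
X_dpc = csr_matrix(np.array([composition(s, DIPEPTIDES, 2) for s in sequences]))

# Column-wise concatenation into one sparse feature matrix.
X = hstack([X_tfidf, X_aac, X_dpc], format="csr")
print(X.shape)
```

Keeping every block sparse avoids materializing a dense matrix; in a real pipeline the TF-IDF vectorizer would be fitted on the training split only.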
The following classifiers are evaluated:
- Decision Tree
- Random Forest
- Extra Trees
- XGBoost
- Linear Support Vector Machine (SVM)
Training and evaluation:
- Hyperparameters are optimized using GridSearchCV.
- Stratified K-Fold Cross-Validation is employed to preserve class distributions.
- Due to strong class imbalance, macro-averaged precision, recall, and F1-score are used as primary evaluation metrics.
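The protocol of grid search, stratified folds, and macro-averaged metrics can be sketched as follows. The synthetic imbalanced data, the parameter grid, the number of folds, and the use of `class_weight="balanced"` are all assumptions made for illustration, not settings reported by the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic, imbalanced 3-class data standing in for the protein features.
X, y = make_classification(
    n_samples=300,
    n_features=40,
    n_informative=10,
    n_classes=3,
    weights=[0.7, 0.2, 0.1],
    random_state=0,
)

# Stratified folds keep the class proportions roughly equal in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    # class_weight="balanced" is an assumed way of handling the imbalance.
    estimator=LinearSVC(class_weight="balanced", max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1.0]},  # hypothetical grid
    scoring=["precision_macro", "recall_macro", "f1_macro"],
    refit="f1_macro",  # model selection is driven by macro F1
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Macro averaging computes each metric per class and then takes the unweighted mean, so a rare class counts as much as a frequent one. Plain accuracy would instead be dominated by the majority class.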
Main results:
- Simple label encoding yields limited performance, confirming that raw categorical representations are insufficient.
- k-mer–based representations substantially improve classification performance over the label-encoding baseline.
- Feature engineering (TF-IDF, AAC, DPC) further enhances performance by combining local and global sequence information.
- XGBoost and Linear SVM achieve the best overall results when paired with enriched feature spaces.
- Bagging-style tree ensembles (Random Forest, Extra Trees) show stable performance but are less effective at exploiting very high-dimensional sparse features.
Key takeaways:
- Feature representation is more critical than model complexity for protein function prediction.
- k-mer–based encodings provide a strong and biologically meaningful foundation.
- Boosting and margin-based classifiers are particularly well-suited for sparse biological data.
- Proper handling of class imbalance is essential for reliable evaluation.
Tools and libraries:
- Python
- scikit-learn
- XGBoost
- NumPy / SciPy
- pandas
- Gene Ontology tools
This project was developed as part of an academic study on machine learning applications in bioinformatics.