VenusX is a large-scale benchmark for fine-grained protein functional annotation and pairing at residue, fragment, and domain levels.
Key features:
- Three task categories:
  - Residue-level binary classification
  - Fragment-level multi-class classification
  - Unsupervised local structure pairing
- Six functional annotation types: Active sites, Binding sites, Conserved sites, Motifs, Domains, Epitopes
- Dataset characteristics:
  - Over 878,000 samples from the InterPro, BioLiP, and SAbDab databases
  - Mixed-family and cross-family splits at 50%, 70%, and 90% sequence identity thresholds
VenusX benchmarks fine-grained protein understanding at multiple sub-protein levels through three tasks:
- residue-level binary classification: identifying functionally important residues,
- fragment-level multi-class classification: classifying fragments by biological role,
- pairwise functional similarity scoring: matching functionally similar proteins or substructures without requiring explicit function labels.
Benchmarking protein models on VenusX reveals performance gaps between global and fine-grained tasks, highlighting the need for more robust and interpretable models.
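To make the pairwise task concrete, the sketch below scores functional similarity between two sequences by cosine similarity of mean-pooled embeddings from a pretrained ESM2 checkpoint. The checkpoint choice and mean pooling are illustrative assumptions, not the benchmark's prescribed protocol.

```python
# Minimal sketch: score functional similarity of two sequences by cosine
# similarity of mean-pooled ESM2 embeddings. The checkpoint and pooling
# strategy are illustrative choices, not the official VenusX protocol.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "facebook/esm2_t30_150M_UR50D"  # ESM2-t30 baseline from the table below
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(sequence: str) -> torch.Tensor:
    """Return a mean-pooled embedding over residue positions."""
    inputs = tokenizer(sequence, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len + 2, 640)
    return hidden[1:-1].mean(dim=0)                # drop BOS/EOS tokens

def functional_similarity(seq_a: str, seq_b: str) -> float:
    return torch.cosine_similarity(embed(seq_a), embed(seq_b), dim=0).item()

print(functional_similarity("MKTAYIAKQR", "MKTAYIAKQS"))
```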
Table: Summary of baseline models (methods) by input modality
Task indicates evaluation scope:
- "All" for all three tasks
- "Sup." for supervised tasks only
- "Pair" for unsupervised pairwise similarity
| Type | Model (Method) | Version | Task | # Params | # Train. Params | Embed. Dim | Implementation |
|---|---|---|---|---|---|---|---|
| Sequence-Only | ESM2 | t30 | All | 150M | 410K | 640 | HF: ESM2-t30 |
| | ESM2 | t33 | All | 652M | 1.6M | 1,280 | HF: ESM2-t33 |
| | ESM2 | t36 | Pair | 3,000M | -- | 2,560 | HF: ESM2-t36 |
| | ESM1b | t33 | Pair | 652M | -- | 1,280 | HF: ESM-1b |
| | ProtBert | uniref | All | 420M | 1.0M | 1,024 | HF: ProtBert |
| | ProtT5 | xl_uniref50 | Pair | 3,000M | -- | 1,024 | HF: ProtT5 |
| | Ankh | base | All | 450M | 591K | 768 | HF: Ankh |
| | TM-vec | swiss_large | Pair | 3,034M | -- | 512 | GitHub: TM-vec |
| | ProstT5 | AA2fold | Pair | 3,000M | -- | 1,024 | HF: ProstT5 |
| | BLAST | -- | Pair | -- | -- | -- | Conda: BLAST |
| Sequence-Structure | SaProt | 35M_AF2 | All | 35M | 231K | 480 | HF: SaProt-AF2 |
| | SaProt | 650M_PDB | All | 650M | 1.6M | 1,280 | HF: SaProt-PDB |
| | ProtSSN | k20_h512 | All | 800M | 1.6M | 1,280 | HF: ProtSSN |
| | ESM-IF1 | -- | Pair | 148M | -- | 512 | HF: ESM-IF1 |
| | MIF-ST | -- | Pair | 643M | -- | 256 | GitHub: MIF-ST |
| | Foldseek | 3Di-AA | Pair | -- | -- | -- | Conda: Foldseek |
| Structure-Only | GVP-GNN | 3-layers | Sup. | 3M | 3M | 512 | GitHub: GVP |
| | Foldseek | 3Di | Pair | -- | -- | -- | Conda: Foldseek |
| | TM-align | mean | Pair | -- | -- | -- | Conda: TM-align |
Please make sure you have installed Anaconda3 or Miniconda3.
```bash
git clone https://github.com/AI4Protein/VenusX.git
cd VenusX
```
You can create the required environment using either of the following two methods.
```bash
conda env create -f environment.yaml
conda activate VenusX
```

or

```bash
conda create -n VenusX python=3.8.18
conda activate VenusX
pip install -r requirements.txt
```
All data processing and baseline experiments were conducted on 16 NVIDIA RTX 4090 GPUs. If you plan to experiment with larger deep learning models, additional hardware resources may be necessary.
The datasets for the VenusX benchmark can be viewed and downloaded on Hugging Face: AI4Protein/VenusX_Dataset. For example, AI4Protein/VenusX_Res_Act_MF50 is an active-site dataset for identifying functionally important residues; it is clustered at 50% fragment similarity and divided into training, validation, and test sets according to the mixed-family (MF) scheme.
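For instance, a single dataset can be pulled with the Hugging Face `datasets` library; the field names are not documented here, so inspect an example to see the actual schema.

```python
# Minimal sketch: load one VenusX dataset from the Hugging Face Hub and
# inspect its schema (column names vary by task, so print them first).
from datasets import load_dataset

ds = load_dataset("AI4Protein/VenusX_Res_Act_MF50")
print(ds)              # available splits, e.g. train/validation/test
print(ds["train"][0])  # one example; field names depend on the task
```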
(1) The script VenusX/script/example/train/train_token_cls.sh demonstrates how to train a deep learning model for identifying functionally important residues (a minimal stand-alone sketch of this setup follows below).
(2) The script VenusX/script/example/train/train_fragment_cls.sh demonstrates how to train a deep learning model for classifying fragments according to their biological roles.
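The sketch below illustrates the residue-level setup behind (1): fine-tuning an ESM2 token-classification head on per-residue 0/1 labels. The checkpoint, label alignment, and toy data are assumptions for illustration; the repo script may differ in its details.

```python
# Minimal sketch of residue-level (token) classification with ESM2, assuming
# each example provides a sequence string plus a per-residue 0/1 label list.
# Illustrative only; the repo's train_token_cls.sh may differ in details.
import torch
from transformers import AutoTokenizer, EsmForTokenClassification

MODEL_ID = "facebook/esm2_t30_150M_UR50D"           # ESM2-t30 baseline
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = EsmForTokenClassification.from_pretrained(MODEL_ID, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

sequence = "MKTAYIAKQR"                              # toy sequence
labels = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]              # toy per-residue labels

inputs = tokenizer(sequence, return_tensors="pt")
# The ESM tokenizer adds BOS/EOS tokens; mask them out with label -100.
token_labels = torch.tensor([[-100] + labels + [-100]])

outputs = model(**inputs, labels=token_labels)       # cross-entropy over residues
outputs.loss.backward()
optimizer.step()                                     # one illustrative update
```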
The folder VenusX/script/example/embedding contains scripts for obtaining protein or fragment embeddings with both deep learning models and traditional methods. Note: please set the dataset path in each script according to your local setup.
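As a rough picture of what such an embedding script produces, the snippet below extracts a fragment embedding by mean-pooling per-residue ESM2 states over a residue range. The fragment indexing and pooling here are assumptions; check the repo scripts for the exact procedure.

```python
# Minimal sketch: build a fragment embedding by slicing per-residue ESM2
# states over a fragment's residue range and mean-pooling. The indexing and
# pooling are assumptions, not the repo's exact procedure.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "facebook/esm2_t30_150M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def fragment_embedding(sequence: str, start: int, end: int) -> torch.Tensor:
    """Mean-pooled embedding of residues sequence[start:end] (0-based, half-open)."""
    inputs = tokenizer(sequence, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state[0, 1:-1]  # drop BOS/EOS tokens
    return hidden[start:end].mean(dim=0)

emb = fragment_embedding("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 5, 20)
print(emb.shape)  # torch.Size([640]) for the ESM2-t30 checkpoint
```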
If you find this work useful, please consider citing:
This project is licensed under the terms of the CC-BY-NC-ND-4.0 license.