
ai4protein/VenusX


VenusX: Unlocking Fine-Grained Functional Understanding of Proteins

🚀 Introduction

VenusX is a large-scale benchmark for fine-grained protein functional annotation and pairing at residue, fragment, and domain levels.

Key features:

  • Three task categories:

    • Residue-level binary classification
    • Fragment-level multi-class classification
    • Unsupervised local structure pairing
  • Six functional annotation types: Active sites, Binding sites, Conserved sites, Motifs, Domains, Epitopes

  • Dataset characteristics:

    • Over 878,000 samples from InterPro, BioLiP, and SAbDab databases
    • Mixed-family and cross-family splits at 50%, 70%, and 90% sequence identity thresholds


📑 Results

Paper Results

VenusX benchmarks fine-grained protein understanding across multiple subprotein levels through three tasks:

  • residue-level binary classification: identifying functionally important residues,
  • fragment-level multi-class classification: classifying fragments by biological role,
  • pairwise functional similarity scoring: matching functionally similar proteins or substructures without requiring explicit function labels.
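To make the third task concrete, here is a minimal sketch of label-free pairwise scoring: each protein or fragment is represented by an embedding vector, and candidates are ranked by their similarity to a query. Cosine similarity, the function names, and the toy vectors are all assumptions chosen for illustration; the README does not state which similarity measure the benchmark uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real embeddings would come from a model.
    query = [1.0, 0.0, 1.0]
    candidates = {
        "fragment_a": [1.0, 0.1, 0.9],
        "fragment_b": [0.0, 1.0, 0.0],
    }
    for name, score in rank_by_similarity(query, candidates):
        print(f"{name}: {score:.3f}")
```

No function labels are needed at scoring time; labels would only be used afterwards to check whether the top-ranked candidates share the query's function.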

Benchmarking protein models on VenusX reveals performance gaps between global and fine-grained tasks, highlighting the need for more robust and interpretable models.

Baselines

Table: Summary of baseline models (methods) by input modality

Task indicates evaluation scope:

  • "All" for all three tasks
  • "Sup." for supervised tasks only
  • "Pair" for unsupervised pairwise similarity
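| Type | Model (Method) | Version | Task | # Params | # Train. Params | Embed. Dim | Implementation |
|---|---|---|---|---|---|---|---|
| Sequence-Only | ESM2 | t30 | All | 150M | 410K | 640 | HF: ESM2-t30 |
| Sequence-Only | ESM2 | t33 | All | 652M | 1.6M | 1,280 | HF: ESM2-t33 |
| Sequence-Only | ESM2 | t36 | Pair | 3,000M | -- | 2,560 | HF: ESM2-t36 |
| Sequence-Only | ESM1b | t33 | Pair | 652M | -- | 1,280 | HF: ESM-1b |
| Sequence-Only | ProtBert | uniref | All | 420M | 1.0M | 1,024 | HF: ProtBert |
| Sequence-Only | ProtT5 | xl_uniref50 | Pair | 3,000M | -- | 1,024 | HF: ProtT5 |
| Sequence-Only | Ankh | base | All | 450M | 591K | 768 | HF: Ankh |
| Sequence-Only | TM-vec | swiss_large | Pair | 3,034M | -- | 512 | GitHub: TM-vec |
| Sequence-Only | ProstT5 | AA2fold | Pair | 3,000M | -- | 1,024 | HF: ProstT5 |
| Sequence-Only | BLAST | -- | Pair | -- | -- | -- | Conda: BLAST |
| Sequence-Structure | SaProt | 35M_AF2 | All | 35M | 231K | 480 | HF: SaProt-AF2 |
| Sequence-Structure | SaProt | 650M_PDB | All | 650M | 1.6M | 1,280 | HF: SaProt-PDB |
| Sequence-Structure | ProtSSN | k20_h512 | All | 800M | 1.6M | 1,280 | HF: ProtSSN |
| Sequence-Structure | ESM-IF1 | -- | Pair | 148M | -- | 512 | HF: ESM-IF1 |
| Sequence-Structure | MIF-ST | -- | Pair | 643M | -- | 256 | GitHub: MIF-ST |
| Sequence-Structure | Foldseek | 3Di-AA | Pair | -- | -- | -- | Conda: Foldseek |
| Structure-Only | GVP-GNN | 3-layers | Sup. | 3M | 3M | 512 | GitHub: GVP |
| Structure-Only | Foldseek | 3Di | Pair | -- | -- | -- | Conda: Foldseek |
| Structure-Only | TM-align | mean | Pair | -- | -- | -- | Conda: TM-align |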
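
As a reading aid for the "# Train. Params" column, the short check below compares the reported counts with d × d + d, the size of a single dense layer with bias on top of a frozen encoder of embedding dimension d. This is only an inference from the numbers in the table: the README does not describe the classification head, so the formula and the helper names are assumptions. GVP-GNN is left out because the table lists all of its parameters as trainable.

```python
def dense_layer_params(embed_dim):
    """Parameters of one embed_dim x embed_dim dense layer plus its bias (assumed head shape)."""
    return embed_dim * embed_dim + embed_dim


def format_count(n):
    """Format a parameter count the way the table does (e.g. 410K, 1.6M)."""
    if n >= 1_000_000:
        return f"{n / 1_000_000:.1f}M"
    return f"{round(n / 1_000)}K"


# (model, embedding dimension, trainable parameters as reported in the table)
REPORTED = [
    ("ESM2 t30", 640, "410K"),
    ("ESM2 t33", 1280, "1.6M"),
    ("ProtBert uniref", 1024, "1.0M"),
    ("Ankh base", 768, "591K"),
    ("SaProt 35M_AF2", 480, "231K"),
    ("SaProt 650M_PDB", 1280, "1.6M"),
    ("ProtSSN k20_h512", 1280, "1.6M"),
]

if __name__ == "__main__":
    for model, dim, reported in REPORTED:
        estimate = format_count(dense_layer_params(dim))
        print(f"{model}: estimated {estimate}, reported {reported}")
```

If the estimate matches, the trainable part of each supervised baseline is small relative to the backbone, which is what the gap between "# Params" and "# Train. Params" suggests.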

🛫 Requirements

Conda Environment

Please make sure you have installed Anaconda3 or Miniconda3.

git clone https://github.com/AI4Protein/VenusX.git
cd VenusX

You can create the required environment using either of the following two methods.

conda env create -f environment.yaml
conda activate VenusX

or

conda create -n VenusX python=3.8.18
conda activate VenusX
pip install -r requirements.txt

Hardware

All data processing and baseline experiments were conducted on 16 NVIDIA RTX 4090 GPUs. If you plan to experiment with larger deep learning models, additional hardware resources may be necessary.

🧬 Start with VenusX

Dataset Information

The datasets for the VenusX benchmark can be viewed and downloaded on Hugging Face: AI4Protein: VenusX_Dataset. For example, AI4Protein/VenusX_Res_Act_MF50 is the active site dataset used for identifying functionally important residues. It is clustered at 50% fragment similarity and divided into training, validation, and test sets according to the mixed-family split.
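
The example name above can be taken apart programmatically. The sketch below splits a dataset ID into its components and decodes only the codes this README explains (Res, Act, MF); other codes are returned as-is because their meaning is not stated here. The function name and the assumption that every dataset ID follows the same four-part pattern are hypothetical.

```python
import re

# Only the codes explained by the example in this README are decoded.
KNOWN_CODES = {
    "Res": "residue-level",
    "Act": "active site",
    "MF": "mixed-family",
}


def parse_dataset_id(repo_id):
    """Split an ID such as 'AI4Protein/VenusX_Res_Act_MF50' into labelled parts."""
    org, _, name = repo_id.partition("/")
    parts = name.split("_")
    if len(parts) != 4 or parts[0] != "VenusX":
        raise ValueError(f"unexpected dataset name: {repo_id!r}")
    _, level, annotation, split = parts
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", split)
    if match is None:
        raise ValueError(f"unexpected split code: {split!r}")
    split_code, threshold = match.group(1), int(match.group(2))
    return {
        "organization": org,
        "level": KNOWN_CODES.get(level, level),
        "annotation": KNOWN_CODES.get(annotation, annotation),
        "split": KNOWN_CODES.get(split_code, split_code),
        "threshold": threshold,
    }


if __name__ == "__main__":
    print(parse_dataset_id("AI4Protein/VenusX_Res_Act_MF50"))
```

Assuming the Hugging Face `datasets` package is installed and the repository is in a loadable format, the same ID can be passed to `datasets.load_dataset("AI4Protein/VenusX_Res_Act_MF50")` to download the data.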


Train

(1) The VenusX/script/example/train/train_token_cls.sh script demonstrates how to train a deep learning model for identifying functionally important residues.

(2) The VenusX/script/example/train/train_fragment_cls.sh script demonstrates how to train a deep learning model for classifying fragments according to biological roles.

Compute protein (fragment) embeddings

The VenusX/script/example/embedding folder contains scripts for obtaining protein or fragment embeddings using deep learning models and traditional methods. Note: set the dataset path in each script to match your local setup.
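
For orientation, one common way to turn per-residue embeddings into a single fragment embedding is to average the vectors of the residues inside the fragment. The sketch below shows that idea in plain Python; mean pooling, the function name, and the 0-based half-open indexing are assumptions for illustration and may differ from what the repository scripts do.

```python
def mean_pool_fragment(residue_embeddings, start, end):
    """Average the per-residue vectors in [start, end) into one fragment vector.

    residue_embeddings: list of equal-length lists, one per residue.
    start, end: 0-based fragment boundaries, end exclusive (assumed convention).
    """
    fragment = residue_embeddings[start:end]
    if not fragment:
        raise ValueError("fragment is empty; check start and end")
    dim = len(fragment[0])
    return [sum(row[i] for row in fragment) / len(fragment) for i in range(dim)]


if __name__ == "__main__":
    # Toy per-residue embeddings for a 3-residue protein with 2-dimensional vectors.
    per_residue = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(mean_pool_fragment(per_residue, 1, 3))
```

Pooling over the fragment rather than the whole protein keeps the embedding focused on the local region being compared.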

🙌 Citation

If you find this work useful, please consider citing:

📝 License

This project is licensed under the terms of the CC-BY-NC-ND-4.0 license.
