VenusX is a large-scale benchmark for fine-grained protein functional annotation and pairing at residue, fragment, and domain levels.
Key features:
- Three task categories:
  - Residue-level binary classification
  - Fragment-level multi-class classification
  - Unsupervised local structure pairing
- Six functional annotation types: Active sites, Binding sites, Conserved sites, Motifs, Domains, Epitopes
- Dataset characteristics:
  - Over 878,000 samples from the InterPro, BioLiP, and SAbDab databases
  - Mixed-family and cross-family splits at 50%, 70%, and 90% sequence identity thresholds
VenusX benchmarks fine-grained protein understanding at multiple sub-protein levels through three tasks:
- residue-level binary classification: identifying functionally important residues,
- fragment-level multi-class classification: classifying fragments by biological role,
- pairwise functional similarity scoring: matching functionally similar proteins or substructures without requiring explicit function labels.
Benchmarking protein models on VenusX reveals performance gaps between global and fine-grained tasks, highlighting the need for more robust and interpretable models.
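To make the pairwise task concrete, the sketch below scores functional similarity between two sequences by cosine similarity of mean-pooled embeddings from a pretrained ESM2 checkpoint. The checkpoint choice and mean pooling are illustrative assumptions, not the benchmark's prescribed protocol.

```python
# Minimal sketch: score functional similarity of two sequences by cosine
# similarity of mean-pooled ESM2 embeddings. The checkpoint and pooling
# strategy are illustrative choices, not the official VenusX protocol.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "facebook/esm2_t30_150M_UR50D"  # ESM2-t30 baseline from the table below
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(sequence: str) -> torch.Tensor:
    """Return a mean-pooled embedding over residue positions."""
    inputs = tokenizer(sequence, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len + 2, 640)
    return hidden[1:-1].mean(dim=0)                # drop BOS/EOS tokens

def functional_similarity(seq_a: str, seq_b: str) -> float:
    return torch.cosine_similarity(embed(seq_a), embed(seq_b), dim=0).item()

print(functional_similarity("MKTAYIAKQR", "MKTAYIAKQS"))
```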
Table: Summary of baseline models (methods) by input modality
Task indicates evaluation scope:
- "All" for all three tasks
- "Sup." for supervised tasks only
- "Pair" for unsupervised pairwise similarity
| Type | Model (Method) | Version | Task | # Params | # Train. Params | Embed. Dim | Implementation |
|---|---|---|---|---|---|---|---|
| Sequence-Only | ESM2 | t30 | All | 150M | 410K | 640 | HF: ESM2-t30 |
| | ESM2 | t33 | All | 652M | 1.6M | 1,280 | HF: ESM2-t33 |
| | ESM2 | t36 | Pair | 3,000M | -- | 2,560 | HF: ESM2-t36 |
| | ESM1b | t33 | Pair | 652M | -- | 1,280 | HF: ESM-1b |
| | ProtBert | uniref | All | 420M | 1.0M | 1,024 | HF: ProtBert |
| | ProtT5 | xl_uniref50 | Pair | 3,000M | -- | 1,024 | HF: ProtT5 |
| | Ankh | base | All | 450M | 591K | 768 | HF: Ankh |
| | TM-vec | swiss_large | Pair | 3,034M | -- | 512 | GitHub: TM-vec |
| | ProstT5 | AA2fold | Pair | 3,000M | -- | 1,024 | HF: ProstT5 |
| | BLAST | -- | Pair | -- | -- | -- | Conda: BLAST |
| Sequence-Structure | SaProt | 35M_AF2 | All | 35M | 231K | 480 | HF: SaProt-AF2 |
| | SaProt | 650M_PDB | All | 650M | 1.6M | 1,280 | HF: SaProt-PDB |
| | ProtSSN | k20_h512 | All | 800M | 1.6M | 1,280 | HF: ProtSSN |
| | ESM-IF1 | -- | Pair | 148M | -- | 512 | HF: ESM-IF1 |
| | MIF-ST | -- | Pair | 643M | -- | 256 | GitHub: MIF-ST |
| | Foldseek | 3Di-AA | Pair | -- | -- | -- | Conda: Foldseek |
| Structure-Only | GVP-GNN | 3-layers | Sup. | 3M | 3M | 512 | GitHub: GVP |
| | Foldseek | 3Di | Pair | -- | -- | -- | Conda: Foldseek |
| | TM-align | mean | Pair | -- | -- | -- | Conda: TM-align |
Please make sure you have installed Anaconda3 or Miniconda3.
```bash
git clone https://github.com/AI4Protein/VenusX.git
cd VenusX
```
You can create the required environment using either of the following two methods.
```bash
conda env create -f environment.yaml
conda activate VenusX
```

or

```bash
conda create -n VenusX python=3.8.18
conda activate VenusX
pip install -r requirements.txt
```
All data processing and baseline experiments were conducted on 16 NVIDIA RTX 4090 GPUs. If you plan to experiment with larger deep learning models, additional hardware resources may be necessary.
The datasets for the VenusX benchmark can be viewed and downloaded on Hugging Face: AI4Protein/VenusX_Dataset. For example, AI4Protein/VenusX_Res_Act_MF50 is an active-site dataset for identifying functionally important residues; it is clustered at 50% fragment similarity and divided into training, validation, and test sets according to the mixed-family (MF) scheme.
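For instance, a single dataset can be pulled with the Hugging Face `datasets` library; the field names are not documented here, so inspect an example to see the actual schema.

```python
# Minimal sketch: load one VenusX dataset from the Hugging Face Hub and
# inspect its schema (column names vary by task, so print them first).
from datasets import load_dataset

ds = load_dataset("AI4Protein/VenusX_Res_Act_MF50")
print(ds)              # available splits, e.g. train/validation/test
print(ds["train"][0])  # one example; field names depend on the task
```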
(1) The script VenusX/script/example/train/train_token_cls.sh demonstrates how to train a deep learning model for identifying functionally important residues (a minimal stand-alone sketch of this setup follows below).
(2) The script VenusX/script/example/train/train_fragment_cls.sh demonstrates how to train a deep learning model for classifying fragments according to their biological roles.
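The sketch below illustrates the residue-level setup behind (1): fine-tuning an ESM2 token-classification head on per-residue 0/1 labels. The checkpoint, label alignment, and toy data are assumptions for illustration; the repo script may differ in its details.

```python
# Minimal sketch of residue-level (token) classification with ESM2, assuming
# each example provides a sequence string plus a per-residue 0/1 label list.
# Illustrative only; the repo's train_token_cls.sh may differ in details.
import torch
from transformers import AutoTokenizer, EsmForTokenClassification

MODEL_ID = "facebook/esm2_t30_150M_UR50D"           # ESM2-t30 baseline
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = EsmForTokenClassification.from_pretrained(MODEL_ID, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

sequence = "MKTAYIAKQR"                              # toy sequence
labels = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]              # toy per-residue labels

inputs = tokenizer(sequence, return_tensors="pt")
# The ESM tokenizer adds BOS/EOS tokens; mask them out with label -100.
token_labels = torch.tensor([[-100] + labels + [-100]])

outputs = model(**inputs, labels=token_labels)       # cross-entropy over residues
outputs.loss.backward()
optimizer.step()                                     # one illustrative update
```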
The folder VenusX/script/example/embedding contains scripts for obtaining protein or fragment embeddings with both deep learning models and traditional methods. Note: please set the dataset path in each script according to your local setup.
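As a rough picture of what such an embedding script produces, the snippet below extracts a fragment embedding by mean-pooling per-residue ESM2 states over a residue range. The fragment indexing and pooling here are assumptions; check the repo scripts for the exact procedure.

```python
# Minimal sketch: build a fragment embedding by slicing per-residue ESM2
# states over a fragment's residue range and mean-pooling. The indexing and
# pooling are assumptions, not the repo's exact procedure.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "facebook/esm2_t30_150M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def fragment_embedding(sequence: str, start: int, end: int) -> torch.Tensor:
    """Mean-pooled embedding of residues sequence[start:end] (0-based, half-open)."""
    inputs = tokenizer(sequence, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state[0, 1:-1]  # drop BOS/EOS tokens
    return hidden[start:end].mean(dim=0)

emb = fragment_embedding("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 5, 20)
print(emb.shape)  # torch.Size([640]) for the ESM2-t30 checkpoint
```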
If you find this work useful, please consider citing:
This project is licensed under the terms of the CC-BY-NC-ND-4.0 license.