GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins

This repository contains the official code for the paper: "GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins" (arXiv:2510.00774).

📖 About

GeoGraph is a lightweight, simulation-informed transformer model that predicts aggregate geometric properties of intrinsically disordered protein (IDP) conformational ensembles directly from their amino acid sequence.

Given a protein sequence, GeoGraph infers the ensemble-averaged values for:

End-to-end distance ($R_e$)
Radius of gyration ($R_g$)
Asphericity ($\Delta$)
Flory scaling exponent ($\nu$)
Flory scaling prefactor ($A_0$)
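
As a rough illustration of what the first three quantities measure, the sketch below computes them for a single conformation from 3D residue coordinates, using the standard equal-mass polymer-physics definitions. The function names are hypothetical and not part of the GeoGraph package. How per-conformation values are averaged over an ensemble is a convention set by the paper, and this sketch does not assume one.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def end_to_end_distance(coords: Sequence[Point]) -> float:
    """Distance between the first and last residue positions (R_e)."""
    return math.dist(coords[0], coords[-1])


def gyration_tensor(coords: Sequence[Point]) -> List[List[float]]:
    """3x3 gyration tensor Q_ab = mean((r_a - c_a) * (r_b - c_b)), equal masses."""
    n = len(coords)
    centroid = [sum(p[a] for p in coords) / n for a in range(3)]
    return [
        [
            sum((p[a] - centroid[a]) * (p[b] - centroid[b]) for p in coords) / n
            for b in range(3)
        ]
        for a in range(3)
    ]


def radius_of_gyration(coords: Sequence[Point]) -> float:
    """R_g is the square root of the trace of the gyration tensor."""
    q = gyration_tensor(coords)
    return math.sqrt(q[0][0] + q[1][1] + q[2][2])


def asphericity(coords: Sequence[Point]) -> float:
    """Delta = (3/2) * Tr(Qhat^2) / Tr(Q)^2, where Qhat is the traceless part of Q.

    The value is 0 for a spherically symmetric shape and 1 for a perfect rod.
    """
    q = gyration_tensor(coords)
    trace = q[0][0] + q[1][1] + q[2][2]
    dev = [
        [q[a][b] - (trace / 3.0 if a == b else 0.0) for b in range(3)]
        for a in range(3)
    ]
    # Qhat is symmetric, so Tr(Qhat^2) equals the sum of its squared entries.
    trace_dev_sq = sum(dev[a][b] ** 2 for a in range(3) for b in range(3))
    return 1.5 * trace_dev_sq / trace**2


# Toy conformations (not real protein data): a straight rod and an octahedron.
rod = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
octahedron = [
    (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0), (0.0, -1.0, 0.0),
    (0.0, 0.0, 1.0), (0.0, 0.0, -1.0),
]
```

The rod and octahedron are the two limiting cases of the asphericity: fully extended and fully symmetric.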

🔧 Installation

We use uv for environment management (see the official uv documentation for installation instructions).

Once uv is installed, run uv sync from the project root to set up your local virtual environment.

⚡ Quickstart

Loading the Model

First, download the pre-trained model checkpoint from Hugging Face Hub:

import torch
from geograph.model.geograph import GeoGraph
from huggingface_hub import hf_hub_download

# Download the checkpoint
ckpt_path = hf_hub_download(
    repo_id="jeanq1/GeoGraph", 
    filename="model.ckpt", 
    local_dir="./storage"
)

# Load hyperparameters and state dict.
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model = GeoGraph(**ckpt['hyper_parameters'])
model.load_state_dict(ckpt['state_dict'])
model.eval()

Inferring Geometric Properties

You can get predictions for a list of sequences as follows:

sequences = [
    "MDDNHYPHHHHNHHNHHSTSGGCGESQFTTKLSVNTFARTHPMIQNDLIDLDLISGSAFTMKSKSQQ",
    "PADRDLSSPFGSTVPGVGPNAAAASNAAAAAAAAATAGSNKHQTPPTTFR",
]

# Tokenize inputs and run inference
inputs = model.tokenize(sequences)
with torch.no_grad():
    outputs = model(**inputs)
    
geometric_features = outputs['geometric_features']
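
For a long list of sequences, it may be preferable to run inference in smaller batches rather than in a single call. The helper below is a minimal sketch of that idea. `batched` and the batch size are hypothetical and not part of the GeoGraph package; only `model.tokenize` and the `model(**inputs)` call come from the example above.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Intended usage with the model loaded above (left commented out so that
# this block runs without the checkpoint):
#
# all_features = []
# for batch in batched(sequences, batch_size=32):
#     inputs = model.tokenize(batch)
#     with torch.no_grad():
#         outputs = model(**inputs)
#     all_features.append(outputs['geometric_features'])
```

Keeping the original order makes it straightforward to match each prediction back to its input sequence.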

Extracting Embeddings

To get the final-layer residue-level embeddings, pass return_embeddings=True to the model call.

with torch.no_grad():
    outputs = model(
        return_embeddings=True, 
        return_geometric_features=False, 
        **inputs
    )
    
embeddings = outputs['embeddings']

📊 Evaluation on the Human-IDRome Dataset

Dataset

GeoGraph was trained and evaluated on the Human-IDRome dataset, which contains 28,058 IDP ensembles generated with the CALVADOS-2 coarse-grained force field.

We provide our 80/10/10 sequence-similarity-based splits and the precomputed target features. You can download this metadata from Hugging Face:

import pandas as pd
from huggingface_hub import hf_hub_download

df_path = hf_hub_download(
    repo_id="jeanq1/GeoGraph", 
    filename="splits_and_features.csv", 
    local_dir="./storage"
)
df = pd.read_csv(df_path)

print(df[df['split'] == 'test'].head())

Reproducing Results

To run the evaluation, execute the following script:

uv run scripts/run_evaluation.py

This will load the test split, run inference with the GeoGraph model, and save the resulting figure to the figures directory.
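
The metrics reported by the evaluation script are not listed here. As a hedged sketch of how predicted and target values for one property could be compared, the snippet below computes the mean absolute error and the Pearson correlation. The function names and toy numbers are illustrative assumptions and are not taken from scripts/run_evaluation.py.

```python
import math
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy values only (not results from the paper).
targets = [1.0, 2.0, 3.0, 4.0]
predictions = [2.0, 4.0, 6.0, 8.0]
mae = mean_absolute_error(targets, predictions)
r = pearson_r(targets, predictions)
```

In the toy data the predictions are perfectly correlated with the targets yet systematically too large, which is why it helps to look at an error metric alongside a correlation.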

Building the Dataset (Optional)

If you wish to download the full Human-IDRome dataset and compute the geometric features from scratch, you can use the scripts scripts/download_dataset.py and scripts/compute_geometric_features.py.
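
To give a sense of what computing the Flory parameters can involve, the sketch below fits $\nu$ and $A_0$ from the usual scaling form $R_{ij} \approx A_0 |i-j|^{\nu}$ by linear least squares in log-log space. This is an assumption about the general approach, not a description of scripts/compute_geometric_features.py. The function name, the range of sequence separations, and the toy data are all hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_flory_scaling(
    separations: Sequence[float], distances: Sequence[float]
) -> Tuple[float, float]:
    """Fit R = A_0 * s**nu by least squares on log R = log A_0 + nu * log s.

    `separations` are sequence separations |i - j| (all > 0) and `distances`
    are the matching ensemble-averaged inter-residue distances (all > 0).
    Returns (nu, A_0).
    """
    if len(separations) != len(distances) or len(separations) < 2:
        raise ValueError("need at least two matching (separation, distance) pairs")
    xs = [math.log(s) for s in separations]
    ys = [math.log(d) for d in distances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    nu = sxy / sxx                           # slope of the log-log line
    a0 = math.exp(mean_y - nu * mean_x)      # intercept mapped back from log space
    return nu, a0


# Synthetic, noise-free data generated from known parameters (not real data).
true_nu, true_a0 = 0.5, 0.55
seps = list(range(5, 51))
dists = [true_a0 * s**true_nu for s in seps]
fitted_nu, fitted_a0 = fit_flory_scaling(seps, dists)
```

A log-log fit weights all separations equally in log space. A nonlinear fit in linear space would weight long separations more heavily, so the two can give slightly different parameters on noisy data.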

📦 Additional Contributions

In addition to the GeoGraph model, our work included the curation of IDP-Euka-90, a large-scale dataset of 30 million IDP sequences, and the fine-tuning of two ESM-2 models (8M and 150M) on this data.

You can access these resources on Hugging Face:

IDP-Euka-90 dataset: jeanq1/IDP-Euka-90
IDP-ESM2-8M model: jeanq1/IDP-ESM2-8M
IDP-ESM2-150M model: jeanq1/IDP-ESM2-150M

🤝 Cite our work

If you use GeoGraph in your work, please cite it using the following BibTeX entry:

@article{quinn2025geograph,
  title={GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins},
  author={Quinn, Eoin and Carobene, Marco and Quentin, Jean and Boyer, Sebastien and Arbes{\'u}, Miguel and Bent, Oliver},
  journal={arXiv preprint arXiv:2510.00774},
  year={2025}
}
