This repository contains the official code for the paper: "GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins" (arXiv:2510.00774).
GeoGraph is a lightweight, simulation-informed transformer model that predicts aggregate geometric properties of intrinsically disordered protein (IDP) conformational ensembles directly from their amino acid sequence.
Given a protein sequence, GeoGraph infers the ensemble-averaged values for:
- End-to-end distance ($R_e$)
- Radius of gyration ($R_g$)
- Asphericity ($\Delta$)
- Flory scaling exponent ($\nu$)
- Flory scaling prefactor ($A_0$)
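For orientation, the first three quantities can be computed for a single conformation directly from its bead coordinates. The sketch below is not taken from this repository: the helper name `shape_descriptors` is hypothetical, it assumes one bead per residue with equal weights, and it uses one common definition of asphericity based on the gyration tensor. The exact definitions and the ensemble-averaging convention behind the training targets may differ.

```python
import math


def shape_descriptors(coords):
    """Return (R_e, R_g, asphericity) for one conformation.

    coords: list of (x, y, z) bead positions, one per residue.
    Hypothetical helper for illustration; not part of the geograph package.
    """
    n = len(coords)

    # End-to-end distance: distance between the first and last bead.
    r_e = math.dist(coords[0], coords[-1])

    # Centre of geometry (all beads weighted equally, an assumption).
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    centered = [[c[k] - center[k] for k in range(3)] for c in coords]

    # Gyration tensor S_ab = (1/n) * sum_i r_ia * r_ib.
    s = [
        [sum(r[a] * r[b] for r in centered) / n for b in range(3)]
        for a in range(3)
    ]

    # R_g^2 is the trace of the gyration tensor.
    trace = s[0][0] + s[1][1] + s[2][2]
    r_g = math.sqrt(trace)

    # Asphericity = 3/2 * tr(D^2) / tr(S)^2, where D is the traceless part of S.
    # This equals the usual eigenvalue formula without needing a diagonalization.
    dev = [
        [s[a][b] - (trace / 3 if a == b else 0.0) for b in range(3)]
        for a in range(3)
    ]
    tr_dev_sq = sum(dev[a][b] * dev[b][a] for a in range(3) for b in range(3))
    asphericity = 1.5 * tr_dev_sq / trace**2

    return r_e, r_g, asphericity


# A straight rod of five beads with unit spacing along the x-axis.
rod = [(float(i), 0.0, 0.0) for i in range(5)]
r_e, r_g, asphericity = shape_descriptors(rod)
```

With this definition the asphericity ranges from 0 for a spherically symmetric arrangement to 1 for a perfectly straight rod, so the `rod` example sits at the upper limit. Ensemble-level values would then be obtained by averaging over conformations.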
We use uv for environment management (see the official uv documentation for installation instructions).
Once uv is installed, run `uv sync` from the project root to set up your local virtual environment.
First, download the pre-trained model checkpoint from the Hugging Face Hub and load it:

```python
import torch
from geograph.model.geograph import GeoGraph
from huggingface_hub import hf_hub_download

# Download the checkpoint
ckpt_path = hf_hub_download(
    repo_id="jeanq1/GeoGraph",
    filename="model.ckpt",
    local_dir="./storage"
)

# Load hyperparameters and state dict.
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model = GeoGraph(**ckpt['hyper_parameters'])
model.load_state_dict(ckpt['state_dict'])
model.eval()
```

You can get predictions for a list of sequences as follows:

```python
sequences = [
    "MDDNHYPHHHHNHHNHHSTSGGCGESQFTTKLSVNTFARTHPMIQNDLIDLDLISGSAFTMKSKSQQ",
    "PADRDLSSPFGSTVPGVGPNAAAASNAAAAAAAAATAGSNKHQTPPTTFR",
]

# Tokenize inputs and run inference
inputs = model.tokenize(sequences)
with torch.no_grad():
    outputs = model(**inputs)
geometric_features = outputs['geometric_features']
```

To get the final-layer residue-level embeddings, set the `return_embeddings=True` flag.

```python
with torch.no_grad():
    outputs = model(
        return_embeddings=True,
        return_geometric_features=False,
        **inputs
    )
embeddings = outputs['embeddings']
```

GeoGraph was trained and evaluated on the Human-IDRome dataset, which contains 28,058 IDP ensembles generated with the CALVADOS-2 coarse-grained force field.
We provide our 80/10/10 sequence-similarity-based splits and the precomputed target features. You can download this metadata from Hugging Face:

```python
import pandas as pd
from huggingface_hub import hf_hub_download

df_path = hf_hub_download(
    repo_id="jeanq1/GeoGraph",
    filename="splits_and_features.csv",
    local_dir="./storage"
)
df = pd.read_csv(df_path)
print(df[df['split'] == 'test'].head())
```

To run the evaluation, use the following script:

```shell
uv run scripts/run_evaluation.py
```

This loads the test split, runs inference with the GeoGraph model, and saves the resulting figure to the figures directory.

If you wish to download the full Human-IDRome dataset and compute the geometric features from scratch, you can use the scripts scripts/download_dataset.py and scripts/compute_geometric_features.py.
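As background for the two Flory parameters, $\nu$ and $A_0$ are commonly obtained by fitting the scaling law $R_{ij} \approx A_0 |i-j|^{\nu}$ to ensemble-averaged inter-residue distances. The following is a minimal sketch of such a fit using least squares in log-log space. It illustrates the general approach only and is not the logic of `scripts/compute_geometric_features.py`; the function name `fit_flory_scaling`, the `min_separation` cutoff, and the synthetic input values are all assumptions made for this example.

```python
import math


def fit_flory_scaling(separations, distances, min_separation=1):
    """Fit distances ~ A0 * separations**nu by linear regression in log-log space.

    separations: sequence separations |i - j| (positive integers).
    distances: ensemble-averaged inter-residue distances at those separations.
    min_separation: hypothetical cutoff to exclude short-range pairs from the fit.
    Returns (nu, a0).
    """
    points = [
        (math.log(s), math.log(d))
        for s, d in zip(separations, distances)
        if s >= min_separation and d > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two valid points to fit a line")

    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    # Slope of log(R) vs log(|i - j|) is nu; the intercept is log(A0).
    nu = sxy / sxx
    a0 = math.exp(mean_y - nu * mean_x)
    return nu, a0


# Synthetic data following an exact power law (illustrative values only).
separations = list(range(1, 50))
distances = [0.55 * s**0.5 for s in separations]
nu, a0 = fit_flory_scaling(separations, distances, min_separation=5)
```

On noise-free power-law input such as the synthetic data above, the fit recovers the exponent and prefactor used to generate it. On real ensembles the result depends on which distance average is used and on the range of separations included, so values from this sketch should not be expected to match the precomputed targets exactly.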
In addition to the GeoGraph model, our work included the curation of IDP-Euka-90, a large-scale dataset of 30 million IDP sequences, and the fine-tuning of two ESM-2 models (8M and 150M) on this data.
You can access these resources on Hugging Face:
- IDP-Euka-90 dataset: jeanq1/IDP-Euka-90
- IDP-ESM2-8M model: jeanq1/IDP-ESM2-8M
- IDP-ESM2-150M model: jeanq1/IDP-ESM2-150M
If you use GeoGraph in your work, please cite it using the following BibTeX entry:

```bibtex
@article{quinn2025geograph,
  title={GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins},
  author={Quinn, Eoin and Carobene, Marco and Quentin, Jean and Boyer, Sebastien and Arbes{\'u}, Miguel and Bent, Oliver},
  journal={arXiv preprint arXiv:2510.00774},
  year={2025}
}
```
