
instadeepai/GeoGraph


GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins

Visual summary

This repository contains the official code for the paper: "GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins" (arXiv:2510.00774).

📖 About

GeoGraph is a lightweight, simulation-informed transformer model that predicts aggregate geometric properties of intrinsically disordered protein (IDP) conformational ensembles directly from their amino acid sequence.

Given a protein sequence, GeoGraph infers the ensemble-averaged values for:

  • End-to-end distance ($R_e$)
  • Radius of gyration ($R_g$)
  • Asphericity ($\Delta$)
  • Flory scaling exponent ($\nu$)
  • Flory scaling prefactor ($A_0$)
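For intuition, the per-conformation quantities behind the first three descriptors can be sketched with NumPy. This is a minimal sketch that assumes the common gyration-tensor definitions and equal bead masses; the function name `conformation_descriptors` is hypothetical and not part of the GeoGraph package, and the exact definitions and ensemble-averaging conventions used in the paper may differ.

```python
import numpy as np


def conformation_descriptors(coords: np.ndarray) -> dict:
    """Geometric descriptors of one conformation given (N, 3) bead coordinates."""
    coords = np.asarray(coords, dtype=float)
    # End-to-end distance: separation between the first and last residue.
    r_e = float(np.linalg.norm(coords[-1] - coords[0]))
    # Gyration tensor: covariance of positions about the centroid
    # (equal masses assumed).
    centered = coords - coords.mean(axis=0)
    gyration_tensor = centered.T @ centered / len(coords)
    eigvals = np.linalg.eigvalsh(gyration_tensor)
    # R_g^2 equals the trace of the gyration tensor.
    r_g = float(np.sqrt(eigvals.sum()))
    # Asphericity from the eigenvalues: 0 for a sphere, 1 for a rod.
    l1, l2, l3 = eigvals
    delta = float(1.0 - 3.0 * (l1 * l2 + l2 * l3 + l3 * l1) / eigvals.sum() ** 2)
    return {"R_e": r_e, "R_g": r_g, "asphericity": delta}


# Toy conformation: three beads on a straight line.
rod = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
descriptors = conformation_descriptors(rod)
```

All distances come out in the units of the input coordinates. An ensemble-level value would then be obtained by averaging over conformations, under whichever averaging convention is adopted.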

🔧 Installation

We use uv for environment management (for installation see the official documentation).

Once uv is installed, run uv sync from the project root to set up your local virtual environment.

⚡ Quickstart

Loading the Model

First, download the pre-trained model checkpoint from Hugging Face Hub:

import torch
from geograph.model.geograph import GeoGraph
from huggingface_hub import hf_hub_download

# Download the checkpoint
ckpt_path = hf_hub_download(
    repo_id="jeanq1/GeoGraph", 
    filename="model.ckpt", 
    local_dir="./storage"
)

# Load hyperparameters and state dict
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model = GeoGraph(**ckpt['hyper_parameters'])
model.load_state_dict(ckpt['state_dict'])
model.eval()

Inferring Geometric Properties

You can get predictions for a list of sequences as follows:

sequences = [
    "MDDNHYPHHHHNHHNHHSTSGGCGESQFTTKLSVNTFARTHPMIQNDLIDLDLISGSAFTMKSKSQQ",
    "PADRDLSSPFGSTVPGVGPNAAAASNAAAAAAAAATAGSNKHQTPPTTFR",
]

# Tokenize inputs and run inference
inputs = model.tokenize(sequences)
with torch.no_grad():
    outputs = model(**inputs)
    
geometric_features = outputs['geometric_features']

Extracting Embeddings

To get the final-layer residue-level embeddings, set the return_embeddings=True flag.

with torch.no_grad():
    outputs = model(
        return_embeddings=True, 
        return_geometric_features=False, 
        **inputs
    )
    
embeddings = outputs['embeddings']

📊 Evaluation on the Human-IDRome Dataset

Dataset

GeoGraph was trained and evaluated on the Human-IDRome dataset, which contains 28,058 IDP ensembles generated with the CALVADOS-2 coarse-grained force field.

We provide our 80/10/10 sequence-similarity-based splits and the precomputed target features. You can download this metadata from Hugging Face:

import pandas as pd
from huggingface_hub import hf_hub_download

df_path = hf_hub_download(
    repo_id="jeanq1/GeoGraph", 
    filename="splits_and_features.csv", 
    local_dir="./storage"
)
df = pd.read_csv(df_path)

print(df[df['split'] == 'test'].head())
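A quick sanity check on the split sizes might look like the sketch below. It uses a toy DataFrame in place of the downloaded file so that it runs on its own; only the `split` column and the `test` label appear in the snippet above, so the `train` and `val` labels here are assumptions.

```python
import pandas as pd

# Toy stand-in for splits_and_features.csv.
# The "train" and "val" labels are assumed, not taken from the real file.
toy = pd.DataFrame({"split": ["train"] * 8 + ["val"] + ["test"]})

# Fraction of rows assigned to each split.
split_fractions = toy["split"].value_counts(normalize=True)
print(split_fractions)
```

Applying the same `value_counts(normalize=True)` call to `df["split"]` should show proportions close to 80/10/10.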

Reproducing Results

To reproduce the evaluation, run the following script:

uv run scripts/run_evaluation.py

This loads the test split, runs inference with the GeoGraph model, and saves the evaluation results figure to the figures directory.

Building the Dataset (Optional)

If you wish to download the full Human-IDRome dataset and compute the geometric features from scratch, you can use the scripts scripts/download_dataset.py and scripts/compute_geometric_features.py.

📦 Additional Contributions

In addition to the GeoGraph model, our work included the curation of IDP-Euka-90, a large-scale dataset of 30 million IDP sequences, and the fine-tuning of two ESM-2 models (8M and 150M) on this data.

You can access these resources on Hugging Face:

🤝 Cite our work

If you use GeoGraph in your work, please cite us using the following BibTeX entry:

@article{quinn2025geograph,
  title={GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins},
  author={Quinn, Eoin and Carobene, Marco and Quentin, Jean and Boyer, Sebastien and Arbes{\'u}, Miguel and Bent, Oliver},
  journal={arXiv preprint arXiv:2510.00774},
  year={2025}
}
