This repository contains the official implementation for the ICLR 2025 paper:

**GraphArena: Evaluating and Exploring Large Language Models on Graph Computation**

Jianheng Tang, Qifan Zhang, Yuhan Li, Nuo Chen, Jia Li. ICLR 2025.
```bash
conda create -n GraphArena
conda activate GraphArena
conda install openai pandas numpy networkx pip
pip install pybind11
pip install rdkit ogb graph-walker
```
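A quick sanity check that the environment resolved correctly (not part of the repository, just a convenience snippet):

```python
# Confirm the core dependencies import and report their versions.
import networkx
import openai
import pandas
import rdkit

print("networkx", networkx.__version__)
print("openai", openai.__version__)
print("pandas", pandas.__version__)
print("rdkit", rdkit.__version__)
```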
Download and unzip `dataset.zip` from the Google Drive link; it contains the processed dataset. To build the dataset from scratch, download `source.zip` from the same link and run `bash utils/build_dataset.sh`.
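To see what the archive actually unpacked before running anything, a small listing script helps (the `dataset/` extraction path below is an assumption about where you unzipped the archive):

```python
from pathlib import Path

# List every file in the unpacked archive with its size;
# "dataset" is the assumed extraction directory for dataset.zip.
for path in sorted(Path("dataset").rglob("*")):
    if path.is_file():
        print(path, f"{path.stat().st_size / 1024:.1f} KiB")
```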
Replace `YOUR_API_KEY` in `benchmark_LLM_API.py` with your own API key.
```bash
python benchmark_LLM_API.py \
    --llm {model} \
    --task {task_name} \
    --problem_num {N} \
    --example_num {K} \
    --difficulty {easy|hard} \
    --results ./results \
    --sleep 5 \
    --resume
```
Key Parameters:

- `--llm`: model shortname (e.g., `gpt4`, `claude`, `llama8b`)
- `--task`: one of the 10 graph tasks (e.g., `TSP`, `MVC`, `Diameter`)
- `--difficulty`: `easy` (small graphs) or `hard` (large graphs)
- `--problem_num`: number of problems to evaluate (default: 500)
- `--example_num`: number of demonstrated examples (default: 1)
- `--sleep`: cooldown between API calls in seconds (default: 5)
- `--resume`: resume from the last evaluation
Details about the command-line arguments are available in both `benchmark_LLM_API.py` and `utils/run_benchmark.sh`.
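For orientation, the request/cooldown pattern these flags control can be sketched as follows. This is a minimal illustration using the official `openai` client; the function name `query_llm`, the model string, and the single-shot flow are assumptions, not the script's actual code:

```python
import time
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")  # the placeholder to replace in benchmark_LLM_API.py

def query_llm(prompt: str, model: str = "gpt-4o-2024-08-06", sleep: float = 5.0) -> str:
    """Send one problem prompt to the API, then cool down (mirrors --sleep)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    time.sleep(sleep)  # cooldown between consecutive API calls
    return response.choices[0].message.content
```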
To evaluate LLMs locally, use:

```bash
python benchmark_LLM_local.py --llm llama8b
```
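As a rough sketch of what local evaluation involves (illustrative only: the checkpoint ID and generation settings below are assumptions, and the script's actual inference backend may differ):

```python
from transformers import pipeline

# Illustrative local inference. "meta-llama/Meta-Llama-3-8B-Instruct" is an
# assumed checkpoint for the llama8b shortname; swap in whatever the script maps it to.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
)

def query_local(prompt: str, max_new_tokens: int = 512) -> str:
    messages = [{"role": "user", "content": prompt}]
    output = generator(messages, max_new_tokens=max_new_tokens)
    return output[0]["generated_text"][-1]["content"]  # the assistant's reply
```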
Evaluated LLMs and their accuracy scores:
| LLM Shortname | Tested Version & Date | P (small) | P (large) | NP (small) | NP (large) | Average |
|---|---|---|---|---|---|---|
| dsr1 | deepseek-R1 (2025-02-15) | 0.976 | 0.877 | 0.877 | 0.431 | 0.795 |
| claude | claude-3.5-sonnet-20241022 | 0.822 | 0.587 | 0.478 | 0.072 | 0.495 |
| doubao | doubao-1.5-pro (2025-02-15) | 0.792 | 0.532 | 0.467 | 0.052 | 0.461 |
| gpt4 | gpt-4o-2024-08-06 | 0.769 | 0.435 | 0.473 | 0.063 | 0.435 |
| glm | glm-4-plus (2024-09-30) | 0.727 | 0.457 | 0.413 | 0.048 | 0.411 |
| gpt4mini | gpt-4o-mini-2024-07-18 | 0.689 | 0.366 | 0.392 | 0.033 | 0.370 |
| llama | meta-llama/Llama-3-70b-chat-hf (2024-05-30) | 0.612 | 0.316 | 0.368 | 0.047 | 0.336 |
| qwen72b | qwen2.5-72B-Instruct (2024-09-30) | 0.590 | 0.399 | 0.206 | 0.007 | 0.290 |
| deepseek | deepseek-V2.5 (2024-09-30) | 0.514 | 0.247 | 0.337 | 0.031 | 0.282 |
| llama8b | meta-llama/Llama-3-8b-chat-hf (2024-05-30) | 0.285 | 0.094 | 0.202 | 0.019 | 0.150 |
| gemma | google/gemma-1.1-7b-it (2024-05-30) | 0.252 | 0.092 | 0.129 | 0.009 | 0.120 |
For detailed metrics and analysis, see our paper and the notebooks in `reproduce/`.
To reproduce the results from our manuscript, follow these steps:

1. Download and unzip `results.zip` from the Google Drive link.
2. Run all Jupyter notebooks in the `reproduce/` folder (a non-interactive way to do this is sketched below).

Note: the evaluation may take a few minutes to complete.
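If you prefer not to open each notebook by hand, `jupyter nbconvert` can execute them in batch (assuming Jupyter is installed in the environment; this helper is a convenience, not part of the repository):

```python
import subprocess
from pathlib import Path

# Execute each reproduce notebook non-interactively, writing outputs in place.
for nb in sorted(Path("reproduce").glob("*.ipynb")):
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", str(nb)],
        check=True,
    )
```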
@inproceedings{tang2025grapharena,
title={GraphArena: Evaluating and Exploring Large Language Models on Graph Computation},
author={Tang, Jianheng and Zhang, Qifan and Li, Yuhan and Chen, Nuo and Li, Jia},
booktitle={International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=Y1r9yCMzeA}
}