Are there reliable benchmarks showing Graphify improves coding agent performance on large repos? #1328

real-worlds · 2026-06-15T23:28:01Z

Without task-success benchmarks, it is hard to tell whether Graphify is merely a useful visualization/context-compression tool or something that actually improves coding agent capability.

FolatheDuckofDuckingburg · 2026-07-04T17:10:11Z

I've created a reproducible benchmark framework to measure whether Graphify improves agent performance: https://github.com/FolatheDuckofDuckingburg/graphify/tree/v8/benchmarks

This includes:

16 concrete benchmark tasks (bug fixes, features, refactoring, architecture Q&A)
Paired comparative trial design (with/without Graphify)
Statistical rigor (McNemar's test, effect sizes, 95% CI)
Task evaluator and runner scripts
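
For readers unfamiliar with the paired design, here is a minimal sketch of an exact McNemar test on with/without outcomes. It is not taken from the linked framework: the function name `mcnemar_exact` and the pass/fail lists in the usage note are illustrative assumptions, and the framework's own scripts may compute this differently.

```python
from math import comb


def mcnemar_exact(with_tool, without_tool):
    """Two-sided exact McNemar test on paired pass/fail outcomes.

    with_tool / without_tool: equal-length lists of booleans, one pair per task.
    Returns (b, c, p_value) where b counts tasks passed only with the tool
    and c counts tasks passed only without it.
    """
    if len(with_tool) != len(without_tool):
        raise ValueError("outcomes must be paired task by task")

    b = sum(1 for w, wo in zip(with_tool, without_tool) if w and not wo)
    c = sum(1 for w, wo in zip(with_tool, without_tool) if wo and not w)
    n = b + c
    if n == 0:
        # No discordant pairs: the data carry no evidence either way.
        return b, c, 1.0

    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

With 16 hypothetical tasks where 6 pass in both arms, 6 pass only with the tool and 4 fail in both, `mcnemar_exact([True] * 12 + [False] * 4, [True] * 6 + [False] * 10)` gives b=6, c=0 and p = 2/64. Only the discordant pairs matter, which is why a 16-task suite can still reach significance but needs a fairly one-sided split to do so.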

Ready to run the first benchmarks to answer your question!

TPAteeq · 2026-07-04T18:32:17Z

@real-worlds: this is exactly the question we've been benchmarking for the last few weeks, and @FolatheDuckofDuckingburg's framework lands at a perfect time (more on that below). Short answer: yes. Measured with a real agent loop on a large production repo, graphify improves answer accuracy by about 11 points over a grep/read agent (70.8% → 82.0%) at essentially zero added cost per task.

Setup. We ran a fixed coding agent (Claude Opus 4.8, ≤14 turns, real API token usage measured from usage objects — not estimates) against ERPNext (a large production ERP codebase). Every configuration gets the same floor tools (grep + read_file + list_dir); each treatment adds exactly one code-intelligence tool. Answers are graded against pre-authored gold key-facts on hard cross-file questions. Same agent, same turn budget, same grader — the only variable is the tool.

Results (agent capability, ERPNext):

configuration	answer accuracy	tokens vs floor	$/task	m$ per coverage-pt
+ graphify	82.0%	1.3×	$0.320	3.91
+ codebase_memory (code-graph binary + embeddings)	80.5%	2.7×	$0.599	7.44
+ repomix (whole-repo packing, every turn)	80.5%	30.0×	$3.936	48.86
+ codegraphcontext (FalkorDB code graph)	79.2%	2.2×	$0.391	4.94
+ claude_context (semantic search, Ollama + Milvus)	79.1%	1.4×	$0.340	4.29
grep/read floor (no tool)	70.8%	1.0×	$0.322	4.54

A few things worth calling out:

+11.2 points of accuracy over the floor at the same dollar cost per task ($0.320 vs $0.322) — the graph's overhead is offset by the agent finishing in fewer turns (14.3 → 11.8). The agent doesn't just answer better; it stops wandering.
Against the "just pack the whole repo into context" approach: graphify scores higher than repomix at 1/23rd the tokens and 1/12th the cost per task. On large repos, whole-repo packing is a token trap; a queryable graph is not.
Among the graph/context-index tools we tested, graphify came out on top on both accuracy and cost — and it's the only one whose $/task is indistinguishable from bare grep. It's also the only one in that group needing no external service (codegraphcontext needs FalkorDB; claude_context needs Milvus + Ollama running) and the only one with a fully deterministic, LLM-free build: $0 to index, reproducible byte-for-byte. A couple of narrow single-purpose tools (structural pattern search, LSP symbol navigation) trade in a different lane — full 11-tool table in the report for anyone who wants the complete picture.
Token economics is the real differentiator at scale. Across the 11 tools we tested, accuracy converges once an agent can iterate — but token cost varies 30× between tools. That's the axis that decides what's actually usable on a big codebase.
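
The derived figures quoted above (the +11.2 points, 1/23rd of the tokens, 1/12th of the cost) can be rechecked from the table with a few lines. A small sketch follows; the reading of the last column as $/task divided by accuracy, expressed in milli-dollars, is an assumption on my part, though it reproduces the published column to within rounding.

```python
# (accuracy %, tokens vs floor, $ per task), copied from the results table.
RESULTS = {
    "graphify": (82.0, 1.3, 0.320),
    "codebase_memory": (80.5, 2.7, 0.599),
    "repomix": (80.5, 30.0, 3.936),
    "codegraphcontext": (79.2, 2.2, 0.391),
    "claude_context": (79.1, 1.4, 0.340),
    "floor": (70.8, 1.0, 0.322),
}

ACCURACY, TOKENS, COST = 0, 1, 2  # column indices into each tuple


def accuracy_gain(name, baseline="floor"):
    """Accuracy difference in percentage points against a baseline row."""
    return RESULTS[name][ACCURACY] - RESULTS[baseline][ACCURACY]


def ratio(a, b, column):
    """How many times larger row a is than row b in the given column."""
    return RESULTS[a][column] / RESULTS[b][column]


def millidollars_per_point(name):
    """Assumed definition of the last column: 1000 * ($/task) / accuracy."""
    acc, _, cost = RESULTS[name]
    return 1000 * cost / acc


if __name__ == "__main__":
    print(f"gain over floor: {accuracy_gain('graphify'):.1f} points")
    print(f"repomix tokens vs graphify: {ratio('repomix', 'graphify', TOKENS):.1f}x")
    print(f"repomix cost vs graphify: {ratio('repomix', 'graphify', COST):.1f}x")
    for name in RESULTS:
        print(f"{name}: {millidollars_per_point(name):.2f} m$ per point")
```

The small mismatches against the published last column (a few hundredths) are what you would expect if that column was computed from unrounded accuracy and cost values.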

On the "large repos" part specifically — graphify's deterministic AST build scales without an LLM in the loop: kafka (126k nodes / 463k edges) builds in ~3.5 min, moodle (472k nodes) in ~7 min, at $0. We also ran a 15-year longitudinal sweep of ERPNext itself (689 weekly checkpoints, 2011→2026): graph quality improves monotonically as the repo grows — call-edge density more than doubles and orphan nodes drop from 29% to under 7% — so the graph gets more useful precisely as the codebase gets harder to hold in your head.
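
For anyone who wants to track the same two graph-quality signals on their own repo, here is a rough sketch. The definitions are assumptions, since the post does not spell them out: an orphan is taken to be a node with no incident edge, and density is taken to be edges per node. The function name `graph_health` is hypothetical and not part of graphify.

```python
def graph_health(nodes, edges):
    """Compute two simple quality signals for a code graph.

    nodes: iterable of hashable node ids.
    edges: iterable of (src, dst) pairs, e.g. call edges.
    Assumed definitions: orphan = node touched by no edge,
    edges_per_node = edge count divided by node count.
    """
    node_set = set(nodes)
    touched = set()
    edge_count = 0
    for src, dst in edges:
        touched.add(src)
        touched.add(dst)
        edge_count += 1

    if not node_set:
        return {"orphan_fraction": 0.0, "edges_per_node": 0.0}

    orphans = node_set - touched
    return {
        "orphan_fraction": len(orphans) / len(node_set),
        "edges_per_node": edge_count / len(node_set),
    }
```

On a toy graph with nodes `a, b, c, d` and edges `a→b`, `b→c`, this reports an orphan fraction of 0.25 and 0.5 edges per node. Running it at each checkpoint of a repo's history would give the kind of trend line described above.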

@FolatheDuckofDuckingburg — your framework is the perfect complement to this: you're measuring end-to-end task success (paired trials + McNemar's is the right rigor for it), while ours measures answer accuracy + token economics under a fixed agent. Together they cover both halves of the OP's question. Happy to contribute our gold-fact query methodology, and I'd be glad to help run the first paired trials on your 16 tasks — between the two harnesses we'd have capability and cost covered, reproducibly.

Happy to share details and discuss more in this thread.

TPAteeq · 2026-07-04T19:01:42Z

Follow-up with a second set of results, since "does the graph actually help an agent?" has a sibling question: can graphify's architecture serve as a conversational long-term memory — the mem0 / supermemory problem — rather than just a code index? We benchmarked that too.

Setup. Two datasets, one identical harness so nothing hides in methodology differences:

LongMemEval-S (n=50 stratified) and LOCOMO (n=300 stratified across all 10 conversations, 75 questions each of multi-hop / temporal / open-domain / single-hop; adversarial category excluded for every system).
Same everything for every system: same reader LLM answering from each system's retrieved context (Kimi K2.6), same judge (we blind-validated it against an independent second judge — 90.6% agreement, κ=0.81), same local embedder where the system allows one (BGE-m3), same fixed top-10 retrieval budget, ingest once per conversation. The only variable is the memory system.
The graphify configuration is an experimental engine implementing graphify's retrieval architecture over conversation turns — turn-level nodes embedded into SurrealDB (HNSW) with hybrid dense+lexical seeds and graph expansion. Flagging honestly: this is a benchmark prototype of the architecture, not the shipped package as-is.
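
On the judge-validation numbers: raw agreement and Cohen's κ answer different questions, since κ discounts the agreement two judges would reach by chance. A minimal sketch of the computation is below, with toy verdict lists that are purely illustrative and not the benchmark's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Return (observed_agreement, kappa) for two judges' labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both judges pick the same label
    # if each labelled independently at their own marginal rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return observed, 1.0  # both judges used a single identical label
    return observed, (observed - expected) / (1 - expected)


def implied_chance_agreement(observed, kappa):
    """Solve kappa = (po - pe) / (1 - pe) for pe."""
    return (observed - kappa) / (1 - kappa)
```

For example, `cohen_kappa([1, 1, 1, 1, 0, 0, 0, 0, 1, 0], [1, 1, 1, 1, 0, 0, 0, 0, 0, 1])` yields 80% raw agreement and κ of 0.6. Plugging the reported pair into `implied_chance_agreement(0.906, 0.81)` gives roughly 0.51, which is what you would see with two judges handing out correct/incorrect verdicts in roughly equal proportion.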

Results (LOCOMO, n=300, identical Kimi-judged harness — selected rows, full 7-system table in the report):

system	QA overall	turn recall@5	NDCG@10	ingest cost
supermemory (self-hosted)	0.50	0.14*	0.14*	~$16
graphify + SurrealDB (experimental)	0.43	0.42	0.36	$0
plain dense RAG (BGE-m3)	0.41	0.36	0.30	$0
BM25	0.31	0.28	0.24	$0
mem0 (OSS, same LLM/embedder)	0.27	0.04	0.03	~$2

* supermemory ships its own fixed internal embedder, so its retrieval numbers aren't directly comparable — QA is the cleaner axis.
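
For reference, the two retrieval columns can be computed per question as in the sketch below and then averaged. It assumes binary relevance (a retrieved turn is either a gold evidence turn or not) and unique turn ids in the ranking; the harness may use graded relevance or a different normalisation, so treat this as one common definition rather than the one used for the table.

```python
from math import log2


def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold turns that appear in the top-k retrieved turns."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(ranked_ids[:k])) / len(gold)


def ndcg_at_k(ranked_ids, gold_ids, k):
    """Binary-relevance NDCG: discounted gain over the best achievable gain."""
    gold = set(gold_ids)
    # A hit at 0-based rank r contributes 1 / log2(r + 2).
    dcg = sum(
        1 / log2(rank + 2)
        for rank, turn_id in enumerate(ranked_ids[:k])
        if turn_id in gold
    )
    # Ideal ranking puts every gold turn first, capped at k slots.
    ideal = sum(1 / log2(rank + 2) for rank in range(min(k, len(gold))))
    return dcg / ideal if ideal else 0.0
```

With a ranking of `["t7", "t2", "t9", "t4"]` and gold turns `{"t2", "t4", "t5"}`, recall@5 is 2/3 because `t5` was never retrieved, and NDCG@10 is lower still (about 0.5) because the two hits sit at ranks 2 and 4 rather than at the top.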

What this says:

graphify's engine beat mem0 decisively (0.43 vs 0.27 QA) — with zero ingest cost against mem0's per-session LLM extraction, and exact turn-level provenance (every hit maps to a real conversation turn; mem0's consolidated memories mostly can't be traced back). Fairness note: mem0's own published LOCOMO number (92.5%) is measured at a top-200 retrieval budget with a GPT-4o answerer/judge — a different measurement. Under one identical harness at an identical top-10 budget, the table above is the comparison.
On LongMemEval, graphify's engine posted best-tier session recall (0.92) at $0 ingest, with QA in a dead heat across all systems — dedicated LLM-extraction systems bought no answer-quality advantage there.
supermemory won LOCOMO overall, and the entire gap is one category. Outside multi-hop reasoning, graphify's engine led on temporal (0.69 vs 0.68) and open-domain (0.32 vs 0.29) and trailed on single-hop (0.55 vs 0.56); all three differences are within noise at n=75/category, so they are effectively ties. supermemory's lead comes from multi-hop, where its LLM-distilled memory representation genuinely shines. It pays for it: ~$16 ingest per benchmark corpus through a closed pipeline, versus a deterministic, fully inspectable $0 build.
Early experiment on exactly that gap: we prototyped a consolidation pass on the graphify architecture (distill each session into entity-centric facts, deterministic cross-session entity merge) and re-ran the identical benchmark — it scored QA 0.52 / multi-hop 0.47, edging supermemory (0.50 / 0.45) at ~1/20th of its ingest cost, with turn-level provenance preserved. Experimental, clearly — but it suggests the multi-hop gap is a representation choice, not an architectural ceiling.
