Are there reliable benchmarks showing Graphify improves coding agent performance on large repos? #1328
Replies: 3 comments
I've created a reproducible benchmark framework to measure whether Graphify improves agent performance: https://github.com/FolatheDuckofDuckingburg/graphify/tree/v8/benchmarks

This includes 16 concrete benchmark tasks (bug fixes, features, refactoring, architecture Q&A).

Ready to run the first benchmarks to answer your question!
@real-worlds: this is exactly the question we've been benchmarking for the last few weeks, and @FolatheDuckofDuckingburg's framework lands at a perfect time (more on that below).

Short answer: yes. Measured agentically on a large production repo, graphify improves agent accuracy by 11 points over a grep/read agent at essentially zero added cost per task.

Setup. We ran a fixed coding agent (Claude Opus 4.8, ≤14 turns, real API token usage measured from

Results (agent capability, ERPNext):
A few things worth calling out:
On the "large repos" part specifically: graphify's deterministic AST build scales without an LLM in the loop. kafka (126k nodes / 463k edges) builds in ~3.5 min and moodle (472k nodes) in ~7 min, at $0. We also ran a 15-year longitudinal sweep of ERPNext itself (689 weekly checkpoints, 2011→2026): graph quality improves monotonically as the repo grows. Call-edge density more than doubles and orphan nodes drop from 29% to under 7%, so the graph gets more useful precisely as the codebase gets harder to hold in your head.

@FolatheDuckofDuckingburg: your framework is the perfect complement to this. You're measuring end-to-end task success (paired trials + McNemar's is the right rigor for it), while ours measures answer accuracy + token economics under a fixed agent. Together they cover both halves of the OP's question. Happy to contribute our gold-fact query methodology, and I'd be glad to help run the first paired trials on your 16 tasks. Between the two harnesses we'd have capability and cost covered, reproducibly.

Happy to share details and discuss more in this thread.
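For anyone who wants to sanity-check paired trials like the ones mentioned above, here is a minimal sketch of an exact (binomial) McNemar test in Python. It is not taken from either harness: the function name `mcnemar_exact` and the pass/fail lists are made up for illustration, and it assumes each task is run once with the baseline agent and once with the graph-assisted agent.

```python
from math import comb


def mcnemar_exact(baseline, treatment):
    """Two-sided exact McNemar test on paired pass/fail outcomes.

    baseline[i] and treatment[i] are the outcomes of the same task
    under the two agents. Returns (b, c, p_value), where b counts tasks
    only the baseline solved and c counts tasks only the treatment solved.
    """
    if len(baseline) != len(treatment):
        raise ValueError("outcome lists must be paired (same length)")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(baseline, treatment) if x and not y)
    c = sum(1 for x, y in zip(baseline, treatment) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence either way

    # Under H0 each discordant pair is a fair coin flip, so the smaller
    # count follows Binomial(n, 0.5). Double the lower tail, cap at 1.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes for 8 tasks (illustrative, not measured data).
baseline_pass = [True, False, False, False, True, False, False, True]
graph_pass = [True, True, True, True, True, True, False, False]

b, c, p = mcnemar_exact(baseline_pass, graph_pass)
print(b, c, p)  # 1 4 0.375
```

Note that with only a handful of discordant pairs the test has very little power: in this toy data the graph-assisted agent wins 4 of 5 disagreements and the p-value is still 0.375, so a 16-task suite would likely need repeated trials per task to say anything firm.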
Follow-up with a second set of results, since "does the graph actually help an agent?" has a sibling question: can graphify's architecture serve as a conversational long-term memory (the mem0 / supermemory problem) rather than just a code index? We benchmarked that too.

Setup. Two datasets, one identical harness so nothing hides in methodology differences:
Results (LOCOMO, n=300, identical Kimi-judged harness; selected rows, full 7-system table in the report):

* supermemory ships its own fixed internal embedder, so its retrieval numbers aren't directly comparable; QA is the cleaner axis.

What this says:
Without task-success benchmarks, it is hard to tell whether Graphify is just a useful visualization/context-compression tool or something that actually improves coding agent capability.