docs(experiments): does a knowledge graph beat plain docs for AI-assisted planning? #19
Open
nokternol wants to merge 2 commits into
…update Adds the full writeup for a three-trial comparison of AI-assisted implementation planning with graphify (knowledge-graph tool) versus plain markdown docs plus ordinary grep/read exploration, run against this repo's rule-vocabulary and identity-resolution subsystems. Updates the plan-with-graph skill to capture query corrections as durable lessons for future retrieval. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
The report claimed docs-only was "cheaper on every raw metric" and that plan quality was "equal-or-better" for docs-only in both trials. Both claims were contradicted by the report's own data: the graph-assisted agent made fewer tool calls in Trial 2 (21 vs 22), and the graph-assisted plan was the only one whose actual deliverable included a required doc fix the docs-only plan read, understood, and then silently dropped. Trial 2 is corrected from "docs-only wins" to "close and mixed" — each method had genuine, distinct strengths and the judge's original scores overstated the gap in one direction. Trial 1 remains a decisive result; it should not have been generalized onto Trial 2. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
What this is
My boss suggested that good, consistent markdown documentation is just as good as the graphify
knowledge-graph setup in this repo for AI-assisted planning. This PR is the writeup of testing that claim
directly, on this codebase, across three escalating trials — and it does not land where I expected.
Bottom line: Trial 1 was decisive against the graph — its imprecise concept-matching actively misdirected
the agent into a materially worse plan. Trial 2, deliberately chosen around one of the graph's own
cleanest structures to give it its best shot, is not a repeat of that result — it came out close and
mixed, not a second docs-only win (an earlier version of this PR overstated it as one; see
report.md for the correction, made after review). Trial 3 then locates the actual bugs behind Trial 1's failure,
patches them, and re-runs the failing case to confirm the diagnosis; the upstream PR against
graphify#1445 contains the validated fix.
Files in this PR
- report.md: the top-level writeup, covering methodology, Trial 1 and Trial 2 results, raw cost metrics,
  judge scores (with one judge scoring error caught and corrected on record), and the overall cross-trial
  conclusion.
- task-brief.md: Trial 1's task, which is to add a multi-select filter-rule type. Chosen blind, with no
  attempt to favor either method, but it turned out to contain a real trap (see below).
- plan-A-graph-assisted.md / plan-B-docs-only.md: Trial 1's two anonymized worker plans, judged blind.
- trial-1-investigation-followup.md: direct re-verification of the graph-assisted worker's actual query
  output (not just its self-report), showing graphify explain "genres" resolving to an unrelated Storybook
  string literal and graphify path returning a graph-theoretically-valid but architecturally-meaningless
  6-hop route.
- trial-2-identity-resolution/task-brief.md: Trial 2's task, this time deliberately chosen around one of
  the graph's own cleanest, highest-degree nodes (mediaIdentity, IdentityResolutionJob), to give the tool
  its best plausible shot.
- trial-2-identity-resolution/plan-C-graph-assisted.md / plan-D-docs-only.md: Trial 2's two anonymized
  plans.
- trial-2-investigation-followup.md: re-verification of Trial 2's queries (including a query whose entity
  extraction collapsed to ['canonical', 'canonical', 'canonical'] and returned unrelated design-token
  JSON), plus a documented and corrected judge grading error (the judge credited a plan with a fix it
  never actually included; this was verified directly against the plan text, not just re-asserted).
- trial-3-fix-and-retest.md: the follow-through, i.e. upgrading graphify, finding the exact functions
  responsible for both failures, patching them, verifying each patch against the original failing
  queries, performing manual semantic enrichment on the one failure-path relationship that couldn't be
  reached by any algorithmic fix, and re-running the actual worker agent to confirm the plan itself
  improves, not just the raw query output.
- .agent/skills/plan-with-graph/SKILL.md: updated based on what Trial 3 found (see below).
The core finding
Both trials' failures traced to the same root cause, confirmed by reading graphify's own retrieval code
(serve.py): node matching is purely lexical (exact/prefix/substring string comparison against node
labels), with no semantic or embedding layer at all. It resolves cleanly when a query names an exact,
unambiguous symbol (Cradle, IdentityResolutionJob) and fails, sometimes returning confidently wrong
answers rather than an error, whenever a query describes a concept in different words than the code
uses, or collides with a common word used elsewhere as an unrelated identifier. An independent GitHub
issue, still open, reports the identical failure shape on an unrelated codebase.
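To make the failure shape concrete, here is a minimal sketch of a purely lexical matcher. It is not graphify's serve.py code: the tier values, the Node fields, and the sample graph contents are all assumptions made for illustration. It only shows why exact symbols resolve, why a common word can land on an unrelated node, and why a paraphrase finds nothing.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    label: str
    kind: str  # hypothetical field, e.g. "class", "constant", "string_literal"


def lexical_score(query: str, label: str) -> int:
    """Score a label against a query using string comparison only."""
    q, lab = query.lower(), label.lower()
    if q == lab:
        return 3  # exact match
    if lab.startswith(q):
        return 2  # prefix match
    if q in lab:
        return 1  # substring match
    return 0  # no lexical overlap, however related the concepts are


def best_match(query: str, nodes: list[Node]) -> Optional[Node]:
    """Return the highest-scoring node, or None when nothing overlaps."""
    scored = [(lexical_score(query, n.label), n) for n in nodes]
    scored = [(s, n) for s, n in scored if s > 0]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


# Hypothetical graph contents, loosely modeled on the failures described above.
NODES = [
    Node("IdentityResolutionJob", "class"),
    Node("MEDIA_RULES", "constant"),
    Node("genres", "string_literal"),  # stands in for an unrelated fixture string
]

exact = best_match("IdentityResolutionJob", NODES)    # resolves to the class
collision = best_match("genres", NODES)               # resolves to the literal
paraphrase = best_match("filter rule options", NODES)  # resolves to nothing
```

In this sketch the collision case is the dangerous one: the matcher returns a confident answer for "genres" and has no way to know that the node it found is unrelated to the question being asked.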
Trial 3 fixed the mechanical half of this (a scoring-tier collision in seed selection — the upstream PR
above) and confirmed via a live worker re-run that the fix closes a real gap: the graph-assisted agent, which
in Trial 1 mistakenly treated dynamically-sourced provider data as a static option list, now correctly
rules that out and reaches the same correct answer the docs-only agent found. But the fix did not reduce
cost — the corrected worker now cross-verifies every graph-sourced claim against source files rather than
trusting it, which costs more than either blind trust (the original failure) or not using the graph at all
(the docs-only baseline).
What the plan-with-graph skill closes that the upstream PR can't
The upstream PR fixes a mechanical bug: it stops one exact match from statistically drowning out
weaker-but-relevant matches during automated retrieval. It cannot fix the deeper gap Trial 3 also
surfaced: some relationships (e.g., "this rule's values are actually populated by that hook at runtime")
have no import edge in the graph at all, because they're behavioral facts, not structural ones. No amount
of reweighting recovers a connection that was never extracted as an edge.
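A small sketch of why reweighting cannot help here, assuming a graph built only from import edges. The node names and edges are hypothetical, invented for this example rather than taken from the repo's graph. Reachability is decided by which edges exist, so two nodes linked only by runtime behavior stay disconnected until someone adds that edge by hand.

```python
from __future__ import annotations

from collections import deque

# Hypothetical import-edge graph: every name here is invented for illustration.
IMPORT_EDGES: dict[str, list[str]] = {
    "RuleEditor": ["RULE_DEFINITIONS"],
    "RULE_DEFINITIONS": [],
    "ProviderPanel": ["useRuntimeOptions"],
    "useRuntimeOptions": [],
}


def reachable(edges: dict[str, list[str]], start: str, goal: str) -> bool:
    """Breadth-first search, treating every edge as undirected."""
    adjacency: dict[str, set[str]] = {}
    for src, dsts in edges.items():
        adjacency.setdefault(src, set())
        for dst in dsts:
            adjacency[src].add(dst)
            adjacency.setdefault(dst, set()).add(src)

    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# With structural edges only, no path exists, whatever weights are applied.
structural_only = reachable(IMPORT_EDGES, "RULE_DEFINITIONS", "useRuntimeOptions")

# A hand-added behavioral edge ("values populated by this hook at runtime").
enriched = {src: list(dsts) for src, dsts in IMPORT_EDGES.items()}
enriched["RULE_DEFINITIONS"].append("useRuntimeOptions")
with_behavioral_edge = reachable(enriched, "RULE_DEFINITIONS", "useRuntimeOptions")
```

Scoring changes only reorder the candidates a search can already reach. The second call succeeds because the edge set changed, not because anything was reweighted.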
That gap is exactly what a plan-with-graph session is positioned to close, and the skill previously had
no mechanism to do it: it would write design docs for the next AST rebuild to pick up, which only ever
re-derives structural facts, not behavioral ones. The updated skill now wires in
graphify save-result --outcome dead_end|corrected and graphify reflect (shipped upstream in #1441) so that every time a graph query returns
something wrong during a session and a human corrects it, both the failure and the human's own phrasing of
the correct answer are captured as a durable lesson — not just the resolved fact, but the paraphrase
itself, since that paraphrase is exactly the vocabulary the lexical matcher doesn't index. The upstream PR
and the skill update are complementary, not redundant: one fixes how the tool weighs matches it can
already find; the other captures the matches it can never find on its own.
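The idea of keeping the paraphrase, not just the resolved fact, can be sketched as follows. This is not graphify's storage format or the reflect implementation: the Lesson fields, the LessonStore class, the word-length filter, and the example values are all hypothetical. Only the two outcome names mirror the --outcome values mentioned above.

```python
from __future__ import annotations

from dataclasses import dataclass, field

VALID_OUTCOMES = {"dead_end", "corrected"}


def content_words(text: str) -> set[str]:
    """Lowercased words longer than three characters (a crude stopword filter)."""
    return {w for w in text.lower().split() if len(w) > 3}


@dataclass
class Lesson:
    query: str         # what was asked of the graph
    outcome: str       # "dead_end" or "corrected"
    correct_node: str  # the node the human pointed to
    paraphrase: str    # the human's own wording of the right answer


@dataclass
class LessonStore:
    lessons: list[Lesson] = field(default_factory=list)

    def record(self, lesson: Lesson) -> None:
        if lesson.outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {lesson.outcome}")
        self.lessons.append(lesson)

    def recall(self, query: str) -> list[str]:
        """Return nodes whose stored query or paraphrase shares a content word."""
        words = content_words(query)
        hits: list[str] = []
        for lesson in self.lessons:
            vocabulary = content_words(f"{lesson.query} {lesson.paraphrase}")
            if words & vocabulary and lesson.correct_node not in hits:
                hits.append(lesson.correct_node)
        return hits


# Invented example data: one correction captured during a session.
store = LessonStore()
store.record(
    Lesson(
        query="genres",
        outcome="corrected",
        correct_node="useExampleLookups",
        paraphrase="genre options are fetched from the provider at runtime",
    )
)
```

A later question phrased as store.recall("where do provider options come from") would reach the corrected node through the stored paraphrase, even though none of those words appear in the node's label.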
Honest limitations (spelled out in report.md and each investigation followup)
- still a small sample.
- docs-only arm's advantage is partly "this repo's docs happen to be good," not a universal claim.
- corrected on the record, which is a reason for confidence in the diligence, not a claim the process is
  now infallible.
What I'd want to test next
A repo an order of magnitude larger, where a full grep/read sweep genuinely stops being feasible within a
reasonable budget — that's the condition under which precomputed graph structure would need to carry real
weight rather than being a shortcut to somewhere grep would have reached anyway.
Follow-up: Trial 4 — a genuinely fair "everything fixed" test (not yet run)
Trial 3's re-run is not sufficient evidence that graphify now works well in general, and this should not
be read as more conclusive than it is: the semantic enrichment applied there was reactive and specific.
Keywords were hand-added to exactly the two nodes (MEDIA_RULES, useMediaLookups) already known to be
the gap, then the same task was re-run and confirmed fixed. That validates the mechanism (keyword
enrichment can close that class of gap) but is circular as evidence that the tool now generally helps,
since the answer was pre-loaded before the question was asked again.
A fair Trial 4 needs four things, none of which have been done yet:
- targeted fixes), and not hand-picked around a node already known to resolve cleanly (that repeats Trial
  2's "cheating" problem in the other direction). Pick it the same way Trial 1's was picked: from an
  independent architecture survey with no visibility into what the graph can or can't currently resolve.
- nodes a task happens to need, this only means something if there's a general enrichment pass, e.g. an
  LLM-authored keywords field generated for every node across a whole subsystem (not cherry-picked
  nodes), produced before the task is chosen, the way graphify's own extraction pass would need to work
  if this became a real feature.
- LESSONS.md with realistic history, not zero history. The reflect mechanism's whole value
  proposition (upstream issue #1441) is accumulation over many sessions. A single fresh session has an
  empty memory directory, so this variable can't be honestly tested in one shot; it should be named as an
  untested variable rather than faked with a plausible-looking seeded history.
- docs-only, all on the same new task, so the result shows the delta the fixes actually bought, rather
  than just "it succeeds now" in isolation, which proves less.
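The general enrichment pass described above could look roughly like this. It is a sketch under stated assumptions, not a graphify feature: the Node shape, enrich_subsystem, the file paths, and the canned keyword table standing in for an LLM call are all hypothetical. The property it demonstrates is that keywords are authored for every node under a path prefix before any task is known, and that a lexical lookup then sees them.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    label: str
    path: str
    keywords: list[str] = field(default_factory=list)


def enrich_subsystem(
    nodes: list[Node],
    prefix: str,
    author_keywords: Callable[[Node], list[str]],
) -> int:
    """Author keywords for every node under prefix; return how many were touched."""
    count = 0
    for node in nodes:
        if node.path.startswith(prefix):
            node.keywords = author_keywords(node)
            count += 1
    return count


def find(query: str, nodes: list[Node]) -> list[str]:
    """Lexical lookup over labels and keywords."""
    q = query.lower()
    return [
        n.label
        for n in nodes
        if q in n.label.lower() or any(q in k.lower() for k in n.keywords)
    ]


# Stand-in for an LLM call; the canned phrases are invented for this example.
CANNED = {
    "useMediaLookups": ["provider genres", "runtime options"],
    "useTheme": ["dark mode", "color scheme"],
}


def stub_author(node: Node) -> list[str]:
    return CANNED.get(node.label, [])


# Hypothetical file paths under the subsystem directories named in this PR.
nodes = [
    Node("useMediaLookups", "src/hooks/useMediaLookups.ts"),
    Node("useTheme", "src/hooks/useTheme.ts"),
    Node("IdentityResolutionJob", "server/services/identity.ts"),
]

before = find("runtime options", nodes)
enriched_count = enrich_subsystem(nodes, "src/hooks/", stub_author)
after = find("runtime options", nodes)
```

Because the pass walks the whole prefix, useTheme gets keywords even though no task has asked about it, which is the difference between this and the hand-added keywords in Trial 3.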
Concrete plan for whoever picks this up
- from an unexplored part of this codebase: same "small-sounding change, hidden cross-layer trap" shape,
  but not yet examined by any prior trial or worker in this PR's history.
- one whole subsystem (e.g. all of server/services/ or all of src/hooks/) and author a keywords field
  for every node in it, blind to what task will later be asked.
- safishamsi/graphify#1596 (or the upstream merge, if landed by then).
- task, and write it up as trial-4-*.md alongside the existing trial docs in this PR, following the same
  structure (task brief, anonymized plans, raw metrics, judge verdict, investigation-followup verifying
  any surprising query behavior directly rather than trusting self-reports).
Cost is comparable to Trials 1 or 2 individually (one survey agent, one enrichment pass, two worker
agents, one judge pass). Left undone here due to session budget — this section exists so a fresh session
can pick it up without re-deriving the design rationale above.