docs(experiments): does a knowledge graph beat plain docs for AI-assisted planning?#19

Open
nokternol wants to merge 2 commits into main from docs/graphify-experiment-writeup

nokternol commented Jul 1, 2026

Owner

What this is

My boss suggested that good, consistent markdown documentation is just as good as the graphify
knowledge-graph setup in this repo for AI-assisted planning. This PR is the writeup of testing that claim
directly, on this codebase, across three escalating trials — and it does not land where I expected.

Bottom line: Trial 1 was decisive against the graph: its imprecise concept-matching actively misdirected
the agent into a materially worse plan. Trial 2, deliberately chosen around one of the graph's own
cleanest structures to give it its best shot, did not repeat that result. It came out close and mixed,
not a second docs-only win (an earlier version of this PR overstated it as one; see report.md for the
correction, made after review). Trial 3 then locates the actual bugs behind Trial 1's failure, patches
them, and re-runs the failing case to confirm the diagnosis. The upstream PR against graphify#1445
contains the validated fix.

Files in this PR

  • report.md
    — the top-level writeup: methodology, Trial 1 and Trial 2 results, raw cost metrics, judge scores (with
    one judge scoring error caught and corrected on record), and the overall cross-trial conclusion.
  • task-brief.md
    — Trial 1's task: add a multi-select filter-rule type. Chosen blind — no attempt to favor either
    method — but it turned out to contain a real trap (see below).
  • plan-A-graph-assisted.md
    / plan-B-docs-only.md
    — Trial 1's two anonymized worker plans, judged blind.
  • trial-1-investigation-followup.md
    — direct re-verification of the graph-assisted worker's actual query output (not just its self-report),
    showing graphify explain "genres" resolving to an unrelated Storybook string literal and graphify path
    returning a graph-theoretically-valid but architecturally-meaningless 6-hop route.
  • trial-2-identity-resolution/task-brief.md
    — Trial 2's task, this time deliberately chosen around one of the graph's own cleanest, highest-degree
    nodes (mediaIdentity, IdentityResolutionJob), to give the tool its best plausible shot.
  • trial-2-identity-resolution/plan-C-graph-assisted.md
    / plan-D-docs-only.md
    — Trial 2's two anonymized plans.
  • trial-2-investigation-followup.md
    — re-verification of Trial 2's queries (including a query whose entity extraction collapsed to
    ['canonical', 'canonical', 'canonical'] and returned unrelated design-token JSON), plus a documented
    and corrected judge grading error (the judge credited a plan with a fix it never actually included —
    verified directly against the plan text, not just re-asserted).
  • trial-3-fix-and-retest.md
    — the follow-through: upgrading graphify, finding the exact functions responsible for both failures,
    patching them, verifying each patch against the original failing queries, performing manual semantic
    enrichment on the one failure-path relationship that couldn't be reached by any algorithmic fix, and
    re-running the actual worker agent to confirm the plan itself improves, not just the raw query output.
  • .agent/skills/plan-with-graph/SKILL.md
    — updated based on what Trial 3 found (see below).

The core finding

Both trials' failures traced to the same root cause, confirmed by reading graphify's own retrieval code
(serve.py): node matching is purely lexical (exact/prefix/substring string comparison against node
labels), with no semantic or embedding layer at all. It resolves cleanly when a query names an exact,
unambiguous symbol (Cradle, IdentityResolutionJob) and fails — sometimes returning confidently wrong
answers rather than an error — whenever a query describes a concept in different words than the code
uses, or collides with a common word used elsewhere as an unrelated identifier. An independent GitHub
issue, still open, reports the identical failure shape on an unrelated codebase.
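
To make the failure shape concrete, here is a minimal sketch of a purely lexical matcher with exact, prefix, and substring tiers. It is not graphify's serve.py: the function name, the tier order, and the label list are assumptions for illustration, and only the symbol names Cradle, IdentityResolutionJob, MEDIA_RULES, and useMediaLookups appear in this writeup.

```python
from typing import Optional

# Hypothetical node labels. "genres" stands in for an unrelated string
# literal (for example, one that lives in a Storybook story).
NODE_LABELS = [
    "Cradle",
    "IdentityResolutionJob",
    "genres",
    "MEDIA_RULES",
    "useMediaLookups",
]


def lexical_match(query: str, labels: list[str]) -> Optional[tuple[str, str]]:
    """Return (label, tier) for the first exact, prefix, or substring hit.

    Matching is plain case-insensitive string comparison against labels.
    There is no semantic or embedding layer, so a paraphrase of a concept
    finds nothing, and a common word finds whatever happens to share it.
    """
    q = query.lower()
    tiers = (
        ("exact", lambda label: label.lower() == q),
        ("prefix", lambda label: label.lower().startswith(q)),
        ("substring", lambda label: q in label.lower()),
    )
    for tier, test in tiers:
        for label in labels:
            if test(label):
                return label, tier
    return None


# An exact, unambiguous symbol resolves cleanly.
clean = lexical_match("IdentityResolutionJob", NODE_LABELS)
# A common word resolves confidently to the unrelated literal.
collision = lexical_match("genres", NODE_LABELS)
# A concept described in other words than the code uses finds nothing.
paraphrase = lexical_match("options loaded from the provider", NODE_LABELS)
```

The collision case is the dangerous one: the matcher reports a top-tier hit, so nothing downstream signals that the answer is about the wrong thing.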

Trial 3 fixed the mechanical half of this (a scoring-tier collision in seed selection — the upstream PR
above) and confirmed via a live worker re-run that the fix closes a real gap: the graph-assisted agent, which
in Trial 1 mistakenly treated dynamically-sourced provider data as a static option list, now correctly
rules that out and reaches the same correct answer the docs-only agent found. But the fix did not reduce
cost — the corrected worker now cross-verifies every graph-sourced claim against source files rather than
trusting it, which costs more than either blind trust (the original failure) or not using the graph at all
(the docs-only baseline).

What the plan-with-graph skill closes that the upstream PR can't

The upstream PR fixes a mechanical bug: it stops one exact match from statistically drowning out
weaker-but-relevant matches during automated retrieval. It cannot fix the deeper gap Trial 3 also
surfaced: some relationships (e.g., "this rule's values are actually populated by that hook at runtime")
have no import edge in the graph at all, because they're behavioral facts, not structural ones. No amount
of reweighting recovers a connection that was never extracted as an edge.
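
The missing-edge point can be illustrated with a toy graph. This is a sketch under stated assumptions, not graphify's data model: RuleEditor and ProviderPicker are invented node names, the edge list is hypothetical, and a real graph may still connect the two nodes through a long route of unrelated hops rather than leaving them fully disconnected.

```python
from collections import deque

# Hypothetical structural (import) edges. The runtime relationship between
# MEDIA_RULES and useMediaLookups is deliberately absent, because nothing
# in the import structure expresses it.
STRUCTURAL_EDGES = [
    ("RuleEditor", "MEDIA_RULES"),
    ("ProviderPicker", "useMediaLookups"),
]

# The behavioral fact, as an edge a human (or an enrichment pass) would add.
BEHAVIORAL_EDGE = ("MEDIA_RULES", "useMediaLookups")


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (a, b) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def reachable(edges, start, goal):
    """Breadth-first search: is there any path from start to goal?"""
    adjacency = build_adjacency(edges)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


before = reachable(STRUCTURAL_EDGES, "MEDIA_RULES", "useMediaLookups")
after = reachable(STRUCTURAL_EDGES + [BEHAVIORAL_EDGE], "MEDIA_RULES", "useMediaLookups")
```

Reachability here depends only on which edges exist, not on how matches are scored, which is why a change to match weighting leaves `before` unchanged while adding the one behavioral edge flips the result.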

That gap is exactly what a plan-with-graph session is positioned to close, and the skill previously had
no mechanism to do it — it would write design docs for the next AST rebuild to pick up, which only ever
re-derives structural facts, not behavioral ones. The updated skill now wires in
graphify save-result --outcome dead_end|corrected and graphify reflect (shipped upstream in #1441) so that
every time a graph query returns something wrong during a session and a human corrects it, both the
failure and the human's own phrasing of the correct answer are captured as a durable lesson: not just the
resolved fact, but the paraphrase itself, since that paraphrase is exactly the vocabulary the lexical
matcher doesn't index. The upstream PR and the skill update are complementary, not redundant: one fixes how
the tool weighs matches it can already find; the other captures the matches it can never find on its own.
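
As a rough picture of what "capturing the paraphrase" buys, the sketch below stores a corrected query alongside the human's wording and makes that wording searchable. It does not call or reproduce graphify save-result or graphify reflect; the Lesson and LessonIndex classes, their fields, and the word-overlap lookup are all hypothetical, and only the two outcome values come from the text above.

```python
from dataclasses import dataclass, field

VALID_OUTCOMES = {"dead_end", "corrected"}  # the two outcome values named above


@dataclass
class Lesson:
    query: str            # what was asked of the graph
    outcome: str          # "dead_end" or "corrected"
    wrong_answer: str     # what the graph returned
    human_phrasing: str   # the human's own wording of the right answer


@dataclass
class LessonIndex:
    lessons: list[Lesson] = field(default_factory=list)

    def record(self, lesson: Lesson) -> None:
        """Store a lesson, rejecting outcome values outside the known set."""
        if lesson.outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {lesson.outcome!r}")
        self.lessons.append(lesson)

    def lookup(self, query: str) -> list[Lesson]:
        """Naive recall: any shared lowercase word with a stored paraphrase."""
        terms = set(query.lower().split())
        return [
            lesson
            for lesson in self.lessons
            if terms & set(lesson.human_phrasing.lower().split())
        ]


index = LessonIndex()
index.record(
    Lesson(
        query="genres",
        outcome="corrected",
        wrong_answer="unrelated Storybook string literal",
        human_phrasing="rule values are populated by a hook at runtime",
    )
)
hits = index.lookup("runtime populated options")
misses = index.lookup("storybook")
```

The design point is that the stored paraphrase, not the resolved symbol, is what a later query in similar words can hit; a real implementation would want better matching than raw word overlap.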

Honest limitations (spelled out in report.md and each investigation followup)

  • n=1 per arm, per trial. Two trials point in the same direction, which is a stronger signal than one,
    but still a small sample.
  • Both trials ran on one repo, of modest size (~500 files), with currently accurate documentation — the
    docs-only arm's advantage is partly "this repo's docs happen to be good," not a universal claim.
  • The judge is an LLM from the same model family as the workers; one grading error was caught and
    corrected on the record, which is a reason for confidence in the diligence, not a claim the process is
    now infallible.

What I'd want to test next

A repo an order of magnitude larger, where a full grep/read sweep genuinely stops being feasible within a
reasonable budget — that's the condition under which precomputed graph structure would need to carry real
weight rather than being a shortcut to somewhere grep would have reached anyway.

Follow-up: Trial 4 — a genuinely fair "everything fixed" test (not yet run)

Trial 3's re-run is not sufficient evidence that graphify now works well in general, and this should not
be read as more conclusive than it is: the semantic enrichment applied there was reactive and specific
— keywords were hand-added to exactly the two nodes (MEDIA_RULES, useMediaLookups) already known to be
the gap, then the same task was re-run and confirmed fixed. That validates the mechanism (keyword
enrichment can close that class of gap) but is circular as evidence that the tool now generally helps,
since the answer was pre-loaded before the question was asked again.

A fair Trial 4 needs four things, none of which have been done yet:

  1. A brand-new, blind-picked task. Not Trial 1's or Trial 2's task again (both are now contaminated by
    targeted fixes), and not hand-picked around a node already known to resolve cleanly (that repeats Trial
    2's "cheating" problem in the other direction). Pick it the same way Trial 1's was picked: from an
    independent architecture survey with no visibility into what the graph can or can't currently resolve.
  2. Enrichment done systematically, not by hand. Instead of manually writing keywords for the specific
    nodes a task happens to need, this only means something if there's a general enrichment pass — e.g. an
    LLM-authored keywords field generated for every node across a whole subsystem (not cherry-picked
    nodes), produced before the task is chosen, the way graphify's own extraction pass would need to work
    if this became a real feature.
  3. A LESSONS.md with realistic history, not zero history. The reflect mechanism's whole value
    proposition (upstream issue #1441) is accumulation over many sessions. A single fresh session has an
    empty memory directory, so this variable can't be honestly tested in one shot — it should be named as an
    untested variable rather than faked with a plausible-looking seeded history.
  4. Keep the three-way comparison: graph-with-general-fixes vs. the original broken graph vs.
    docs-only, all on the same new task — so the result shows the delta the fixes actually bought, rather
    than just "it succeeds now" in isolation, which proves less.
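
For item 2, the sketch below shows how a general keywords field could widen what a lexical matcher sees. It is an illustration only: the Node class, the match function, and every keyword string are hypothetical, and nothing here reflects how graphify stores or would generate keywords.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    # Authored for every node in a subsystem, before any task is chosen.
    keywords: list[str] = field(default_factory=list)


def match(query: str, nodes: list[Node]) -> list[str]:
    """Return labels whose label or any keyword contains the query text."""
    q = query.lower()
    hits = []
    for node in nodes:
        haystack = [node.label.lower()] + [k.lower() for k in node.keywords]
        if any(q in text for text in haystack):
            hits.append(node.label)
    return hits


# The same two nodes, without and with a (hypothetical) enrichment pass.
bare = [Node("MEDIA_RULES"), Node("useMediaLookups")]
enriched = [
    Node("MEDIA_RULES", ["filter rule vocabulary", "rule definitions"]),
    Node("useMediaLookups", ["provider options loaded at runtime"]),
]

without_keywords = match("provider options", bare)
with_keywords = match("provider options", enriched)
```

The fairness condition in item 2 is about when and how widely the keywords are written, not about the matching code: the same function proves nothing if the keyword strings were authored after the task was known.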

Concrete plan for whoever picks this up

  • Spawn an independent architecture-survey agent (mirroring Trials 1/2's methodology) to pick a new task
    from an unexplored part of this codebase — same "small-sounding change, hidden cross-layer trap" shape,
    but not yet examined by any prior trial or worker in this PR's history.
  • Before the task is chosen, run a systematic (not targeted) enrichment pass: have an agent read through
    one whole subsystem (e.g. all of server/services/ or all of src/hooks/) and author a keywords
    field for every node in it, blind to what task will later be asked.
  • Apply the general algorithmic fix from
    safishamsi/graphify#1596 (or the upstream merge, if
    landed by then).
  • Re-run the same three-way comparison (graph-assisted, docs-only, judge) used in Trials 1/2 on the new
    task, and write it up as trial-4-*.md alongside the existing trial docs in this PR, following the same
    structure (task brief, anonymized plans, raw metrics, judge verdict, investigation-followup verifying
    any surprising query behavior directly rather than trusting self-reports).

Cost is comparable to Trials 1 or 2 individually (one survey agent, one enrichment pass, two worker
agents, one judge pass). Left undone here due to session budget — this section exists so a fresh session
can pick it up without re-deriving the design rationale above.

nokternol and others added 2 commits July 1, 2026 21:14
…update

Adds the full writeup for a three-trial comparison of AI-assisted
implementation planning with graphify (knowledge-graph tool) versus
plain markdown docs plus ordinary grep/read exploration, run against
this repo's rule-vocabulary and identity-resolution subsystems.
Updates the plan-with-graph skill to capture query corrections as
durable lessons for future retrieval.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
The report claimed docs-only was "cheaper on every raw metric" and
that plan quality was "equal-or-better" for docs-only in both trials.
Both claims were contradicted by the report's own data: the
graph-assisted agent made fewer tool calls in Trial 2 (21 vs 22), and
the graph-assisted plan was the only one whose actual deliverable
included a required doc fix the docs-only plan read, understood, and
then silently dropped.

Trial 2 is corrected from "docs-only wins" to "close and mixed" —
each method had genuine, distinct strengths and the judge's original
scores overstated the gap in one direction. Trial 1 remains a
decisive result; it should not have been generalized onto Trial 2.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>