Summary
build_merge() — the function backing graphify --update — reads only nodes and edges from the existing graph.json, never hyperedges. As a result, every incremental update silently drops all hyperedges belonging to files that weren't re-extracted in that run. After a full build the graph carries all its hyperedges; the first --update that touches even one file collapses the graph's hyperedge set down to just the hyperedges of the changed file(s).
Hyperedges are the highest-signal part of the semantic layer (domain-flow groupings), so the loss is disproportionately damaging to query/explain quality.
Version: graphifyy==0.9.3.
Root cause
In build_merge() (build.py:687), the existing graph is loaded with only:
existing_nodes = list(data.get("nodes", [])) # build.py:719
existing_edges = list(data.get(links_key, [])) # build.py:720
data.get("hyperedges") is never read. The merged graph's hyperedges are then set solely from the new chunks in build():
hyperedges = extraction.get("hyperedges", [])
if hyperedges:
    ...
    G.graph["hyperedges"] = hyperedges  # build.py:591 (plain overwrite)
to_json() faithfully writes whatever build_merge left (export.py:536), so the output graph ends up with only the changed files' hyperedges. --update reaches this via _build_merge(...) at __main__.py:4844.
Why this looks like an unintended gap (not by design)
The codebase already knows how to preserve hyperedges across a merge — build_merge just doesn't use either mechanism:
- attach_hyperedges() (export.py:464) merges new hyperedges into the existing set with id-level dedup:
    existing = G.graph.get("hyperedges", [])
    seen_ids = {h["id"] for h in existing}
    for h in hyperedges:
        if h.get("id") and h["id"] not in seen_ids:
            existing.append(h)
    G.graph["hyperedges"] = existing
- The watch path explicitly carries existing hyperedges forward (watch.py:682):
    "hyperedges": existing.get("hyperedges", []),
Only the build_merge / --update path plain-overwrites from the new chunks.
Reproduction
- Full-build a repo that produces hyperedges from several doc files (graphify .). Note the hyperedge count in graph.json (.graph.hyperedges / top-level hyperedges).
- Modify a single file and run graphify --update.
- graph.json now contains only the hyperedges extracted from that one changed file; every hyperedge from the untouched files is gone.
Observed in practice: a repo whose semantic cache holds 57 hyperedges across its doc files had graph.json collapse to 2 hyperedges after an --update that re-extracted 2 files. The 55 lost hyperedges were all from unchanged files and remain intact in graphify-out/cache/semantic/* — only the merged graph dropped them.
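To put numbers on the before/after comparison, a small script can count the hyperedges in each graph.json and list the ids that disappeared. This is only a minimal sketch: load_hyperedges and lost_hyperedge_ids are hypothetical helper names, not part of graphify, and the code assumes hyperedges sit either under a top-level "hyperedges" key or under "graph" -> "hyperedges" (the two locations mentioned in the steps above), with an "id" on each hyperedge.

```python
import json
from pathlib import Path


def load_hyperedges(graph_path):
    """Return the hyperedge list stored in a graph.json file.

    Assumption: hyperedges are either top-level ("hyperedges") or nested
    under "graph" -> "hyperedges". Anything else yields an empty list.
    """
    data = json.loads(Path(graph_path).read_text(encoding="utf-8"))
    top_level = data.get("hyperedges")
    if isinstance(top_level, list):
        return top_level
    graph_attrs = data.get("graph")
    if isinstance(graph_attrs, dict):
        nested = graph_attrs.get("hyperedges")
        if isinstance(nested, list):
            return nested
    return []


def _ids(hyperedges):
    """Collect the non-empty ids of a hyperedge list."""
    return {h["id"] for h in hyperedges if isinstance(h, dict) and h.get("id")}


def lost_hyperedge_ids(before_path, after_path):
    """Ids present in the 'before' graph but missing from the 'after' graph."""
    before = _ids(load_hyperedges(before_path))
    after = _ids(load_hyperedges(after_path))
    return sorted(before - after)
```

A possible workflow: copy graph.json aside after the full build, run graphify --update, then call lost_hyperedge_ids() with the saved copy and the updated file. A non-empty result whose ids all belong to untouched files would match the behavior reported here.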
Impact
- [...] graph.html when the viz node limit is exceeded; graph.json is unaffected there). Here the loss is in graph.json itself.
- [...] members/node_ids alias keys silently dropped (only nodes is read) #1561 (member-list alias keys during extraction) and Native-backend extraction prompt never requests hyperedges (drifted from skill extraction-spec) #1430 (extraction prompt drift). This is purely the merge/preservation step.
Suggested fix
Have build_merge() read data.get("hyperedges", []) from the existing graph and merge it with the new chunks' hyperedges via attach_hyperedges() (id-dedup), with re-extracted files' hyperedges replacing their prior contribution by source_file. This mirrors how nodes/edges are already replaced per source in build_merge, and how watch.py already preserves them.
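The merge rule proposed here can be sketched on plain lists. This is an illustration under stated assumptions, not graphify's code: merge_hyperedges is a hypothetical helper, each hyperedge is assumed to be a dict with an "id" and a "source_file" key, and the caller is assumed to know which files were re-extracted in the current run.

```python
def merge_hyperedges(existing, new, reextracted_files):
    """Carry existing hyperedges forward, replacing those of re-extracted files.

    existing: hyperedges read from the previous graph.json
    new: hyperedges produced by this run's extraction
    reextracted_files: source_file values that were re-extracted in this run
    """
    reextracted = set(reextracted_files)
    # Drop the prior contribution of every re-extracted file; keep the rest.
    merged = [h for h in existing if h.get("source_file") not in reextracted]
    seen_ids = {h["id"] for h in merged if h.get("id")}
    # Append new hyperedges, skipping ids that are already present.
    for h in new:
        hid = h.get("id")
        if hid and hid not in seen_ids:
            merged.append(h)
            seen_ids.add(hid)
    return merged


# Hypothetical data: docs/b.md was re-extracted, docs/a.md was not.
existing = [
    {"id": "a1", "source_file": "docs/a.md"},
    {"id": "b1", "source_file": "docs/b.md"},
]
new = [{"id": "b2", "source_file": "docs/b.md"}]
merged = merge_hyperedges(existing, new, ["docs/b.md"])
print([h["id"] for h in merged])  # ['a1', 'b2']
```

In this sketch, existing hyperedges that carry no source_file are kept untouched, which is a judgment call; a real fix may prefer different handling for them.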