build_merge / --update drops existing graph's hyperedges from unchanged files (only nodes+edges are read from graph.json) #1574

@socar-tender

Summary

build_merge() — the function backing graphify --update — reads only nodes and edges from the existing graph.json, never hyperedges. As a result, every incremental update silently drops all hyperedges belonging to files that weren't re-extracted in that run. After a full build the graph carries all its hyperedges; the first --update that touches even one file collapses the graph's hyperedge set down to just the hyperedges of the changed file(s).

Hyperedges are the highest-signal part of the semantic layer (domain-flow groupings), so losing them disproportionately damages query/explain quality.

Version: graphifyy==0.9.3.

Root cause

In build_merge() (build.py:687), the existing graph is loaded with only:

existing_nodes = list(data.get("nodes", []))   # build.py:719
existing_edges = list(data.get(links_key, [])) # build.py:720

data.get("hyperedges") is never read. The merged graph's hyperedges are then set solely from the new chunks in build():

hyperedges = extraction.get("hyperedges", [])
if hyperedges:
    ...
    G.graph["hyperedges"] = hyperedges         # build.py:591  (plain overwrite)

to_json() faithfully writes whatever build_merge left (export.py:536), so the output graph ends up with only the changed files' hyperedges. --update reaches this via _build_merge(...) at __main__.py:4844.
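The failure mode described above can be illustrated with a toy model. This is only a simplified sketch, not graphify's actual build_merge(): the function name naive_merge, the "links" key, and the sample data are all hypothetical, and the real code replaces nodes and edges per source rather than concatenating them.

```python
def naive_merge(existing_graph, new_extraction, links_key="links"):
    """Toy model of the reported behavior; NOT graphify's real build_merge()."""
    merged = {
        # Nodes and edges from the existing graph are carried forward
        # (simplified here to plain concatenation).
        "nodes": list(existing_graph.get("nodes", []))
        + list(new_extraction.get("nodes", [])),
        links_key: list(existing_graph.get(links_key, []))
        + list(new_extraction.get(links_key, [])),
    }
    # existing_graph.get("hyperedges") is never consulted: this is the gap.
    hyperedges = new_extraction.get("hyperedges", [])
    if hyperedges:
        merged["hyperedges"] = hyperedges  # plain overwrite
    return merged


# Hypothetical sample data: two hyperedges from files that did not change.
existing_graph = {
    "nodes": [{"id": "a"}, {"id": "b"}],
    "links": [{"source": "a", "target": "b"}],
    "hyperedges": [
        {"id": "h1", "source_file": "docs/a.md"},
        {"id": "h2", "source_file": "docs/b.md"},
    ],
}
# One changed file was re-extracted and yields a single hyperedge.
new_extraction = {
    "nodes": [{"id": "c"}],
    "links": [],
    "hyperedges": [{"id": "h3", "source_file": "docs/c.md"}],
}

merged = naive_merge(existing_graph, new_extraction)
surviving_ids = [h["id"] for h in merged["hyperedges"]]
print(surviving_ids)  # ['h3']
```

In this model h1 and h2 vanish even though docs/a.md and docs/b.md were untouched, which matches the symptom reported here.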

Why this looks like an unintended gap (not by design)

The codebase already knows how to preserve hyperedges across a merge — build_merge just doesn't use either mechanism:

  • attach_hyperedges() (export.py:464) merges new hyperedges into the existing set with id-level dedup:
    existing = G.graph.get("hyperedges", [])
    seen_ids = {h["id"] for h in existing}
    for h in hyperedges:
        if h.get("id") and h["id"] not in seen_ids:
            existing.append(h)
    G.graph["hyperedges"] = existing
  • The watch path explicitly carries existing hyperedges forward (watch.py:682):
    "hyperedges": existing.get("hyperedges", []),

Only the build_merge / --update path plain-overwrites from the new chunks.

Reproduction

  1. Full-build a repo that produces hyperedges from several doc files (graphify .). Note the hyperedge count in graph.json (.graph.hyperedges / top-level hyperedges).
  2. Modify a single file and run graphify --update.
  3. graph.json now contains only the hyperedges extracted from that one changed file; every hyperedge from the untouched files is gone.

Observed in practice: a repo whose semantic cache holds 57 hyperedges across its doc files had graph.json collapse to 2 hyperedges after an --update that re-extracted 2 files. The 55 lost hyperedges were all from unchanged files and remain intact in graphify-out/cache/semantic/* — only the merged graph dropped them.
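To make the before/after comparison in the steps above concrete, something like the following helper could be used. It is a rough sketch under assumptions: the helper names are made up for this report, it checks both a top-level "hyperedges" key and a nested "graph" -> "hyperedges" key because either location may apply, and it assumes each hyperedge is a dict with "id" and "source_file" fields.

```python
import json
from collections import Counter
from pathlib import Path


def extract_hyperedges(data):
    """Return the hyperedge list from parsed graph.json data (both layouts tried)."""
    top = data.get("hyperedges")
    graph_attrs = data.get("graph")
    nested = graph_attrs.get("hyperedges") if isinstance(graph_attrs, dict) else None
    return list(top or nested or [])


def load_hyperedges(graph_path):
    """Read a graph.json file from disk and return its hyperedges."""
    data = json.loads(Path(graph_path).read_text(encoding="utf-8"))
    return extract_hyperedges(data)


def hyperedges_by_source(hyperedges):
    """Count hyperedges per source_file (assumed field)."""
    return Counter(h.get("source_file", "<unknown>") for h in hyperedges)


def lost_hyperedge_ids(before, after):
    """Ids present before the update but missing afterwards."""
    after_ids = {h.get("id") for h in after}
    return sorted(h["id"] for h in before if h.get("id") and h["id"] not in after_ids)
```

A possible workflow: copy graph.json aside after the full build, run graphify --update, then call load_hyperedges() on both copies and pass the results to lost_hyperedge_ids() and hyperedges_by_source() to see which files lost their hyperedges.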

Impact

Suggested fix

Have build_merge() read data.get("hyperedges", []) from the existing graph and merge it with the new chunks' hyperedges via attach_hyperedges() (id-level dedup). Hyperedges from re-extracted files should replace those files' prior contribution, keyed by source_file. This mirrors how build_merge already replaces nodes and edges per source, and how watch.py already carries hyperedges forward.
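A minimal sketch of the merge semantics proposed here, written as a standalone function rather than a patch. The name merge_hyperedges and the reextracted_files parameter are hypothetical, and it assumes every hyperedge dict carries "id" and "source_file"; how build_merge() actually tracks the re-extracted file set would need checking against the code.

```python
def merge_hyperedges(existing, new, reextracted_files):
    """Keep hyperedges from unchanged files; replace those of re-extracted files.

    existing:          hyperedges read from the previous graph.json
    new:               hyperedges produced by this run's extraction
    reextracted_files: source_file values that were re-extracted this run
    """
    reextracted = set(reextracted_files)

    # Drop the prior contribution of every file that was re-extracted,
    # so stale hyperedges from an edited file do not linger.
    merged = [h for h in existing if h.get("source_file") not in reextracted]

    # Append new hyperedges with id-level dedup, in the spirit of
    # attach_hyperedges(); ids added in this loop are tracked as well.
    seen_ids = {h["id"] for h in merged if h.get("id")}
    for h in new:
        hid = h.get("id")
        if hid and hid not in seen_ids:
            merged.append(h)
            seen_ids.add(hid)
    return merged


# Hypothetical example: b.md was edited and re-extracted, a.md was not.
existing = [
    {"id": "h1", "source_file": "a.md"},
    {"id": "h2", "source_file": "b.md"},
]
new = [{"id": "h2b", "source_file": "b.md"}]
result = merge_hyperedges(existing, new, ["b.md"])
print([h["id"] for h in result])  # ['h1', 'h2b']
```

Filtering by source_file before the id dedup matters: with id dedup alone, a hyperedge that disappeared from an edited file would survive indefinitely in the merged graph.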
