
feat(build): warn on hub-node degree collapse during incremental merge (#1652b) #1661

Open
TPAteeq wants to merge 2 commits into Graphify-Labs:v8 from TPAteeq:feat/build-merge-degree-drop-alert

@TPAteeq TPAteeq commented Jul 4, 2026

Contributor

Implements only sub-proposal (b) of #1652 — a god-node degree-drop ALERT in build_merge. It also doubles as the cheap safety net for the silent data-loss vector in #1651 (a hub node collapsing from many edges to ~0).

The gap this closes

build_merge REPLACES each re-extracted file's prior contribution: every node/edge whose source_file is in the re-extracted set is dropped before the merge rebuilds. If a re-extraction emits a different id for an entity that already exists as a well-connected hub, the old hub and its in-file edges are dropped, and its cross-file edges are left dangling and then dropped when the graph is rebuilt (build_from_json drops edges with a missing endpoint). The hub silently collapses (e.g. 174 edges → ~0), and the total node count can even rise, so the existing count-based shrink guard (#479) never catches it. That guard is also gated on not dedup and not prune_sources, so it never fires on the normal CLI --update path, which always calls build_merge(dedup=True).
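To make the failure mode concrete, here is a minimal sketch of the collapse using plain dicts. `replace_per_source_merge` and the node/edge shapes are hypothetical stand-ins, not the real build_merge; the only behaviour assumed is what this PR describes: prior contributions of re-extracted files are dropped, and edges with a missing endpoint are dropped on rebuild.

```python
from collections import Counter

def replace_per_source_merge(old_nodes, old_edges, new_nodes, new_edges, reextracted):
    """Hypothetical, simplified stand-in for the replace-per-source step."""
    # Drop every prior node/edge contributed by a re-extracted file.
    nodes = [n for n in old_nodes if n["source_file"] not in reextracted]
    edges = [e for e in old_edges if e["source_file"] not in reextracted]
    nodes += new_nodes
    edges += new_edges
    # Drop dangling edges: an endpoint that no longer exists kills the edge.
    ids = {n["id"] for n in nodes}
    edges = [e for e in edges if e["source"] in ids and e["target"] in ids]
    return nodes, edges

def degree(edges):
    """Undirected degree per node id."""
    deg = Counter()
    for e in edges:
        deg[e["source"]] += 1
        deg[e["target"]] += 1
    return deg

# A hub defined in core.py and referenced from 30 other files.
old_nodes = [{"id": "core.Engine", "source_file": "core.py"}]
old_nodes += [{"id": f"caller{i}", "source_file": f"caller{i}.py"} for i in range(30)]
old_edges = [
    {"source": f"caller{i}", "target": "core.Engine", "source_file": f"caller{i}.py"}
    for i in range(30)
]

# Re-extracting core.py emits a DIFFERENT id for the same entity, plus one helper.
new_nodes = [
    {"id": "core.engine.Engine", "source_file": "core.py"},
    {"id": "core.engine.helper", "source_file": "core.py"},
]
new_edges = [
    {"source": "core.engine.Engine", "target": "core.engine.helper", "source_file": "core.py"}
]

nodes, edges = replace_per_source_merge(old_nodes, old_edges, new_nodes, new_edges, {"core.py"})
before = degree(old_edges)["core.Engine"]
after = degree(edges)["core.Engine"]
print(f"nodes: {len(old_nodes)} -> {len(nodes)}")   # nodes: 31 -> 32
print(f"core.Engine degree: {before} -> {after}")  # core.Engine degree: 30 -> 0
```

The node count goes up while the hub loses every edge, which is why a guard that only watches for a shrinking node count cannot see this.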

What this does

  • Snapshots the pre-merge graph's hub degrees (nodes with degree >= HUB_DEGREE_MIN) from the graph as loaded, before the replace-per-source filter runs.
  • After the merge, compares each former hub's degree in the final graph and emits a clear stderr WARNING (styled like the existing prune warnings) when a hub vanishes or loses more than DEGREE_DROP_FRAC of its degree, reporting the label and before → after edge counts. Many drops aggregate into a concise, capped summary.
  • Runs on every path, including dedup=True — unlike the count-based shrink guard.
  • Warns, does not raise. A genuine large refactor can legitimately shed a hub's edges, so aborting would be wrong.

Thresholds are module-level constants in build.py: HUB_DEGREE_MIN = 20, DEGREE_DROP_FRAC = 0.5 (plus a display cap _HUB_DROP_REPORT_LIMIT = 10).
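The snapshot-and-compare flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in build.py: `snapshot_hubs`, `find_drops`, and `warn_drops` are hypothetical names, the degree maps are plain dicts, and the warning wording is invented. Only the three constants, the strict `>` comparison, and the `-> 0 (node dropped entirely)` suffix come from this PR.

```python
import sys

HUB_DEGREE_MIN = 20
DEGREE_DROP_FRAC = 0.5
_HUB_DROP_REPORT_LIMIT = 10

def snapshot_hubs(degrees, labels):
    """Return {node_id: (label, degree)} for nodes at or above HUB_DEGREE_MIN."""
    return {
        nid: (labels.get(nid, nid), d)
        for nid, d in degrees.items()
        if d >= HUB_DEGREE_MIN
    }

def find_drops(hubs_before, degrees_after):
    """Return (label, before, after) for hubs that vanished or lost too much.

    `after` is None when the node no longer exists in the merged graph.
    """
    drops = []
    for nid, (label, before) in hubs_before.items():
        after = degrees_after.get(nid)
        lost = before - (after or 0)
        # Strict '>' so that losing exactly DEGREE_DROP_FRAC does not warn.
        if after is None or lost > DEGREE_DROP_FRAC * before:
            drops.append((label, before, after))
    drops.sort(key=lambda d: d[1], reverse=True)  # biggest former hubs first
    return drops

def warn_drops(drops, stream=sys.stderr):
    """Print a capped summary; never raises (the wording here is illustrative)."""
    if not drops:
        return
    print(f"WARNING: {len(drops)} hub node(s) lost most of their edges in this merge", file=stream)
    for label, before, after in drops[:_HUB_DROP_REPORT_LIMIT]:
        suffix = "0 (node dropped entirely)" if after is None else str(after)
        print(f"  {label}: {before} -> {suffix}", file=stream)
    hidden = len(drops) - _HUB_DROP_REPORT_LIMIT
    if hidden > 0:
        print(f"  ... and {hidden} more", file=stream)

# Pre-merge degrees: two hubs and one ordinary node.
hubs = snapshot_hubs({"Engine": 174, "Config": 40, "util": 3}, {"Engine": "Engine"})
# Post-merge: Engine vanished; Config lost exactly half of its edges.
drops = find_drops(hubs, {"Config": 20, "util": 3})
warn_drops(drops)
```

Here only `Engine` is reported: `Config` went from 40 to 20, which is exactly a 50% loss and therefore under the strict threshold, and `util` was never a hub.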

Explicitly out of scope

Tests

New tests/test_build_merge_degree_drop.py:

  • a merge that collapses a hub (re-extraction emits a new id) triggers the warning on the dedup=True path;
  • benign re-extraction that preserves the hub's id + edges does not warn;
  • a small sub-threshold degree change does not warn;
  • a graph with no prior hub does not warn;
  • plus helper-level unit tests for _hub_degrees / _hub_degree_drops.

Command: uv run python -m pytest tests/test_build_merge_degree_drop.py tests/test_build.py tests/test_build_merge_hyperedges_and_prune.py tests/test_merge_graphs_cli.py -q → all passing (6 new + 46 existing).

Refs #1652

TPAteeq and others added 2 commits July 5, 2026 00:01
…e (#1652b)

build_merge REPLACES each re-extracted file's prior nodes/edges. If a
re-extraction emits a DIFFERENT id for an entity that already exists as a
hub, the old hub and its in-file edges are dropped and its cross-file edges
are orphaned onto a bare node — the hub silently collapses from many edges to
~0 while the node count may not shrink, so the count-based shrink guard never
fires (it is also gated out of the normal dedup=True --update path anyway).

Snapshot the pre-merge graph's hub degrees (nodes with degree >=
HUB_DEGREE_MIN) before the replace filter, then after the merge WARN when a
former hub vanishes or loses more than DEGREE_DROP_FRAC of its degree. This is
a warning, not an error, because a genuine large refactor can legitimately
shed a hub's edges. Active on every path, including dedup=True.

Refs Graphify-Labs#1652 (sub-proposal b only); guards the Graphify-Labs#1651 collapse vector.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(#1652b)

Review follow-up to 9c32b4c. Five fixes:

- [primary] The degree-drop alert fired the misleading "silent data loss from
  a re-extraction changing an entity's id (Graphify-Labs#1651)" message on intentional
  prune_sources deletions — the normal way --update removes a deleted file —
  causing alarm fatigue on the exact signal meant to catch real corruption.
  build_merge now snapshots each hub's source_file and exempts any hub whose
  file is in prune_set (the same rule the node prune uses, mirroring the shrink
  guard's `not prune_sources` exemption). A hub that collapsed WITHOUT its file
  being pruned still warns; the genuine id-drift case still fires.
- _hub_degrees threads `directed` and builds a DiGraph/Graph to match
  G.degree() on build_merge(directed=True), where a bidirectional pair is 2.
- _hub_degrees counts an edge only when BOTH endpoints are in the node set,
  mirroring build_from_json (which drops dangling edges) instead of inflating
  the pre-merge degree via add_edge's implicit node creation.
- Fix the "orphaned onto a bare node" wording (build_from_json DROPS the edge).
- Tests: vanished-hub `-> 0 (node dropped entirely)` suffix; >10-hub truncation;
  exactly-50%-loss boundary (pins strict `>`); pure-prune silence; combined
  prune-exempt-but-id-drift-warns; plus directed and dangling-edge unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
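The three snapshot-related fixes in this follow-up (directed degree, both-endpoints-only counting, and the prune exemption) can be illustrated with a small standard-library sketch. `hub_snapshot` and `exempt_pruned` are hypothetical names and the dict shapes are assumptions; the real `_hub_degrees` is described as building a DiGraph/Graph, which this sketch only imitates by deduplicating edge keys.

```python
from collections import Counter

HUB_DEGREE_MIN = 20

def hub_snapshot(nodes, edges, directed=False, min_degree=HUB_DEGREE_MIN):
    """Return {node_id: (degree, source_file)} for nodes at or above min_degree."""
    by_id = {n["id"]: n for n in nodes}
    seen = set()
    for e in edges:
        u, v = e["source"], e["target"]
        # Count an edge only when BOTH endpoints are real nodes, so a
        # dangling edge cannot inflate the pre-merge degree.
        if u not in by_id or v not in by_id:
            continue
        # Undirected: a<->b collapses to one edge. Directed: it stays two.
        seen.add((u, v) if directed else tuple(sorted((u, v))))
    deg = Counter()
    for u, v in seen:
        deg[u] += 1
        deg[v] += 1
    return {
        nid: (d, by_id[nid].get("source_file"))
        for nid, d in deg.items()
        if d >= min_degree
    }

def exempt_pruned(hubs, prune_set):
    """Drop hubs whose source file was intentionally pruned (assumed exact match)."""
    return {nid: info for nid, info in hubs.items() if info[1] not in prune_set}

nodes = [
    {"id": "a", "source_file": "a.py"},
    {"id": "b", "source_file": "b.py"},
    {"id": "c", "source_file": "c.py"},
]
edges = [
    {"source": "a", "target": "b"},
    {"source": "b", "target": "a"},      # bidirectional pair with the edge above
    {"source": "a", "target": "c"},
    {"source": "a", "target": "ghost"},  # dangling: "ghost" is not a node
]

# min_degree=2 only to keep the example small.
undirected = hub_snapshot(nodes, edges, directed=False, min_degree=2)
directed = hub_snapshot(nodes, edges, directed=True, min_degree=2)
watched = exempt_pruned(directed, prune_set={"b.py"})
print(undirected)  # {'a': (2, 'a.py')}
print(watched)     # {'a': (3, 'a.py')}
```

In the directed case `b` also qualifies as a hub because the a/b pair counts twice, but it is removed from the watch list once `b.py` is in the prune set, so an intentional deletion stays silent.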