feat(build): warn on hub-node degree collapse during incremental merge (#1652b) #1661
Open
TPAteeq wants to merge 2 commits into
…e (#1652b)

build_merge REPLACES each re-extracted file's prior nodes/edges. If a re-extraction emits a DIFFERENT id for an entity that already exists as a hub, the old hub and its in-file edges are dropped and its cross-file edges are orphaned onto a bare node. The hub silently collapses from many edges to ~0 while the node count may not shrink, so the count-based shrink guard never fires (it is also gated out of the normal dedup=True --update path anyway).

Snapshot the pre-merge graph's hub degrees (nodes with degree >= HUB_DEGREE_MIN) before the replace filter, then after the merge WARN when a former hub vanishes or loses more than DEGREE_DROP_FRAC of its degree. This is a warning, not an error, because a genuine large refactor can legitimately shed a hub's edges. Active on every path, including dedup=True.

Refs Graphify-Labs#1652 (sub-proposal b only); guards the Graphify-Labs#1651 collapse vector.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
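The snapshot-then-compare rule described in this commit message can be sketched in plain Python. This is an illustration under assumptions, not the PR's code: the function names `hub_degrees`, `hub_degree_drops`, and `format_drop` are hypothetical stand-ins, the node ids and degrees are invented sample data, and the message format is only a guess based on the `-> 0 (node dropped entirely)` suffix mentioned in the follow-up commit. The two threshold values are the ones quoted in the PR description.

```python
# Threshold values as quoted in the PR description.
HUB_DEGREE_MIN = 20
DEGREE_DROP_FRAC = 0.5


def hub_degrees(degree_by_id, hub_min=HUB_DEGREE_MIN):
    """Hypothetical helper: keep only nodes whose degree makes them hubs."""
    return {n: d for n, d in degree_by_id.items() if d >= hub_min}


def hub_degree_drops(before_hubs, after_degrees, drop_frac=DEGREE_DROP_FRAC):
    """Hypothetical helper: list hubs that vanished or lost too much degree.

    Returns (node_id, before, after, vanished) tuples, sorted by node id.
    The comparison is strict, so losing exactly drop_frac does not count.
    """
    drops = []
    for node_id, before in sorted(before_hubs.items()):
        after = after_degrees.get(node_id)
        if after is None:
            drops.append((node_id, before, 0, True))
        elif before - after > drop_frac * before:
            drops.append((node_id, before, after, False))
    return drops


def format_drop(node_id, before, after, vanished):
    """Assumed message shape; the real warning text may differ."""
    suffix = " (node dropped entirely)" if vanished else ""
    return f"{node_id}: {before} -> {after}{suffix}"


# Invented sample data: a re-extraction renames "auth.login" to "auth.Login".
before = {"auth.login": 174, "db.conn": 40, "util.log": 30, "leaf": 2}
after = {"auth.Login": 3, "db.conn": 19, "util.log": 15, "leaf": 2}

# Snapshot before the merge, compare after it.
drops = hub_degree_drops(hub_degrees(before), after)
for drop in drops:
    print("WARNING:", format_drop(*drop))
```

In this sample the node count is unchanged (4 before, 4 after), so a count-based check sees nothing, while the degree comparison flags `auth.login` (vanished) and `db.conn` (lost 21 of 40). `util.log` loses exactly half its degree and stays silent because the comparison is strict.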
…(#1652b)

Review follow-up to 9c32b4c. Five fixes:

- [primary] The degree-drop alert fired the misleading "silent data loss from a re-extraction changing an entity's id (Graphify-Labs#1651)" message on intentional prune_sources deletions (the normal way --update removes a deleted file), causing alarm fatigue on the exact signal meant to catch real corruption. build_merge now snapshots each hub's source_file and exempts any hub whose file is in prune_set (the same rule the node prune uses, mirroring the shrink guard's `not prune_sources` exemption). A hub that collapsed WITHOUT its file being pruned still warns; the genuine id-drift case still fires.
- _hub_degrees threads `directed` and builds a DiGraph/Graph to match G.degree() on build_merge(directed=True), where a bidirectional pair is 2.
- _hub_degrees counts an edge only when BOTH endpoints are in the node set, mirroring build_from_json (which drops dangling edges) instead of inflating the pre-merge degree via add_edge's implicit node creation.
- Fix the "orphaned onto a bare node" wording (build_from_json DROPS the edge).
- Tests: vanished-hub `-> 0 (node dropped entirely)` suffix; >10-hub truncation; exactly-50%-loss boundary (pins strict `>`); pure-prune silence; combined prune-exempt-but-id-drift-warns; plus directed and dangling-edge unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
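The counting rules from this follow-up (directed vs. undirected degree, skipping dangling edges, and exempting hubs whose file is being pruned) can be illustrated without NetworkX. This is a minimal sketch under assumptions: `count_degrees` and `exempt_pruned` are hypothetical names, the input shapes (a set of node ids, a list of `(source, target)` pairs, a map of id to `(degree, source_file)`) are invented for the example, and the real `_hub_degrees` reportedly builds a DiGraph/Graph instead of counting by hand.

```python
def count_degrees(node_ids, edges, directed=False):
    """Hypothetical stand-in for degree counting on a simple (non-multi) graph.

    - An edge counts only when BOTH endpoints are known nodes, so a dangling
      edge never creates a node or inflates a degree.
    - Undirected: (a, b) and (b, a) are the same edge.
    - Directed: (a, b) and (b, a) are two edges, and degree is in + out.
    """
    nodes = set(node_ids)
    degree = {n: 0 for n in nodes}
    seen = set()
    for u, v in edges:
        if u not in nodes or v not in nodes:
            continue  # dangling edge: skip it
        key = (u, v) if directed else frozenset((u, v))
        if key in seen:
            continue  # parallel edges collapse in a simple graph
        seen.add(key)
        degree[u] += 1
        degree[v] += 1
    return degree


def exempt_pruned(hub_info, prune_set):
    """Hypothetical filter: drop hubs whose source file is intentionally pruned.

    hub_info maps node id -> (degree, source_file).
    """
    return {n: deg for n, (deg, src) in hub_info.items() if src not in prune_set}


# Invented sample data.
edges = [("a", "b"), ("b", "a"), ("a", "ghost")]  # "ghost" is not a node
undirected = count_degrees({"a", "b"}, edges)
directed = count_degrees({"a", "b"}, edges, directed=True)

hub_info = {"old.hub": (25, "deleted.py"), "live.hub": (30, "kept.py")}
watched = exempt_pruned(hub_info, prune_set={"deleted.py"})
```

Here the bidirectional pair counts once per node when undirected and twice when directed, the edge to `ghost` is ignored in both cases, and only `live.hub` remains eligible for a warning because `old.hub` belongs to a pruned file.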
Implements only sub-proposal (b) of #1652: a god-node degree-drop ALERT in `build_merge`. It also doubles as the cheap safety net for the silent data-loss vector in #1651 (a hub node collapsing from many edges to ~0).

The gap this closes
`build_merge` REPLACES each re-extracted file's prior contribution: every node/edge whose `source_file` is in the re-extracted set is dropped before the merge rebuilds. If a re-extraction emits a different id for an entity that already exists as a well-connected hub, the old hub and its in-file edges are dropped, and its cross-file edges are left dangling and dropped in turn (`build_from_json` drops dangling edges). The hub silently collapses (e.g. 174 edges → ~0), and the total node count can even rise, so the existing count-based shrink guard (#479) never catches it. That guard is also gated on `not dedup and not prune_sources`, so it never fires on the normal CLI `--update` path (which always calls `build_merge(dedup=True)`).

What this does
- Snapshots hub degrees (nodes with degree >= `HUB_DEGREE_MIN`) from the graph as loaded, before the replace-per-source filter runs.
- After the merge, emits a WARNING (styled like the existing prune warnings) when a hub vanishes or loses more than `DEGREE_DROP_FRAC` of its degree, reporting the label and before → after edge counts. Many drops aggregate into a concise, capped summary.
- Active on every path, including `dedup=True`, unlike the count-based shrink guard.

Thresholds are module-level constants in `build.py`: `HUB_DEGREE_MIN = 20`, `DEGREE_DROP_FRAC = 0.5` (plus a display cap `_HUB_DROP_REPORT_LIMIT = 10`).

Explicitly out of scope
… `backup_if_protected`.)

Tests
New `tests/test_build_merge_degree_drop.py`: … `dedup=True` path; … `_hub_degrees` / `_hub_degree_drops`.

Command: `uv run python -m pytest tests/test_build_merge_degree_drop.py tests/test_build.py tests/test_build_merge_hyperedges_and_prune.py tests/test_merge_graphs_cli.py -q` → all passing (6 new + 46 existing).

Refs #1652