feat(cluster): carry community LLM labels over re-clustering via member overlap #1662
Open
TPAteeq wants to merge 2 commits into
…overlap (Graphify-Labs#1653)
`cluster-only` invalidated a saved LLM community label whenever the exact membership signature changed, so a community that merely gained or lost a member was reset to a structural hub name. Relax the reuse gate: keep the saved label when the re-clustered community still overlaps the previous community of the same id by at least LABEL_CARRYOVER_MIN_JACCARD (0.75), else fall back to the hub label as before. The "run `graphify label`" nag now fires only for communities that genuinely lost their label.
Adds community_overlap_ratios() alongside the existing overlap-based cid remap, plus unit tests for the ratio math and end-to-end cluster-only tests proving a one-member gain keeps the label while a mostly-replaced community drops it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rage (Graphify-Labs#1653)
Review follow-up for the label carry-over feature:
- Add the required `## Unreleased` CHANGELOG bullet for the feature.
- Doc: note on community_overlap_ratios that the Jaccard is over SURVIVING nodes in graph.json, so it is intentionally insensitive to deletions (design is unchanged).
- Cover the no-sig reuse branch (`or carried`): a fixture that omits the `.graphify_labels.json.sig` and asserts a >=0.75-overlap community still carries its label under a community-count mismatch.
- Add overlap-ratio edge cases: the inclusive boundary at exactly 0.75 (keep), a single-member swap on a small community (3/5 = 0.6 -> drop), and a reused cid with partial overlap.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes #1653
Problem
`graphify cluster-only` re-runs community detection and then reuses the saved `.graphify_labels.json`. To avoid resurrecting a stale label after a re-scope, the reuse loop guards each community with an exact membership signature (`community_member_sigs`, hashed over the community's exact sorted member set). If the signature differs at all, the saved (often LLM-generated) label is discarded and the community is renamed to its structural hub (highest-degree member).
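To illustrate why an exact-membership signature is so brittle, here is a minimal sketch. It is not the project's `community_member_sigs`: the helper name `member_sig`, the use of SHA-256, the newline join, and the node names other than `log_action` are all assumptions made for the example.

```python
import hashlib


def member_sig(members):
    """Hypothetical stand-in: hash one community's exact sorted member set."""
    joined = "\n".join(sorted(members))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Illustrative node ids; only `log_action` appears in this PR.
before = {"log_action", "audit", "emit", "flush", "rotate"}
after = before | {"archive"}  # the same community, plus one new member

# Identical membership always hashes to the same signature...
same = member_sig(before) == member_sig(set(before))
# ...but a single added member produces a completely different one.
changed = member_sig(before) != member_sig(after)
```

Under these assumptions `same` and `changed` are both true, which is the behaviour described above: any membership difference at all reads as a changed community.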
That guard is too strict. A community that is "the same, give or take a member" (it gained or lost a single node between labeling and the next `cluster-only`) gets a different signature and so loses its curated LLM name, snapping back to a hub name like `log_action`. Community-ID drift is already handled (`remap_communities_to_previous`, #822/#1027); the label loss here is caused specifically by the exact-signature check, not by ID misalignment.
Fix
Relax the reuse gate from exact-signature-only to exact-signature OR sufficient member overlap:
- `community_overlap_ratios(communities, previous_node_community)` in `cluster.py` computes, per community id, the Jaccard overlap `|new ∩ old| / |new ∪ old|` against the previous community that shared its id. It runs after `remap_communities_to_previous` has aligned ids, so cid `X`'s members are the natural successor of the community whose saved label and signature are also keyed on `X`. It reuses the same `old_sets` construction the remap already relies on: no new persistence, no new matching machinery.
- In the `cluster-only` reuse loop, a community now counts as `fresh` (keeps its saved label) when its exact signature is unchanged or its overlap ratio is at least `LABEL_CARRYOVER_MIN_JACCARD`. Otherwise it falls back to the hub label exactly as before.
- The "run `graphify label`" nag now fires only for communities that actually lost their label, because a carried-over community is `fresh` and never increments the changed counter.
Chosen threshold
`LABEL_CARRYOVER_MIN_JACCARD = 0.75` (module-level constant in `cluster.py`).
The exact-signature check exists to stop stale labels surviving a re-scope, so carry-over is deliberately conservative. At 0.75 a five-member community can gain or lose one member (Jaccard `5/6 ≈ 0.83`) and keep its name, while a community that swapped out more than a quarter of its members drops to the hub label. A genuinely new community (no previous community of the same id) scores `0.0` and never inherits a label. This keeps the anti-stale-label protection the signature check was added for while eliminating the churn on single-member edits.
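For reviewers who want the gate in one place, the following is a rough sketch of the overlap computation and the relaxed reuse check. It is not the code in `cluster.py`: the function names `overlap_ratios` and `keeps_label` are hypothetical, the input shapes (`{cid: members}` and `{node: old cid}`) are assumed, and it computes a plain Jaccard without modelling the surviving-nodes detail.

```python
LABEL_CARRYOVER_MIN_JACCARD = 0.75  # value stated in this PR


def overlap_ratios(communities, previous_node_community):
    """Sketch: Jaccard overlap of each community with the old one of the same id.

    Assumed shapes: communities = {cid: iterable of node ids},
    previous_node_community = {node id: old cid}.
    """
    old_sets = {}
    for node, cid in previous_node_community.items():
        old_sets.setdefault(cid, set()).add(node)
    ratios = {}
    for cid, members in communities.items():
        new = set(members)
        old = old_sets.get(cid, set())
        union = new | old
        ratios[cid] = len(new & old) / len(union) if union else 0.0
    return ratios


def keeps_label(cid, sig_unchanged, ratios):
    """Relaxed gate: exact signature match OR enough member overlap."""
    return sig_unchanged or ratios.get(cid, 0.0) >= LABEL_CARRYOVER_MIN_JACCARD


previous = {n: 0 for n in "abcde"}
previous.update({n: 1 for n in "vwxyz"})
communities = {
    0: set("abcdef"),  # gained one member: 5/6
    1: set("vwpqr"),   # mostly replaced: 2/8
    2: set("mn"),      # no previous community with this id: 0.0
}
ratios = overlap_ratios(communities, previous)
# Every signature differs here, so only the overlap ratio can save a label.
kept = [cid for cid in communities if keeps_label(cid, False, ratios)]
```

With this toy input only community `0` keeps its label; communities `1` and `2` fall through to the hub-label path.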
Scope
The `watch` update path already reuses labels by cid post-remap without the signature invalidation, so it needed no change. Only `cluster-only` had the strict gate.
The issue's bonus claude-subagents labeling backend is intentionally out of scope for this PR.
Tests
- `tests/test_community_hub_labels.py`: unit tests for `community_overlap_ratios` (identical set → 1.0, one-member gain → `5/6` ≥ threshold, mostly-replaced → below threshold, brand-new cid → 0.0, empty previous → all 0.0) and a guard that the threshold stays a conservative strong majority.
- `tests/test_labeling.py`: end-to-end `cluster-only` subprocess tests in which a community that gains one member keeps its LLM label (and no nag is printed), and a mostly-replaced community drops to its hub label (with the nag naming exactly the one renamed community). The "kept" test fails on the pre-fix exact-signature behavior, so it is a true regression guard.
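The ratio cases above, along with the boundary cases named in the review follow-up commit, can be checked by hand with exact fractions. This is an illustrative sketch rather than a copy of the repository's tests: `jaccard`, `THRESHOLD`, and the member names are made up for the example.

```python
from fractions import Fraction

THRESHOLD = Fraction(3, 4)  # 0.75, compared inclusively


def jaccard(new, old):
    """Exact Jaccard overlap of two member sets (0 when both are empty)."""
    union = new | old
    return Fraction(len(new & old), len(union)) if union else Fraction(0)


cases = {
    "identical": (set("abcde"), set("abcde")),          # 5/5
    "one_member_gain": (set("abcdef"), set("abcde")),   # 5/6
    "boundary": (set("abc"), set("abcd")),              # exactly 3/4
    "single_swap_small": (set("abcx"), set("abcd")),    # 3/5
    "brand_new": (set("mn"), set()),                    # 0
}
results = {name: jaccard(new, old) for name, (new, old) in cases.items()}
kept = {name: ratio >= THRESHOLD for name, ratio in results.items()}
```

A swap costs more than a gain because the replaced member shrinks the intersection while its replacement grows the union, which is why swapping one member of a four-member community lands at 3/5 and drops the label.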
Command: `uv run pytest tests/test_labeling.py tests/test_community_hub_labels.py -q` → 36 passed. Broader `tests/test_cli_export.py tests/test_cluster.py tests/test_watch.py` → 93 passed, 2 skipped.