feat(cluster): carry community LLM labels over re-clustering via member overlap #1662

Open
TPAteeq wants to merge 2 commits into Graphify-Labs:v8 from TPAteeq:feat/community-label-carryover

@TPAteeq TPAteeq commented Jul 4, 2026

Contributor

Fixes #1653

Problem

graphify cluster-only re-runs community detection and then reuses the saved
.graphify_labels.json. To avoid resurrecting a stale label after a re-scope,
the reuse loop guards each community with an exact membership signature
(community_member_sigs, hashed over the community's exact sorted member set).
If the signature differs at all, the saved (often LLM-generated) label is
discarded and the community is renamed to its structural hub (highest-degree
member).

That guard is too strict. A community that is "the same, give or take a member"
(it gained or lost a single node between labeling and the next cluster-only run)
gets a different signature and so loses its curated LLM name, snapping back to
a hub name like log_action. Community-ID drift is already handled
(remap_communities_to_previous, #822/#1027); the label loss here is caused
specifically by the exact-signature check, not by ID misalignment.
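
To illustrate why a single-member change invalidates the label, here is a minimal sketch of an exact membership signature. The PR only says the signature is hashed over the community's exact sorted member set; the `member_signature` helper, the SHA-256 choice, and every node name other than `log_action` are assumptions for illustration, not the code behind community_member_sigs.

```python
import hashlib


def member_signature(members):
    """Hash a community's exact, sorted member set (hypothetical stand-in)."""
    joined = "\n".join(sorted(set(members)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Membership at labeling time, and after a re-cluster that adds one node.
labeled = ["log_action", "audit", "emit", "flush", "rotate"]
reclustered = labeled + ["archive"]

sig_before = member_signature(labeled)
sig_after = member_signature(reclustered)

# Sorting makes the signature independent of member order...
order_independent = member_signature(list(reversed(labeled))) == sig_before
# ...but any added or removed member produces a different digest.
one_member_changes_sig = sig_before != sig_after
```

Under an exact-match gate, `one_member_changes_sig` being true is enough to discard the saved label, even though five of the six members are unchanged.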

Fix

Relax the reuse gate from exact-signature-only to exact-signature OR
sufficient member overlap:

  • New helper community_overlap_ratios(communities, previous_node_community) in
    cluster.py computes, per community id, the Jaccard overlap
    |new ∩ old| / |new ∪ old| against the previous community that shared its id.
    It runs after remap_communities_to_previous has aligned ids, so cid X's
    members are the natural successor of the community whose saved label and
    signature are also keyed on X. It reuses the same old_sets construction
    the remap already relies on — no new persistence, no new matching machinery.
  • In the cluster-only reuse loop, a community now counts as fresh (keeps its
    saved label) when its exact signature is unchanged or its overlap ratio is
    at least LABEL_CARRYOVER_MIN_JACCARD. Otherwise it falls back to the hub
    label exactly as before.
  • The "community set changed since labeling — run graphify label" nag now
    fires only for communities that actually lost their label, because a
    carried-over community is fresh and never increments the changed counter.
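
The three points above could be sketched roughly as follows. This is not the code in cluster.py: the argument shapes (`communities` as a dict of cid to node ids, `previous_node_community` as a dict of node id to old cid, containing only nodes still present in the graph) and the `label_is_fresh` helper are assumptions made for the example. Only the function name community_overlap_ratios, the constant name, and its value come from the PR.

```python
LABEL_CARRYOVER_MIN_JACCARD = 0.75  # value stated in the PR


def community_overlap_ratios(communities, previous_node_community):
    """Per-cid Jaccard overlap with the previous community of the same id.

    Assumed shapes: communities = {cid: iterable of node ids},
    previous_node_community = {node id: previous cid}.
    """
    # Rebuild the previous member sets by inverting node -> old cid.
    old_sets = {}
    for node, old_cid in previous_node_community.items():
        old_sets.setdefault(old_cid, set()).add(node)

    ratios = {}
    for cid, members in communities.items():
        new = set(members)
        old = old_sets.get(cid, set())
        union = new | old
        # A cid with no previous community has an empty intersection: 0.0.
        ratios[cid] = len(new & old) / len(union) if union else 0.0
    return ratios


def label_is_fresh(sig_unchanged, overlap):
    """Hypothetical gate: exact signature match OR sufficient overlap."""
    return sig_unchanged or overlap >= LABEL_CARRYOVER_MIN_JACCARD


previous = {n: 0 for n in "abcde"} | {n: 1 for n in "pqrs"}
current = {
    0: ["a", "b", "c", "d", "e", "f"],  # gained one member
    1: ["p", "x", "y", "z"],            # mostly replaced
    2: ["m", "n"],                      # brand-new cid
}

ratios = community_overlap_ratios(current, previous)
# Assume every signature changed, so only the overlap can keep a label.
kept = [cid for cid in current if label_is_fresh(False, ratios[cid])]
```

Here only community 0 keeps its saved label; communities 1 and 2 would fall back to the hub label.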

Chosen threshold

LABEL_CARRYOVER_MIN_JACCARD = 0.75 (module-level constant in cluster.py).

The exact-signature check exists to stop stale labels surviving a re-scope, so
carry-over is deliberately conservative. At 0.75 a five-member community can
gain one member (Jaccard 5/6 ≈ 0.83) or lose one (4/5 = 0.80) and keep its name, while a
community that swapped out more than a quarter of its members drops to the hub
label. A genuinely new community (no previous community of the same id) scores
0.0 and never inherits a label. This keeps the anti-stale-label protection the
signature check was added for while eliminating the churn on single-member
edits.
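
The arithmetic behind the threshold can be checked with exact fractions. The sketch below is illustrative only: `jaccard_after_edit` and `keeps_label` are hypothetical helpers, and "moved out" means a member that is still in the graph but now belongs to a different community.

```python
from fractions import Fraction

LABEL_CARRYOVER_MIN_JACCARD = Fraction(3, 4)  # 0.75, as stated in the PR


def jaccard_after_edit(size, gained=0, moved_out=0):
    """Jaccard between an old community of `size` members and its successor.

    `gained` members are new to the community; `moved_out` members left it.
    """
    intersection = size - moved_out  # old members that stayed
    union = size + gained            # every old member plus the newcomers
    return Fraction(intersection, union)


def keeps_label(ratio):
    """The threshold is inclusive: exactly 0.75 still keeps the label."""
    return ratio >= LABEL_CARRYOVER_MIN_JACCARD


cases = {
    "5 members, gain 1": jaccard_after_edit(5, gained=1),
    "5 members, lose 1": jaccard_after_edit(5, moved_out=1),
    "4 members, lose 1": jaccard_after_edit(4, moved_out=1),
    "4 members, swap 1": jaccard_after_edit(4, gained=1, moved_out=1),
}

for name, ratio in cases.items():
    verdict = "keep label" if keeps_label(ratio) else "hub label"
    print(f"{name}: {ratio} -> {verdict}")
```

A swap counts against the ratio twice, shrinking the intersection and growing the union, which is why swapping one member of a four-member community (3/5) drops the label while merely losing one (3/4) does not.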

Scope

The watch update path already reuses labels by cid post-remap without the
signature invalidation, so it needed no change. Only cluster-only had the
strict gate.

The issue's bonus claude-subagents labeling backend is intentionally out of
scope for this PR.

Tests

  • tests/test_community_hub_labels.py: unit tests for community_overlap_ratios
    (identical set → 1.0, one-member gain → 5/6 ≥ threshold, mostly-replaced →
    below threshold, brand-new cid → 0.0, empty previous → all 0.0) and a guard
    that the threshold stays a conservative strong majority.
  • tests/test_labeling.py: end-to-end cluster-only subprocess tests — a
    community that gains one member keeps its LLM label (and no nag is
    printed), and a mostly-replaced community drops to its hub label (with the
    nag naming exactly the one renamed community). The "kept" test fails on the
    pre-fix exact-signature behavior, so it is a true regression guard.

Command: uv run pytest tests/test_labeling.py tests/test_community_hub_labels.py -q
→ 36 passed. Broader tests/test_cli_export.py tests/test_cluster.py tests/test_watch.py → 93 passed, 2 skipped.

TPAteeq and others added 2 commits July 5, 2026 00:01
…overlap (Graphify-Labs#1653)

`cluster-only` invalidated a saved LLM community label whenever the exact
membership signature changed, so a community that merely gained or lost a
member was reset to a structural hub name. Relax the reuse gate: keep the saved
label when the re-clustered community still overlaps the previous community of
the same id by at least LABEL_CARRYOVER_MIN_JACCARD (0.75), else fall back to
the hub label as before. The "run `graphify label`" nag now fires only for
communities that genuinely lost their label.

Adds community_overlap_ratios() alongside the existing overlap-based cid remap,
plus unit tests for the ratio math and end-to-end cluster-only tests proving a
one-member gain keeps the label while a mostly-replaced community drops it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rage (Graphify-Labs#1653)

Review follow-up for the label carry-over feature:

- Add the required `## Unreleased` CHANGELOG bullet for the feature.
- Doc: note on community_overlap_ratios that the Jaccard is over SURVIVING
  nodes in graph.json, so it is intentionally insensitive to deletions
  (design is unchanged).
- Cover the no-sig reuse branch (`or carried`): a fixture that omits the
  `.graphify_labels.json.sig` and asserts a >=0.75-overlap community still
  carries its label under a community-count mismatch.
- Add overlap-ratio edge cases: the inclusive boundary at exactly 0.75
  (keep), a single-member swap on a small community (3/5 = 0.6 -> drop),
  and a reused cid with partial overlap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Feature: carry community labels over re-clustering via member-overlap matching (invalidation exists, carry-over does not)