1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@ Full release notes with details on each version: [GitHub Releases](https://githu

## Unreleased

- Feat: a re-clustered community that keeps most of its members now retains its saved LLM label instead of resetting to a structural hub name (#1653, thanks @Ns2384-star). `graphify cluster-only` previously dropped a community's saved label on any exact-membership-signature change, so a community that merely gained or lost a member was renamed to its highest-degree hub (e.g. an `auth` community reverting to a bare `log_action`). Label carry-over is now gated on member overlap: a community that still shares at least `LABEL_CARRYOVER_MIN_JACCARD` (0.75) Jaccard overlap with the previous community of the same id keeps its name, while a mostly-replaced community still falls back to the hub label. The "run `graphify label`" nag now fires only for communities that genuinely lost their label.
- Fix: a malformed semantic chunk no longer crashes `extract` and discards every successful chunk (#1631, thanks @ssazy). When an LLM returned a well-formed object whose `edges` (or `nodes`/`hyperedges`) array carried a stray non-dict entry — a nested list where an edge object belongs — the AST+semantic merge and the semantic-cache write both called `.get()` per entry and raised `AttributeError: 'list' object has no attribute 'get'`. On a 34-chunk run where 33 succeeded, that meant no `graph.json` was written and the cache write failed too, so a re-run re-extracted everything. `_parse_llm_json` now sanitizes each fragment at the single parse chokepoint (keeping only dict entries and coercing a non-list value to `[]`), so the cache writer, the adaptive-retry merge, and the CLI merge are all protected in one place.
- Fix: an unresolved bare npm import no longer aliases onto an unrelated same-named local file (#1638, thanks @EveX1). `import colors from "tailwindcss/colors"` in a `.tsx` file emitted an `imports_from` edge to the bare id `colors`, and build.py's pre-migration alias index (which registers every local file's bare stem) then remapped it onto an unrelated `backend/utils/colors.py` — a confident (`EXTRACTED`) cross-language phantom edge, and one per `.tsx` file sharing the import. In a real monorepo eight unrelated `.tsx` files all landed on a single Python module. Common package subpaths (`colors`, `utils`, `types`, `config`, `client`) collide this way constantly. The external-import fallback now namespaces its target with the `ref` prefix (the same J-4 convention used for tsconfig `extends`/`$ref` externals), so it can never collapse to a local file/symbol id; the ref-namespaced target has no node, so build drops it as an external reference — the correct outcome for a third-party import.
- Fix: `graph.json` node/edge ordering is now stable run-to-run for document/semantic corpora (#1632, thanks @umeshpsatwe). With a parallel LLM backend, `extract_corpus_parallel` merged chunk results in completion order, so which network call happened to return first reordered the nodes and edges even when the model returned identical content — churning `graph.json` between otherwise-identical runs. Chunks are now merged in deterministic submission order after the pool drains (matching the serial path); the progress callback still fires in completion order so long local runs aren't silent. Note: the semantic content the LLM extracts is itself nondeterministic run-to-run — this fix removes the pipeline's own ordering churn, not the model's variance.
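
The submission-order merge described in the last entry can be sketched roughly as below. This is only an illustration under stated assumptions, not graphify's actual `extract_corpus_parallel`: `extract_chunk`, `merge_in_submission_order`, and the `on_progress` callback are hypothetical names, and the short sleep stands in for an LLM network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_chunk(chunk_id: int) -> dict:
    """Hypothetical stand-in for an LLM call; higher ids are given shorter delays."""
    time.sleep(0.01 * (3 - chunk_id))
    return {"nodes": [f"node-{chunk_id}"]}


def merge_in_submission_order(chunk_ids, on_progress=None):
    """Run chunks in parallel, report progress as they finish, merge in submit order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The list preserves submission order regardless of which call returns first.
        futures = [pool.submit(extract_chunk, cid) for cid in chunk_ids]
        # Progress still fires in completion order, so long runs are not silent.
        for fut in as_completed(futures):
            if on_progress is not None:
                on_progress(fut.result())
    # The pool has drained here; walk the futures in submission order to merge.
    merged = []
    for fut in futures:
        merged.extend(fut.result()["nodes"])
    return merged


completion_order = []
merged_nodes = merge_in_submission_order(
    [0, 1, 2], on_progress=lambda result: completion_order.append(result["nodes"][0])
)
```

The merged list depends only on the order in which chunks were submitted, while `completion_order` may vary from run to run with timing.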
21 changes: 18 additions & 3 deletions graphify/__main__.py
@@ -3559,7 +3559,12 @@ def main() -> None:
        # is told to `graphify label` for fresh LLM names. Unchanged communities keep
        # their saved label. When no signature sidecar exists (labels predate this),
        # fall back to hub-filling only the communities missing a label.
        from graphify.cluster import community_member_sigs, label_communities_by_hub
        from graphify.cluster import (
            LABEL_CARRYOVER_MIN_JACCARD,
            community_member_sigs,
            community_overlap_ratios,
            label_communities_by_hub,
        )
        sig_path = labels_path.parent / (labels_path.name + ".sig")
        saved_sigs: dict[int, str] = {}
        if sig_path.exists():
@@ -3572,21 +3577,31 @@ def main() -> None:
            except Exception:
                saved_sigs = {}
        cur_sigs = community_member_sigs(communities)
        # Overlap of each (already remapped) community against the previous
        # community that shared its id. An exact-signature mismatch used to
        # discard the saved LLM label outright, so a community that merely
        # gained/lost a member was renamed to its hub (#1653). Carry the
        # label over when the two are still substantially the same set.
        overlap_ratios = community_overlap_ratios(communities, previous_node_community)
        count_mismatch = len(existing_labels) != len(communities)
        labels = {}
        hub_labels: dict[int, str] | None = None
        changed = 0
        for cid in communities:
            have_label = cid in existing_labels
            # Same community, give or take a member: keep its saved label
            # even though the exact signature changed, gated on a conservative
            # Jaccard so a genuinely different community can't inherit it.
            carried = have_label and overlap_ratios.get(cid, 0.0) >= LABEL_CARRYOVER_MIN_JACCARD
            if saved_sigs:
                # Precise: the membership signature tells us if this exact
                # community changed since it was labeled.
                fresh = have_label and saved_sigs.get(cid) == cur_sigs.get(cid)
                fresh = (have_label and saved_sigs.get(cid) == cur_sigs.get(cid)) or carried
            else:
                # No signature sidecar (labels predate it). A differing community
                # COUNT means the labels describe a different clustering, so a cid's
                # old label can't be trusted; equal count is the best "same" signal.
                fresh = have_label and not count_mismatch
                fresh = (have_label and not count_mismatch) or carried
            if fresh:
                labels[cid] = existing_labels[cid]
            else:
49 changes: 49 additions & 0 deletions graphify/cluster.py
@@ -318,3 +318,52 @@ def remap_communities_to_previous(
    for new_cid, nodes in communities.items():
        remapped[new_to_final[new_cid]] = sorted(nodes)
    return dict(sorted(remapped.items(), key=lambda kv: kv[0]))


# Minimum Jaccard overlap between a re-clustered community and the previous
# community that shared its id for the saved LLM label to be carried over
# instead of reset to a structural hub name (#1653). Conservative on purpose:
# the exact-membership signature check (community_member_sigs) was added to stop
# stale labels surviving a re-scope, so carry-over only kicks in when the two
# communities are "the same, give or take a member". At 0.75 a five-member
# community may gain one member (Jaccard 5/6 ≈ 0.83) or lose one (4/5 = 0.80)
# and keep its name, but a community that swapped out a quarter of its members
# (e.g. one of four, Jaccard 3/5 = 0.60) drops to the hub.
LABEL_CARRYOVER_MIN_JACCARD = 0.75


def community_overlap_ratios(
    communities: dict[int, list[str]],
    previous_node_community: dict[str, int],
) -> dict[int, float]:
    """Jaccard overlap of each community against the previous community with the
    same id: ``{cid: |new ∩ old| / |new ∪ old|}``.

    Call *after* :func:`remap_communities_to_previous` has aligned ids to the
    prior assignment, so cid ``X``'s members are the natural successor of the
    community whose saved label/signature are also keyed on ``X``. A ratio near
    1.0 means "the same community, give or take a member" — enough to carry a
    saved LLM label across a re-cluster (see ``LABEL_CARRYOVER_MIN_JACCARD``)
    rather than resetting it to a hub name (#1653). A cid with no previous
    community of the same id (a genuinely new community) scores 0.0.

    ``previous_node_community`` is read from the surviving nodes' saved
    ``community`` tags in the current ``graph.json``, so the Jaccard is computed
    over SURVIVING nodes only: a member deleted from the graph is absent from
    both sets and neither shrinks nor inflates the overlap. This deletion-
    insensitivity is intentional — a community losing nodes to deletion is still
    "the same community", so it keeps its label.
    """
    old_sets: dict[int, set[str]] = {}
    for node, old_cid in previous_node_community.items():
        old_sets.setdefault(old_cid, set()).add(str(node))

    ratios: dict[int, float] = {}
    for cid, nodes in communities.items():
        new_set = {str(n) for n in nodes}
        old_set = old_sets.get(cid)
        if not new_set or not old_set:
            ratios[cid] = 0.0
            continue
        union = len(new_set | old_set)
        ratios[cid] = (len(new_set & old_set) / union) if union else 0.0
    return ratios
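
To see the gate in action outside the diff, here is a small self-contained sketch of the same idea. It does not import graphify: `jaccard`, `carried_label`, `MIN_JACCARD`, and the member names are hypothetical stand-ins chosen for illustration, with `MIN_JACCARD` simply mirroring the 0.75 value of `LABEL_CARRYOVER_MIN_JACCARD`.

```python
MIN_JACCARD = 0.75  # assumption: mirrors LABEL_CARRYOVER_MIN_JACCARD above


def jaccard(new_members, old_members) -> float:
    """|new ∩ old| / |new ∪ old|, or 0.0 when either side is empty."""
    new_set, old_set = set(new_members), set(old_members)
    if not new_set or not old_set:
        return 0.0
    return len(new_set & old_set) / len(new_set | old_set)


def carried_label(saved_label: str, new_members, old_members, hub_name: str) -> str:
    """Keep the saved label when overlap clears the inclusive gate, else use the hub."""
    if jaccard(new_members, old_members) >= MIN_JACCARD:
        return saved_label
    return hub_name


# Hypothetical five-member "auth" community whose hub node is log_action.
old = ["login", "logout", "session", "token", "log_action"]
grown = old + ["refresh"]                                          # gained one member
replaced = ["log_action", "audit", "metrics", "trace", "export"]   # only the hub survives

kept = carried_label("auth", grown, old, hub_name="log_action")
reset = carried_label("auth", replaced, old, hub_name="log_action")
```

Here `grown` overlaps at 5/6, so the saved name is kept, while `replaced` overlaps at 1/9 and falls back to the hub name.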
84 changes: 84 additions & 0 deletions tests/test_community_hub_labels.py
@@ -81,3 +81,87 @@ def test_community_member_sigs_change_when_membership_changes():
    before = community_member_sigs({0: ["x", "y", "z"]})
    after = community_member_sigs({0: ["x", "y"]})  # a node left the community
    assert before[0] != after[0], "signature must change when a community's members change"


# ── label carry-over via member overlap (cluster-only re-cluster, #1653) ───────

def test_overlap_ratio_identical_community_is_one():
    from graphify.cluster import community_overlap_ratios
    prev = {"a": 0, "b": 0, "c": 0}
    ratios = community_overlap_ratios({0: ["a", "b", "c"]}, prev)
    assert ratios[0] == 1.0


def test_overlap_ratio_gained_one_member_stays_above_threshold():
    # Old community was 5 members; new run gained one -> Jaccard 5/6 ≈ 0.83.
    from graphify.cluster import community_overlap_ratios, LABEL_CARRYOVER_MIN_JACCARD
    prev = {f"n{i}": 0 for i in range(5)}
    new_members = [f"n{i}" for i in range(5)] + ["n5"]
    ratios = community_overlap_ratios({0: new_members}, prev)
    assert ratios[0] == 5 / 6
    assert ratios[0] >= LABEL_CARRYOVER_MIN_JACCARD, "a one-member drift must clear the carry-over gate"


def test_overlap_ratio_swapped_most_members_drops_below_threshold():
    # Only one of the six old members survives and five new ones join:
    # intersection 1, union 11 -> Jaccard 1/11 ≈ 0.09, well below the gate.
    from graphify.cluster import community_overlap_ratios, LABEL_CARRYOVER_MIN_JACCARD
    prev = {f"old{i}": 0 for i in range(6)}
    new_members = ["old0"] + [f"new{i}" for i in range(5)]
    ratios = community_overlap_ratios({0: new_members}, prev)
    assert ratios[0] == 1 / 11
    assert ratios[0] < LABEL_CARRYOVER_MIN_JACCARD, "a mostly-new community must NOT carry the stale label"


def test_overlap_ratio_new_community_scores_zero():
    # cid 1 has no previous community of the same id -> genuinely new -> 0.0.
    from graphify.cluster import community_overlap_ratios
    prev = {"a": 0, "b": 0}
    ratios = community_overlap_ratios({0: ["a", "b"], 1: ["x", "y"]}, prev)
    assert ratios[1] == 0.0


def test_overlap_ratio_empty_previous_is_all_zero():
    from graphify.cluster import community_overlap_ratios
    ratios = community_overlap_ratios({0: ["a"], 1: ["b"]}, {})
    assert ratios == {0: 0.0, 1: 0.0}


def test_carryover_threshold_is_conservative():
    # A stale-label guard: the default must be a strong majority overlap so a
    # re-scoped community can't silently inherit an old LLM name (#1653).
    from graphify.cluster import LABEL_CARRYOVER_MIN_JACCARD
    assert 0.5 < LABEL_CARRYOVER_MIN_JACCARD <= 1.0


def test_overlap_ratio_exactly_at_threshold_keeps_inclusive():
    # old {1,2,3} -> new {1,2,3,4}: intersection 3, union 4 -> Jaccard exactly 0.75.
    # The gate is inclusive (>=), so a community sitting right on the boundary
    # KEEPS its label. Pins the `>=` against an accidental flip to `>`.
    from graphify.cluster import community_overlap_ratios, LABEL_CARRYOVER_MIN_JACCARD
    prev = {"1": 0, "2": 0, "3": 0}
    ratios = community_overlap_ratios({0: ["1", "2", "3", "4"]}, prev)
    assert ratios[0] == 0.75
    assert ratios[0] >= LABEL_CARRYOVER_MIN_JACCARD, "an exact 0.75 overlap must clear the inclusive gate"


def test_overlap_ratio_single_member_swap_drops_below_threshold():
    # A single add+remove on a 4-member community: old {1,2,3,4} -> new {1,2,3,5}.
    # intersection 3, union 5 -> Jaccard 0.6, just under the gate -> DROP the label.
    # Unlike the extreme 1/11 swap, this is the tightest failing case.
    from graphify.cluster import community_overlap_ratios, LABEL_CARRYOVER_MIN_JACCARD
    prev = {"1": 0, "2": 0, "3": 0, "4": 0}
    ratios = community_overlap_ratios({0: ["1", "2", "3", "5"]}, prev)
    assert ratios[0] == 0.6
    assert ratios[0] < LABEL_CARRYOVER_MIN_JACCARD, "a single-member swap on a small community must drop the label"


def test_overlap_ratio_reused_cid_partial_overlap():
    # A reused cid whose community only partially overlaps its predecessor:
    # old {p,q,r,s} -> new {p,q,x,y}. intersection 2, union 6 -> Jaccard 1/3.
    # A partial reuse is neither a fresh community (0.0) nor an identity (1.0).
    from graphify.cluster import community_overlap_ratios, LABEL_CARRYOVER_MIN_JACCARD
    prev = {"p": 0, "q": 0, "r": 0, "s": 0}
    ratios = community_overlap_ratios({0: ["p", "q", "x", "y"]}, prev)
    assert ratios[0] == 1 / 3
    assert 0.0 < ratios[0] < 1.0
    assert ratios[0] < LABEL_CARRYOVER_MIN_JACCARD, "a partial reuse below the gate must not carry the label"