Summary
graphify cluster-only re-runs Leiden clustering and then re-applies the existing
.graphify_labels.json by raw cid index, without invoking the
remap_communities_to_previous safety net that PR #822 introduced for the
watch / update paths. As a result, community labels attach to clusters whose
members are unrelated to the original label's meaning whenever the underlying
graph has changed between the labeling pass and the cluster-only invocation.
Reproduce
# 1) Full build with Step 5 labeling
/graphify . # or any equivalent that writes .graphify_labels.json
# 2) Make any code change that alters cluster sizes
# 3) Re-extract code without clustering (the official #822 workflow)
graphify update --no-cluster .
# 4) Re-cluster
graphify cluster-only .
# 5) Inspect graphify-out/GRAPH_REPORT.md or graph.html
# → community labels are attached to clusters whose actual members
# are unrelated to the label's original meaning.
The mismatch is deterministic: cluster() renumbers communities in size-descending
order on every run, so a cid identifies a size rank rather than a stable set of
nodes, and the prior labels file is read back verbatim by cid index in
graphify/__main__.py:1714-1722.
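To illustrate the mechanism, here is a minimal standalone sketch. It does not use graphify's code: reindex_by_size is a hypothetical stand-in for the size-descending renumbering described above, and the node names and labels are made up.

```python
def reindex_by_size(groups):
    """Assign cids 0..n-1 in size-descending order (hypothetical stand-in)."""
    # Ties are broken by the sorted member names so the result is deterministic.
    ordered = sorted(groups, key=lambda g: (-len(g), sorted(g)))
    return {cid: set(g) for cid, g in enumerate(ordered)}


auth = {"auth.login", "auth.logout", "auth.token"}
db = {"db.connect", "db.query"}

# Labeling pass: auth is the largest cluster, so it gets cid 0.
before = reindex_by_size([auth, db])
labels = {0: "Authentication", 1: "Database"}

# A code change adds two db nodes; db is now the largest cluster.
db_grown = db | {"db.migrate", "db.pool"}
after = reindex_by_size([auth, db_grown])

# Reapplying labels by raw cid attaches "Authentication" to the db cluster.
stale = {labels[cid]: sorted(members) for cid, members in after.items()}
print(stale)
```

In this toy case both labels end up on the wrong cluster even though neither cluster changed its meaning; only the size ranking changed.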
Expected
After cluster-only, labels should follow the same overlap-based remapping that
PR #822 added to the watch/update paths — a label associated with a previous
community should attach to the new community whose node set most overlaps with
the old one.
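As a rough sketch of what "most overlaps" could mean in practice, the following greedy matcher is one possible reading. It is not the implementation of remap_communities_to_previous from PR #822; remap_by_overlap is a hypothetical name, and the data shapes (a dict of cid to node list, and a dict of node id to previous cid) are assumptions.

```python
from collections import Counter


def remap_by_overlap(new_communities, previous_node_community):
    """Renumber new communities so each reuses the old cid it overlaps most.

    Hypothetical sketch: greedy one-to-one matching by shared node count.
    """
    pairs = []
    for new_cid, members in new_communities.items():
        # Count how many members of this new community came from each old cid.
        votes = Counter(
            previous_node_community[n]
            for n in members
            if n in previous_node_community
        )
        for old_cid, overlap in votes.items():
            pairs.append((overlap, old_cid, new_cid))

    # Largest overlaps are matched first; each old cid is reused at most once.
    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))
    assigned, used_old = {}, set()
    for _overlap, old_cid, new_cid in pairs:
        if new_cid in assigned or old_cid in used_old:
            continue
        assigned[new_cid] = old_cid
        used_old.add(old_cid)

    # Communities with no match get fresh ids above every previous cid.
    next_id = max(previous_node_community.values(), default=-1) + 1
    remapped = {}
    for new_cid in sorted(new_communities):
        if new_cid in assigned:
            remapped[assigned[new_cid]] = list(new_communities[new_cid])
        else:
            remapped[next_id] = list(new_communities[new_cid])
            next_id += 1
    return remapped


previous = {
    "auth.login": 0, "auth.logout": 0, "auth.token": 0,
    "db.connect": 1, "db.query": 1,
}
# After re-clustering, the grown db cluster was renumbered to cid 0.
new = {
    0: ["db.connect", "db.query", "db.migrate", "db.pool"],
    1: ["auth.login", "auth.logout", "auth.token"],
}
remapped = remap_by_overlap(new, previous)
print(remapped)
```

With the cids restored this way, a labels file keyed by cid would line up with the same clusters it described during the labeling pass.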
Actual
Labels in .graphify_labels.json are reapplied verbatim by cid index, producing
misalignment whenever the graph topology has changed between labeling and
re-clustering.
Affected file
graphify/__main__.py:1714-1722 (the elif cmd == "cluster-only": branch).
The neighbouring code path in graphify/watch.py (_rebuild_code, around
lines 465-468) calls remap_communities_to_previous correctly — cluster-only
is the only re-clustering entry point that omits it.
Suggested fix
Mirror the working pattern from watch.py. Immediately after
communities = cluster(G) in the cluster-only branch, build the previous
node→community map from _raw (already loaded a few lines above) and apply
the remap:
    from graphify.cluster import remap_communities_to_previous

    previous_node_community = {
        n["id"]: n["community"]
        for n in _raw.get("nodes", [])
        if n.get("community") is not None and n.get("id") is not None
    }
    if previous_node_community:
        communities = remap_communities_to_previous(communities, previous_node_community)
This matches the design intent stated in PR #822:
"stabilize clustering output with deterministic partition input ordering,
seeded Leiden when supported, and overlap-based remapping of new communities
to prior IDs"
It looks like cluster-only was missed when #822 was applied. If this was an
intentional omission (e.g. for a reason that does not apply to watch/update),
I'd love to understand the rationale before proposing a PR — happy to send one
mirroring the fix otherwise.
Environment
- graphifyy 0.7.18 (PyPI)
- Python 3.12.3
- Linux (WSL2 / Ubuntu 24.04)