cluster-only skips overlap-based community remapping, so labels become misaligned after re-clustering #1027

Description

@bgmbgm94

Summary

graphify cluster-only re-runs Leiden clustering and then re-applies the existing
.graphify_labels.json by raw cid index, without invoking the
remap_communities_to_previous safety net that PR #822 introduced for the
watch / update paths. As a result, community labels attach to clusters whose
members are unrelated to the original label's meaning whenever the underlying
graph has changed between the labeling pass and the cluster-only invocation.

Reproduce

# 1) Full build with Step 5 labeling
/graphify .                          # or any equivalent that writes .graphify_labels.json

# 2) Make any code change that alters cluster sizes

# 3) Re-extract code without clustering (the official #822 workflow)
graphify update --no-cluster .

# 4) Re-cluster
graphify cluster-only .

# 5) Inspect graphify-out/GRAPH_REPORT.md or graph.html
#    → community labels are attached to clusters whose actual members
#      are unrelated to the label's original meaning.

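The mismatch is deterministic: cluster() reindexes communities in descending
order of size on every run, while the prior labels file is read back verbatim by
cid index in graphify/__main__.py:1714-1722. Any change in relative cluster sizes
therefore changes which cluster a given cid refers to.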
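
To illustrate the mechanism, here is a minimal standalone sketch. It does not use graphify's code: reindex_by_size is a hypothetical stand-in for the size-descending reindexing described above, and the node names and labels are made up.

```python
def reindex_by_size(groups):
    """Assign community IDs 0, 1, 2, ... in order of decreasing size."""
    ordered = sorted(groups, key=len, reverse=True)
    return {cid: sorted(members) for cid, members in enumerate(ordered)}


# Labels written during the labeling pass, keyed by community ID.
labels = {0: "parsing", 1: "rendering"}

# Clustering at labeling time: the parsing cluster is the largest, so it gets cid 0.
before = reindex_by_size([
    {"parse_a", "parse_b", "parse_c"},
    {"render_a", "render_b"},
])

# After a code change the rendering cluster outgrows the parsing cluster,
# so the size-descending reindex swaps the two IDs.
after = reindex_by_size([
    {"parse_a", "parse_b", "parse_c"},
    {"render_a", "render_b", "render_c", "render_d"},
])

# Re-applying labels by raw cid now pairs "parsing" with the render_* nodes.
for cid, members in after.items():
    print(labels[cid], members)
```

Nothing about the node membership of either cluster has to change meaningfully for this to happen; a shift in relative size is enough.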

Expected

After cluster-only, labels should follow the same overlap-based remapping that
PR #822 added to the watch/update paths — a label associated with a previous
community should attach to the new community whose node set most overlaps with
the old one.
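
For reference, a rough sketch of what overlap-based remapping means here. This is not graphify's implementation of remap_communities_to_previous: remap_by_overlap is a hypothetical helper, the {cid: [node ids]} shape of communities and the integer IDs are assumptions, and the greedy matching strategy is just one simple way to do it.

```python
from collections import Counter


def remap_by_overlap(communities, previous_node_community):
    """Give each new community the previous ID it shares the most nodes with.

    communities: {new_cid: [node_id, ...]}    (assumed shape)
    previous_node_community: {node_id: old_cid}
    """
    # Count shared nodes for every (new_cid, old_cid) pair.
    overlaps = []
    for new_cid, members in communities.items():
        counts = Counter(
            previous_node_community[n]
            for n in members
            if n in previous_node_community
        )
        for old_cid, shared in counts.items():
            overlaps.append((shared, new_cid, old_cid))

    # Greedy matching: largest overlap first, each ID used at most once.
    assigned = {}
    used_old = set()
    for shared, new_cid, old_cid in sorted(
        overlaps, key=lambda t: (-t[0], t[1], t[2])
    ):
        if new_cid in assigned or old_cid in used_old:
            continue
        assigned[new_cid] = old_cid
        used_old.add(old_cid)

    # Unmatched communities get fresh IDs above every previously used ID.
    next_id = max(previous_node_community.values(), default=-1) + 1
    remapped = {}
    for new_cid, members in communities.items():
        if new_cid in assigned:
            remapped[assigned[new_cid]] = members
        else:
            remapped[next_id] = members
            next_id += 1
    return remapped


previous = {
    "parse_a": 0, "parse_b": 0, "parse_c": 0,
    "render_a": 1, "render_b": 1,
}
fresh = {
    0: ["render_a", "render_b", "render_c", "render_d"],
    1: ["parse_a", "parse_b", "parse_c"],
    2: ["misc_a"],
}
stable = remap_by_overlap(fresh, previous)
```

In this example the render_* cluster keeps its old ID 1 and the parse_* cluster keeps 0, so labels stored under those IDs still describe the right nodes; the brand-new cluster gets an unused ID.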

Actual

Labels in .graphify_labels.json are reapplied verbatim by cid index, producing
misalignment whenever the graph topology has changed between labeling and
re-clustering.

Affected file

graphify/__main__.py:1714-1722, the elif cmd == "cluster-only": branch.

The neighbouring code path in graphify/watch.py (_rebuild_code, around
lines 465-468) calls remap_communities_to_previous correctly — cluster-only
is the only re-clustering entry point that omits it.

Suggested fix

Mirror the working pattern from watch.py. Immediately after
communities = cluster(G) in the cluster-only branch, build the previous
node→community map from _raw (already loaded a few lines above) and apply
the remap:

from graphify.cluster import remap_communities_to_previous

previous_node_community = {
    n["id"]: n["community"]
    for n in _raw.get("nodes", [])
    if n.get("community") is not None and n.get("id") is not None
}
if previous_node_community:
    communities = remap_communities_to_previous(communities, previous_node_community)

This matches the design intent stated in PR #822:

"stabilize clustering output with deterministic partition input ordering,
seeded Leiden when supported, and overlap-based remapping of new communities
to prior IDs"

It looks like cluster-only was missed when #822 was applied. If this was an
intentional omission (e.g. for a reason that does not apply to watch/update),
I'd love to understand the rationale before proposing a PR — happy to send one
mirroring the fix otherwise.

Environment

  • graphifyy 0.7.18 (PyPI)
  • Python 3.12.3
  • Linux (WSL2 / Ubuntu 24.04)
