feat: add --missing-only flag to label command to skip already named communities by matiasduartee · Pull Request #1421 · Graphify-Labs/graphify

matiasduartee · 2026-06-21T19:49:34Z

I am working on a large codebase whose graph has over 7,500 nodes. When running `graphify label .`, the LLM backend rarely labels all 1,000+ communities in one pass: API rate limits or parsing errors make some batches fail, so some communities get real names while others are left as `Community N`. The core issue is that running `graphify label .` again discards the valid community names generated in the previous run and makes the LLM start from scratch. Communities that were named correctly in the first run can then revert to `Community N` in the second, creating an endless loop of wasted tokens.

To solve this, I introduced the `--missing-only` flag. When it is passed, `graphify label .` first loads `.graphify_labels.json`, preserves every community that already has a real name, and queries the LLM backend only for communities that are missing or still hold the `Community N` placeholder.
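The selection and merge step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names are hypothetical, the `Community N` regex is an assumed placeholder format, and the guard that stops a fresh placeholder from overwriting a real name is an illustrative design choice rather than something this PR states.

```python
import json
import re
from pathlib import Path

# Assumed placeholder format, e.g. "Community 12". The real pattern may differ.
PLACEHOLDER_RE = re.compile(r"^Community \d+$")


def is_placeholder(label):
    """True when a label is absent, blank, or still a `Community N` placeholder."""
    if label is None or not label.strip():
        return True
    return bool(PLACEHOLDER_RE.match(label.strip()))


def load_labels(path):
    """Read saved labels, converting JSON's string keys back to int community ids."""
    path = Path(path)
    if not path.exists():
        return {}
    raw = json.loads(path.read_text(encoding="utf-8"))
    return {int(cid): name for cid, name in raw.items()}


def ids_needing_labels(community_ids, existing):
    """Communities that have no saved label, or only a placeholder."""
    return [cid for cid in community_ids if is_placeholder(existing.get(cid))]


def merge_labels(existing, fresh):
    """Overlay fresh labels on existing ones.

    A fresh placeholder never replaces a real name (illustrative choice).
    """
    merged = dict(existing)
    for cid, name in fresh.items():
        if not is_placeholder(name) or is_placeholder(merged.get(cid)):
            merged[cid] = name
    return merged
```

With `existing = {0: "Auth middleware", 1: "Community 1"}` and community ids `[0, 1, 2]`, `ids_needing_labels` selects only `[1, 2]`, so community 0 is never sent to the backend.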

This makes labeling incremental for very large codebases and avoids spending LLM tokens and API costs on communities that already have names. Tested locally on my 7,500+ node graph, where it worked as expected.

matiasduartee · 2026-06-21T20:48:04Z

I just tested this implementation on a large internal project with over 7,500 nodes and 1,000+ communities, and it worked flawlessly.

Before this PR, whenever we ran `graphify label .` on a graph this size, the LLM backend would inevitably hit API rate limits or safety filters partway through. Because the original command always labeled everything from scratch, a single failed batch would wipe out hundreds of good names generated in previous runs and replace them with `Community N` placeholders again. Reaching 100% coverage was practically impossible.

With the `--missing-only` flag introduced here, we were able to label the graph incrementally. When we hit an API rate limit, we waited or swapped to a local model and ran `graphify label . --missing-only` again. Each run preserved our 700+ good labels and processed only the remaining `Community N` placeholders, until all 1,027 communities were labeled. This flag finally makes labeling viable and robust for massive codebases!
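The incremental workflow above can be illustrated with a small simulation. This is a sketch under stated assumptions: `label_pass`, `make_flaky_backend`, and `RateLimited` are hypothetical names, the backend is a fake that fails on one chosen call, and none of this is graphify's actual code.

```python
import re

PLACEHOLDER_RE = re.compile(r"^Community \d+$")  # assumed placeholder format


class RateLimited(Exception):
    """Stand-in for whatever error a real backend raises when throttled."""


def needs_label(labels, cid):
    """True when a community has no label yet, or only a placeholder."""
    name = labels.get(cid)
    return name is None or bool(PLACEHOLDER_RE.match(name))


def label_pass(labels, community_ids, backend, batch_size=2):
    """One missing-only pass: label what is still unnamed, keep everything else.

    Mutates `labels` in place and returns the number of backend calls attempted.
    """
    todo = [cid for cid in community_ids if needs_label(labels, cid)]
    calls = 0
    for start in range(0, len(todo), batch_size):
        batch = todo[start:start + batch_size]
        calls += 1
        try:
            labels.update(backend(batch))
        except RateLimited:
            # Leave the rest as placeholders; a later pass will pick them up.
            for cid in todo[start:]:
                labels.setdefault(cid, f"Community {cid}")
            break
    return calls


def make_flaky_backend(fail_on_call):
    """Fake backend that raises RateLimited on one specific call number."""
    state = {"calls": 0}

    def backend(batch):
        state["calls"] += 1
        if state["calls"] == fail_on_call:
            raise RateLimited()
        return {cid: f"Topic {cid}" for cid in batch}

    return backend
```

Running `label_pass` repeatedly against the same `labels` dict only shrinks the remaining work: names set in an earlier pass are never sent again, and once nothing is missing a pass makes zero backend calls.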

matiasduartee · 2026-06-21T20:53:06Z

As a follow-up: I just re-ran `graphify label . --missing-only` on the same 1,024 community graph after it was fully labeled. It immediately printed `Labeling 0 missing communities...` and finished without sending a single token to the LLM.

I want to emphasize that this implementation makes Graphify significantly more economical. In its current state, users are forced to burn massive amounts of tokens re-evaluating and re-naming perfectly good, established clusters on every single execution.

By matching semantic similarities and keeping established names intact, we are saving >99% in token costs and API execution time for subsequent updates to any codebase!

safishamsi

The core `--missing-only` logic is correct (relabels only unnamed / `Community N` placeholders, preserves real labels, int keys throughout). Three things to address:

1. Unrelated change bundled in: the `MAX_NODES_FOR_VIZ` 5000→15000 bump in `export.py` is unrelated to this flag. Please split it into its own PR.
2. No tests: please cover the merge/skip logic (existing named communities preserved; only placeholders relabeled).
3. Confusing interaction: `--missing-only` only takes effect on `label` (force-relabel); for plain cluster-only with an existing labels file, control returns early and the flag is silently ignored. A line in the help text clarifying that it pairs with `label` would help.

Also a couple of trailing-whitespace lines to clean up.

matiasduartee · 2026-06-23T02:26:50Z

@safishamsi Thanks for the review! I've just pushed the requested changes:

1. Reverted the `MAX_NODES_FOR_VIZ` increase back to 5000 in `export.py` (we cancelled the other PR, so this is no longer needed).
2. Added tests to cover the merge/ignore logic (`test_label_cli_missing_only_preserves_existing` in `tests/test_labeling.py`), ensuring existing named communities are preserved and only placeholders are renamed.
3. Clarified in the help text that `--missing-only` works with the `label` command, and removed the trailing whitespace.

Let me know if there's anything else!
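For readers without access to the diff, the kind of assertion such a test makes can be sketched as below. This is not the test added in this PR: `relabel_missing` is a hypothetical stand-in for the code under test, the fake labeler replaces the LLM backend, and the placeholder regex is an assumption.

```python
import re

PLACEHOLDER_RE = re.compile(r"^Community \d+$")  # assumed placeholder format


def relabel_missing(existing, community_ids, labeler):
    """Hypothetical stand-in for the code under test, not graphify's function."""
    todo = [
        cid for cid in community_ids
        if cid not in existing or PLACEHOLDER_RE.match(existing[cid])
    ]
    # Skip the backend entirely when nothing is missing.
    fresh = labeler(todo) if todo else {}
    return {**existing, **fresh}


def test_missing_only_preserves_existing():
    seen = []

    def fake_labeler(ids):
        # Record what was requested so the test can check what was skipped.
        seen.append(list(ids))
        return {cid: f"Named {cid}" for cid in ids}

    existing = {0: "Auth middleware", 1: "Community 1"}
    result = relabel_missing(existing, [0, 1, 2], fake_labeler)

    assert seen == [[1, 2]]                 # named community 0 was never sent
    assert result[0] == "Auth middleware"   # real label preserved
    assert result[1] == "Named 1"           # placeholder replaced
    assert result[2] == "Named 2"           # missing community labeled


def test_fully_labeled_graph_makes_no_calls():
    calls = []
    existing = {0: "Auth middleware", 1: "Graph export"}
    result = relabel_missing(existing, [0, 1], lambda ids: calls.append(ids) or {})

    assert calls == []          # labeler never invoked
    assert result == existing   # labels unchanged
```

Recording the ids passed to the fake labeler checks both halves of the requirement at once: what was relabeled and what was left alone.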

@jiangyq9

…#1481) `graphify label --missing-only` restricts LLM labeling to communities that are unnamed or still hold a `Community N` placeholder, preserving existing non-placeholder labels read from .graphify_labels.json and merging new labels over them. Lets a large graph be relabeled incrementally without re-naming (and paying for) communities that already have good names. Ported from PR #1481 by @jiangyq9. This supersedes the earlier #1421 by @matiasduartee, which proposed the same flag — credit to @matiasduartee for the original; #1481 is written against the current label signature (post-#1390 max-concurrency/batch-size) and merges clean, where #1421 had drifted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

safishamsi · 2026-06-27T09:12:08Z

Thanks @matiasduartee, and apologies for the overlap. The label --missing-only flag has landed on v8 (383cabd) — the same feature you proposed here. Your branch had drifted ~47 commits behind v8 (conflicts in __main__.py and the labeling tests after the #1390 concurrency changes), so the version that merged was rebased on the current signature. Credited you in the commit for proposing it first. Closing as superseded — thank you.

matiasduartee added 2 commits June 21, 2026 15:51

feat: increase default HTML visualization limit to 15,000 nodes

ffd2243

feat: add --missing-only flag to label command to skip already named communities

a4daf2a

safishamsi requested changes Jun 22, 2026

Address review comments for --missing-only logic

708ac41

jiangyq9 mentioned this pull request Jun 27, 2026

feat(label): add missing-only relabeling #1481

Closed

safishamsi closed this Jun 27, 2026

matiasduartee wants to merge 3 commits into Graphify-Labs:v8 from matiasduartee:feat/missing-only-flag