feat: add --missing-only flag to label command to skip already named communities #1421

Closed
matiasduartee wants to merge 3 commits into Graphify-Labs:v8 from matiasduartee:feat/missing-only-flag

Conversation

@matiasduartee

Contributor

I am working on a large codebase with over 7,500 nodes. When I run `graphify label .`, the LLM backend often fails to label all 1,000+ communities in one pass because of API rate limits or parsing errors: some communities get names and others are left as `Community N`. The core issue is that running `graphify label .` again discards the valid names from the previous run and relabels everything from scratch. Communities that were named correctly in the first run can revert to `Community N` in the second, so repeated runs waste tokens without converging.

To solve this, I introduced the `--missing-only` flag. When it is passed, `graphify label .` first loads `.graphify_labels.json`, preserves every community that already has a real name, and queries the LLM backend only for communities that are missing or still hold the `Community N` placeholder.
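For readers who want the behaviour spelled out, here is a minimal sketch of the idea, not the code in this PR. The helper names (`load_labels`, `needs_label`, `label_missing_only`), the placeholder regex, and the `labeler` callable standing in for the LLM backend are all assumptions made for illustration; only the file name `.graphify_labels.json`, the `Community N` placeholder, and the `Labeling N missing communities...` message come from this thread.

```python
import json
import re
from pathlib import Path

# Assumed placeholder shape ("Community 12"); the real check may differ.
PLACEHOLDER = re.compile(r"^Community \d+$")


def load_labels(path: Path) -> dict[int, str]:
    """Read a labels file, converting JSON's string keys back to int ids."""
    if not path.exists():
        return {}
    raw = json.loads(path.read_text(encoding="utf-8"))
    return {int(cid): name for cid, name in raw.items()}


def needs_label(cid: int, labels: dict[int, str]) -> bool:
    """True when a community is unnamed or still holds a placeholder."""
    name = labels.get(cid, "").strip()
    return not name or PLACEHOLDER.match(name) is not None


def label_missing_only(community_ids, existing, labeler):
    """Label only the communities that need it, then merge over existing names.

    `labeler` is a hypothetical callable: list of ids -> {id: name}.
    """
    missing = [cid for cid in community_ids if needs_label(cid, existing)]
    print(f"Labeling {len(missing)} missing communities...")
    new_labels = labeler(missing) if missing else {}
    # New labels win on conflict; untouched real names are carried through.
    return {**existing, **new_labels}
```

With this shape, a second run over a fully labeled graph selects nothing and never calls the backend, which is the property that makes repeated runs cheap.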

This makes labeling incremental for very large codebases and avoids spending LLM tokens and API costs on communities that already have names. Tested locally on my 7,500+ node graph.

@matiasduartee

Contributor Author

I just tested this implementation on a large internal project with over 7,500 nodes and 1,000+ communities, and it worked flawlessly.

Before this PR, whenever we ran graphify label . on a graph this size, the LLM backend would inevitably hit API rate limits or safety filters halfway through the process. Because the original command always labeled everything from scratch, a single failed batch would wipe out hundreds of perfectly good names generated in previous runs, replacing them all with Community N placeholders again. It was practically impossible to reach 100% coverage.

By using the --missing-only flag introduced here, we were able to incrementally label the graph. When we hit an API rate limit, we simply waited or swapped to a local model, and ran graphify label . --missing-only again. It successfully preserved our 700+ good labels and only processed the remaining Community N placeholders until we fully labeled all 1,027 communities. This flag makes the labeling process finally viable and robust for massive codebases!

@matiasduartee

Contributor Author

As a follow-up: I just re-ran `graphify label . --missing-only` on the same 1,024 community graph after it was fully labeled. It immediately printed `Labeling 0 missing communities...` and finished without sending a single token to the LLM.

I want to emphasize that this implementation makes Graphify significantly more economical. In its current state, users are forced to burn massive amounts of tokens re-evaluating and re-naming perfectly good, established clusters on every single execution.

By matching semantic similarities and keeping established names intact, we are saving >99% in token costs and API execution time for subsequent updates to any codebase!

@safishamsi safishamsi left a comment

Collaborator

The core --missing-only logic is correct (relabels only unnamed / Community N placeholders, preserves real labels, int keys throughout). Three things to address:

  • Unrelated change bundled in: the MAX_NODES_FOR_VIZ 5000→15000 bump in export.py is unrelated to this flag — please split it into its own PR.
  • No tests — please cover the merge/skip logic (existing named communities preserved; only placeholders relabeled).
  • Confusing interaction: --missing-only only takes effect on label (force-relabel); for plain cluster-only with an existing labels file, control returns early and the flag is silently ignored. A line in the help text clarifying it pairs with label would help. Also a couple of trailing-whitespace lines to clean up.
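To illustrate the help-text point, here is a hedged sketch of how such a flag might be declared with `argparse`. Whether Graphify actually uses `argparse` or subparsers is an assumption, and `build_parser` and the help wording are hypothetical; only the `label` command and the `--missing-only` flag name come from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI layout: the flag is declared on `label` only."""
    parser = argparse.ArgumentParser(prog="graphify")
    sub = parser.add_subparsers(dest="command", required=True)

    label = sub.add_parser("label", help="label communities with the LLM backend")
    label.add_argument("path", nargs="?", default=".")
    label.add_argument(
        "--missing-only",
        action="store_true",
        help=(
            "only label communities that are unnamed or still named "
            "'Community N'; existing labels are preserved. "
            "Only applies to the label command."
        ),
    )
    return parser


args = build_parser().parse_args(["label", ".", "--missing-only"])
```

Declaring the flag on the `label` subparser alone is one way to avoid a silent no-op: passing it to any other command would be rejected by the parser instead of being ignored.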

@matiasduartee

Contributor Author

@safishamsi Thanks for the review! I've just pushed the requested changes:

  1. Reverted the `MAX_NODES_FOR_VIZ` increase back to 5000 in `export.py` (we cancelled the other PR, so this is no longer needed).
  2. Added tests to cover the merge/skip logic (`test_label_cli_missing_only_preserves_existing` in `tests/test_labeling.py`), ensuring existing named communities are preserved and only placeholders are relabeled.
  3. Clarified in the help text that `--missing-only` works with the `label` command and removed the trailing whitespace.

Let me know if there's anything else!

safishamsi pushed a commit that referenced this pull request Jun 27, 2026
…#1481)

`graphify label --missing-only` restricts LLM labeling to communities that are
unnamed or still hold a `Community N` placeholder, preserving existing
non-placeholder labels read from .graphify_labels.json and merging new labels over
them. Lets a large graph be relabeled incrementally without re-naming (and paying
for) communities that already have good names.

Ported from PR #1481 by @jiangyq9. This supersedes the earlier #1421 by
@matiasduartee, which proposed the same flag — credit to @matiasduartee for the
original; #1481 is written against the current label signature (post-#1390
max-concurrency/batch-size) and merges clean, where #1421 had drifted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@safishamsi

Collaborator

Thanks @matiasduartee, and apologies for the overlap. The label --missing-only flag has landed on v8 (383cabd) — the same feature you proposed here. Your branch had drifted ~47 commits behind v8 (conflicts in __main__.py and the labeling tests after the #1390 concurrency changes), so the version that merged was rebased on the current signature. Credited you in the commit for proposing it first. Closing as superseded — thank you.

@safishamsi safishamsi closed this Jun 27, 2026