feat: add --missing-only flag to label command to skip already named communities #1421
matiasduartee wants to merge 3 commits into
I just tested this implementation on a large internal project with over 7,500 nodes and 1,000+ communities, and it worked flawlessly. Before this PR, whenever we ran By using the |
|
As a follow-up: I just re-ran I want to emphasize that this implementation makes Graphify significantly more economical. In its current state, users are forced to burn massive amounts of tokens re-evaluating and re-naming perfectly good, established clusters on every single execution. By matching semantic similarities and keeping established names intact, we are saving >99% in token costs and API execution time for subsequent updates to any codebase! |
safishamsi left a comment
The core --missing-only logic is correct (relabels only unnamed / Community N placeholders, preserves real labels, int keys throughout). Three things to address:
- Unrelated change bundled in: the `MAX_NODES_FOR_VIZ` 5000→15000 bump in `export.py` is unrelated to this flag; please split it into its own PR.
- No tests: please cover the merge/skip logic (existing named communities preserved; only placeholders relabeled).
- Confusing interaction: `--missing-only` only takes effect on `label` (force-relabel); for plain `cluster-only` with an existing labels file, control returns early and the flag is silently ignored. A line in the help text clarifying it pairs with `label` would help.

Also a couple of trailing-whitespace lines to clean up.
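One way the help-text point could look, as a minimal sketch assuming an `argparse`-based CLI. The parser layout, the `build_parser` name, and the help wording are illustrative assumptions, not Graphify's actual code:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser layout; Graphify's real CLI may be organised differently."""
    parser = argparse.ArgumentParser(prog="graphify")
    sub = parser.add_subparsers(dest="command", required=True)

    label = sub.add_parser("label", help="(re)label communities with the LLM backend")
    label.add_argument("path", help="project directory")
    label.add_argument(
        "--missing-only",
        action="store_true",
        help=(
            "only label communities that are unnamed or still hold a "
            "'Community N' placeholder; applies to the label command only"
        ),
    )
    return parser


# Parse a sample command line to show the resulting attribute name.
args = build_parser().parse_args(["label", ".", "--missing-only"])
print(args.command, args.missing_only)
```

Registering the flag on the `label` subparser alone would make `argparse` reject it on other commands rather than ignore it silently. Whether that fits Graphify's real command structure is an open question.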
|
@safishamsi Thanks for the review! I've just pushed the requested changes:
Let me know if there's anything else! |
…#1481)

`graphify label --missing-only` restricts LLM labeling to communities that are unnamed or still hold a `Community N` placeholder, preserving existing non-placeholder labels read from .graphify_labels.json and merging new labels over them. Lets a large graph be relabeled incrementally without re-naming (and paying for) communities that already have good names.

Ported from PR #1481 by @jiangyq9. This supersedes the earlier #1421 by @matiasduartee, which proposed the same flag. Credit to @matiasduartee for the original; #1481 is written against the current label signature (post-#1390 max-concurrency/batch-size) and merges clean, where #1421 had drifted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
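The "read existing labels, merge new ones over them" step can be sketched as below. This assumes `.graphify_labels.json` is a flat mapping from community id to name; that layout and the helper names `load_labels` and `merge_labels` are assumptions for illustration, not the merged code:

```python
import json
import tempfile
from pathlib import Path


def load_labels(path: Path) -> dict[int, str]:
    """Read a labels file, converting JSON's string keys back to int community ids."""
    if not path.exists():
        return {}
    raw = json.loads(path.read_text(encoding="utf-8"))
    return {int(cid): name for cid, name in raw.items()}


def merge_labels(existing: dict[int, str], new: dict[int, str]) -> dict[int, str]:
    """New labels win on conflict; ids absent from `new` keep their old name."""
    return {**existing, **new}


with tempfile.TemporaryDirectory() as tmp:
    # Assumed file layout: {"<community id>": "<label>"}.
    labels_path = Path(tmp) / ".graphify_labels.json"
    labels_path.write_text(
        json.dumps({"0": "Auth Layer", "1": "Community 1"}), encoding="utf-8"
    )

    existing = load_labels(labels_path)
    merged = merge_labels(existing, {1: "Graph Export"})
    # json.dumps turns int keys into strings, so load_labels converts them back.
    labels_path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    reloaded = load_labels(labels_path)

print(reloaded)
```

JSON object keys are always strings, so converting them to `int` on load keeps ids consistent between the file and the in-memory graph.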
|
Thanks @matiasduartee, and apologies for the overlap. The |
I am working on a massive codebase that has over 7,500 nodes. When running `graphify label .`, the LLM backend often struggles to label all 1,000+ communities in one go because of API rate limits or parsing errors: it labels some successfully and leaves others as `Community N`. The core issue is that running `graphify label .` again wipes out the valid community names generated in the previous run, forcing the LLM to start from scratch. Sometimes communities that were correctly named in the first run revert to `Community N` in the second run, creating an endless loop of wasted tokens.

To solve this, I introduced the `--missing-only` flag. When passed, `graphify label .` first loads `.graphify_labels.json`, preserves all successfully named communities, and queries the LLM backend only for communities that are missing or still have the `Community N` placeholder.

This makes labeling incremental and robust for very large codebases and saves a large amount of LLM tokens and API cost. Tested locally on my 7,500+ node graph with perfect success.
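The incremental behaviour described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the PR's code: `ids_to_label`, `label_missing`, and `flaky_backend` are hypothetical names, and the stub backend stands in for an LLM call that fails on some communities during the first run.

```python
import re
from typing import Callable

# Assumed placeholder format, taken from the "Community N" wording in this PR.
PLACEHOLDER = re.compile(r"Community \d+")


def ids_to_label(community_ids: list[int], labels: dict[int, str]) -> list[int]:
    """Return ids with no label, an empty label, or a 'Community N' placeholder."""
    todo = []
    for cid in community_ids:
        name = (labels.get(cid) or "").strip()
        if not name or PLACEHOLDER.fullmatch(name):
            todo.append(cid)
    return todo


def label_missing(
    community_ids: list[int],
    labels: dict[int, str],
    label_fn: Callable[[list[int]], dict[int, str]],
) -> dict[int, str]:
    """Query label_fn only for unlabeled ids and merge its results over `labels`."""
    todo = ids_to_label(community_ids, labels)
    fresh = label_fn(todo) if todo else {}
    return {**labels, **{cid: fresh[cid] for cid in todo if cid in fresh}}


calls: list[list[int]] = []


def flaky_backend(ids: list[int]) -> dict[int, str]:
    """Stand-in for the LLM: records each request; odd ids fail on the first call."""
    calls.append(list(ids))
    return {cid: f"Topic {cid}" for cid in ids if cid % 2 == 0 or len(calls) > 1}


first = label_missing([0, 1, 2, 3], {}, flaky_backend)
second = label_missing([0, 1, 2, 3], first, flaky_backend)
print(calls)
print(second)
```

In this simulation the second run requests only the two ids that failed the first time, and the names produced by the first run are carried over unchanged.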