Summary
label_communities (llm.py:1855) uses a hardcoded max_tokens formula that ignores the existing _resolve_max_tokens() helper (llm.py:232), making GRAPHIFY_MAX_OUTPUT_TOKENS a dead env var for the labeling path. On graphs with many communities, the formula undershoots and the LLM response is truncated, producing unparseable JSON.
How we hit it
- graphify extract . --backend deepseek on a Rust project (~215 files) → 1650 nodes, 3596 edges
- Clustering produced 92 communities
- graphify label . --backend deepseek failed with:
[graphify label] warning: community labeling failed (Unterminated string starting at: line 7 column 8 (char 186)); using Community N placeholders.
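The error text matches what Python's json module reports when a payload ends inside a quoted string. A minimal sketch of that failure mode, using a made-up two-label response (the labels and the cut point are illustrative assumptions, not graphify output):

```python
import json
from typing import Optional

# Hypothetical two-label response; the real prompt asks for one entry per community.
full = json.dumps({"0": "Parser Core", "1": "Token Stream Utilities"}, indent=2)

# Simulate the token limit being reached partway through the second label.
cut = full.index("Stream") + 3
truncated = full[:cut]


def parse_error(text: str) -> Optional[str]:
    """Return the JSON decoder's error message for text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return exc.msg
    return None


print(parse_error(full))
print(parse_error(truncated))
```

The complete payload parses, while the cut-off one fails with the same "Unterminated string starting at" message seen in the warning.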
Root cause
label_communities at line 1855 calculates its own token budget:
max_tokens = min(40 + 16 * len(labeled_cids), 4096)
# 92 communities → min(40 + 1472, 4096) = 1512 tokens
The prompt instructs the model to output a JSON object mapping 92 community IDs to 2-5 word labels. At ~15-20 tokens per entry (JSON key + quoted string + comma), the response needs roughly 1,400-1,800 tokens. The 1512-token budget sits near the low end of that range (about 16 tokens per entry), so any response with slightly longer labels runs out of tokens mid-response, truncating the JSON string and producing an unparseable payload.
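The arithmetic can be checked with a short sketch. The formula is the one quoted above; the 15-20 tokens-per-entry range is this report's estimate, not a measured value, and the function names are illustrative:

```python
def label_budget(num_communities: int) -> int:
    """Token budget as computed by the formula quoted from label_communities."""
    return min(40 + 16 * num_communities, 4096)


def estimated_need(num_communities: int, tokens_per_entry: int) -> int:
    """Rough response size, assuming a fixed token cost per JSON entry."""
    return num_communities * tokens_per_entry


n = 92
budget = label_budget(n)        # 40 + 16 * 92 = 1512
low = estimated_need(n, 15)     # 1380
high = estimated_need(n, 20)    # 1840

print(f"budget={budget}, estimated need={low}-{high}")
print(f"shortfall at the high estimate: {high - budget} tokens")
```

The budget only covers the response when entries average about 16 tokens or fewer, which leaves almost no margin for longer labels or extra whitespace.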
Meanwhile, _resolve_max_tokens() exists at line 232 specifically to let users override token budgets via GRAPHIFY_MAX_OUTPUT_TOKENS, and the main extraction path uses it (lines 1121, 1152). But label_communities never calls it — the env var is silently ignored for labeling.
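One possible shape for a fix, as a sketch only. The real signature and behavior of _resolve_max_tokens() are not shown in this report, so resolve_max_tokens_sketch below is a hypothetical stand-in that treats the env var as an override of a computed default; label_max_tokens is likewise an illustrative name:

```python
import os


def resolve_max_tokens_sketch(default: int) -> int:
    """Hypothetical stand-in for _resolve_max_tokens(); the real helper may differ."""
    raw = os.environ.get("GRAPHIFY_MAX_OUTPUT_TOKENS", "").strip()
    # Accept only positive integers; fall back to the computed default otherwise.
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    return default


def label_max_tokens(num_communities: int) -> int:
    """Labeling budget that honors the env var instead of ignoring it."""
    computed = min(40 + 16 * num_communities, 4096)
    return resolve_max_tokens_sketch(computed)
```

With this arrangement, setting GRAPHIFY_MAX_OUTPUT_TOKENS=8192 would raise the labeling budget without editing llm.py, and an unset or invalid value would keep the computed default.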
Workaround
Patched line 1855 locally to increase the per-community token allocation:
# Original:
max_tokens = min(40 + 16 * len(labeled_cids), 4096)
# Patched:
max_tokens = min(80 + 45 * len(labeled_cids), 4096)
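This gives the model enough headroom to complete the full JSON response without truncation: for 92 communities the formula yields 80 + 45 × 92 = 4220, which the cap reduces to 4096 tokens. After applying this patch, graphify label . --backend deepseek succeeded and produced names for all 92 communities. The patched line still bypasses GRAPHIFY_MAX_OUTPUT_TOKENS, so it is a stopgap rather than a fix.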
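A quick check of where the patched formula stops growing, as a sketch. Only the formula comes from the patch; the helper name and the search bound are illustrative:

```python
def patched_budget(num_communities: int) -> int:
    """Token budget under the locally patched formula."""
    return min(80 + 45 * num_communities, 4096)


uncapped_92 = 80 + 45 * 92  # 4220, above the 4096 cap

# Smallest community count at which the 4096 cap takes over.
first_capped = next(n for n in range(1, 1000) if 80 + 45 * n >= 4096)

print(patched_budget(92), uncapped_92, first_capped)
print(f"tokens per community at 92: {patched_budget(92) / 92:.1f}")
print(f"tokens per community at 250: {patched_budget(250) / 250:.1f}")
```

Because the budget is flat from 90 communities upward, the per-community allowance keeps shrinking on larger graphs. At the 15-20 tokens-per-entry estimate used earlier, truncation could reappear somewhere past roughly 200 communities unless the cap itself can be raised.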