label_communities ignores _resolve_max_tokens() — hardcoded token budget truncates label responses for graphs with 50+ communities #1200

Description

@barduinor

Summary

label_communities (llm.py:1855) uses a hardcoded max_tokens formula that ignores the existing _resolve_max_tokens() helper (llm.py:232), making GRAPHIFY_MAX_OUTPUT_TOKENS a dead env var for the labeling path. On graphs with many communities, the formula undershoots and the LLM response is truncated, producing unparseable JSON.

How we hit it

  1. graphify extract . --backend deepseek on a Rust project (~215 files) → 1650 nodes, 3596 edges
  2. Clustering produced 92 communities
  3. graphify label . --backend deepseek failed with:
    [graphify label] warning: community labeling failed (Unterminated string starting at: line 7 column 8 (char 186)); using Community N placeholders.
    

Root cause

label_communities at line 1855 calculates its own token budget:

max_tokens = min(40 + 16 * len(labeled_cids), 4096)
# 92 communities → min(40 + 1472, 4096) = 1512 tokens

The prompt instructs the model to output a JSON object mapping 92 community IDs to 2-5 word labels. At ~15-20 tokens per entry (JSON key + quoted string + comma), the response needs roughly 1,400-1,800 tokens (92 × 15 = 1,380 to 92 × 20 = 1,840). With a budget of 1,512, the model can run out of tokens mid-response, leaving a JSON string unterminated and the payload unparseable.
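To make the arithmetic concrete, here is a small standalone sketch comparing the hardcoded budget with the estimated response size. The function names are illustrative and not taken from llm.py, and the 15-20 tokens-per-entry figure is the rough estimate from this report, not a measured value.

```python
def hardcoded_budget(n_communities: int) -> int:
    """Token budget as computed by the formula quoted from line 1855."""
    return min(40 + 16 * n_communities, 4096)


def estimated_need(n_communities: int, tokens_per_entry: int) -> int:
    """Rough response size, assuming a fixed token cost per JSON entry."""
    return n_communities * tokens_per_entry


n = 92
budget = hardcoded_budget(n)
low = estimated_need(n, 15)   # optimistic estimate
high = estimated_need(n, 20)  # pessimistic estimate

print(f"budget={budget}, estimated need={low}-{high}")
print("may truncate" if high > budget else "fits")
```

The budget only covers the response if entries average about 16 tokens or fewer, so any model that tokenizes the labels less compactly will overrun it.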

Meanwhile, _resolve_max_tokens() exists at line 232 specifically to let users override token budgets via GRAPHIFY_MAX_OUTPUT_TOKENS, and the main extraction path uses it (lines 1121, 1152). But label_communities never calls it — the env var is silently ignored for labeling.
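One possible shape for a fix is to keep the formula as the default and let the env var override it. The sketch below is an assumption about how that could look: the real signature and behavior of _resolve_max_tokens() are not shown in this report, so resolve_max_tokens_sketch is a hypothetical stand-in, not the project's helper.

```python
import os
from typing import Mapping


def resolve_max_tokens_sketch(default: int, env: Mapping[str, str] = os.environ) -> int:
    """Hypothetical stand-in for _resolve_max_tokens(): prefer the env var if valid."""
    raw = env.get("GRAPHIFY_MAX_OUTPUT_TOKENS", "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        # Unparseable override: fall back to the computed default.
        return default
    return value if value > 0 else default


def label_budget_sketch(n_communities: int, env: Mapping[str, str] = os.environ) -> int:
    """Labeling budget: the existing formula, unless the user overrides it."""
    formula = min(40 + 16 * n_communities, 4096)
    return resolve_max_tokens_sketch(formula, env)


# Without the env var the formula applies; with it, the user's value wins.
print(label_budget_sketch(92, {}))
print(label_budget_sketch(92, {"GRAPHIFY_MAX_OUTPUT_TOKENS": "8192"}))
```

Whether an override should replace the formula outright or only raise its cap is a design decision for the maintainers; this sketch takes the simpler replace-outright approach.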

Workaround

Patched line 1855 locally to increase the per-community token allocation:

# Original:
max_tokens = min(40 + 16 * len(labeled_cids), 4096)

# Patched:
max_tokens = min(80 + 45 * len(labeled_cids), 4096)

This raises the budget for 92 communities to the 4,096-token cap (80 + 45 × 92 = 4,220 before capping), which gives the model enough headroom to complete the full JSON response without truncation. After applying this patch, graphify label . --backend deepseek succeeded and produced names for all 92 communities.
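A quick check of the patched formula shows where the 4096 cap starts to bind. This is a standalone sketch with illustrative names. The 20 tokens-per-entry figure is the upper estimate from the root-cause analysis, and the 300-community graph is a hypothetical example, not a reported case.

```python
CAP = 4096


def patched_budget(n_communities: int) -> int:
    """Token budget under the locally patched formula."""
    return min(80 + 45 * n_communities, CAP)


# Smallest community count at which the patched formula reaches the cap.
first_capped = next(n for n in range(1, 10_000) if 80 + 45 * n >= CAP)

print(patched_budget(92))  # already capped for the graph in this report
print(first_capped)

# Hypothetical larger graph: at ~20 tokens per entry the estimated
# response size exceeds the cap regardless of the per-community factor.
n_large = 300
print(n_large * 20 > patched_budget(n_large))
```

Because the cap is reached at about 90 communities, the patch is a workaround for graphs of this size rather than a general fix; larger graphs would still depend on an override such as GRAPHIFY_MAX_OUTPUT_TOKENS being honored.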
