
fix: improve token counting accuracy #64

@platypusrex

Description

Problem

Token counting has known inaccuracies that affect the sidebar display and compaction trigger timing:

  1. Heuristic overestimates by ~40%. Our char-based heuristic (2.5 chars/token) estimated ~113k tokens for a conversation where prompt_eval_count reported ~80k. The real ratio for code-heavy conversations appears closer to 3.5-4 chars/token.

  2. No heuristic calibration. After each model call, we have ground truth (prompt_eval_count) that could calibrate the heuristic for subsequent estimates, but we discard it.

  3. Tool schema overhead is a magic number. OVERHEAD_AGENT_LOOP = 6,000 is a guess. The actual tool JSON schemas should be measured once at agent init.

  4. Sidebar and agent loop use different overhead constants. OVERHEAD_SIDEBAR = 10,000 vs OVERHEAD_AGENT_LOOP = 6,000 — these should be unified or at least derived from the same base measurement.

  5. Context limit from /api/show may not match operational limit. The context_length from model_info reports the model architecture's maximum, but the actual limit depends on num_ctx configuration. We should parse num_ctx from the parameters field of /api/show responses and use it when present.
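The overestimation in point 1 is easy to reproduce with the numbers above. This is a hypothetical sketch of the char-based heuristic (the actual names in `src/lib/tokenizer.ts` may differ), showing how a 2.5 chars/token divisor inflates the estimate for a code-heavy conversation:

```typescript
// Hypothetical reconstruction of the char-based heuristic; real identifiers
// in src/lib/tokenizer.ts may differ.
const CHARS_PER_TOKEN = 2.5;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// A ~282.5k-char conversation reproduces the reported drift:
const chars = 282_500;
const estimated = Math.ceil(chars / CHARS_PER_TOKEN); // 113,000 (our estimate)
const real = 80_000;                                  // prompt_eval_count (ground truth)
const impliedRatio = chars / real;                    // 3.53125 chars/token
```

The implied ratio (~3.5 chars/token) is what motivates calibrating against `prompt_eval_count` rather than hardcoding a new divisor.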

Solution

  1. Calibrate heuristic with real counts. After each model call that returns prompt_eval_count, compute correction = realTokens / estimatedTokens. Apply this to subsequent heuristic estimates (stored per-session, reset on model change).

  2. Measure tool schema overhead dynamically. At agent init, serialize all tool schemas to JSON, count characters, estimate tokens. Use this instead of a constant.

  3. Unify overhead calculation. Single function that computes overhead from system prompt + tool schemas. Both sidebar and agent loop call it.

  4. Parse num_ctx from /api/show parameters. When present, use min(context_length, num_ctx) as the effective context limit.

  5. Better debug logging. Log total message payload size (chars), estimated tokens, real tokens (when available), and the correction factor, so future token-accounting drift is diagnosable from logs alone.
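Items 1-4 above could fit together along these lines. This is a sketch under assumed names (`estimateTokens`, `calibrate`, `measureOverhead`, the `Tool` shape), not the actual codebase API:

```typescript
// Sketch of the calibration + unified-overhead scheme; all identifiers here
// are assumptions for illustration, not the real exports.
const CHARS_PER_TOKEN = 2.5;

interface Tool { name: string; description: string; parameters: object }

let correction = 1.0; // per-session; reset to 1.0 on model change

function estimateTokens(text: string): number {
  return Math.ceil((text.length / CHARS_PER_TOKEN) * correction);
}

// (1) After each model call that reports prompt_eval_count:
function calibrate(realTokens: number, estimatedTokens: number): void {
  if (realTokens > 0 && estimatedTokens > 0) {
    correction = realTokens / estimatedTokens;
  }
}

// (2)+(3) Single overhead source, called by both sidebar and agent loop,
// replacing OVERHEAD_SIDEBAR / OVERHEAD_AGENT_LOOP:
function measureOverhead(systemPrompt: string, tools: Tool[]): number {
  const schemaJson = tools.map((t) => JSON.stringify(t)).join("");
  return estimateTokens(systemPrompt) + estimateTokens(schemaJson);
}

// (4) Effective context limit:
function effectiveContextLimit(contextLength: number, numCtx?: number): number {
  return numCtx ? Math.min(contextLength, numCtx) : contextLength;
}
```

Storing `correction` per session (rather than globally) keeps a code-heavy session from skewing estimates for a prose-heavy one.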

Key files

  • src/lib/tokenizer.ts — heuristic, overhead constants, fetchModelInfo
  • src/agent/index.ts — buildContextUsage(), agent loop compaction check
  • src/tui/hooks/use-agent-context.ts — sidebar stats, OVERHEAD_SIDEBAR
  • src/agent/stream-handler.ts — captures prompt_eval_count/eval_count

Research context

  • opencode uses provider-reported counts only (via Vercel AI SDK usage response), with a 4 chars/token heuristic only for pruning tool outputs. No local tokenizer.
  • Ollama source (runner/ollamarunner/runner.go): prompt_eval_count = seq.numPromptInputs = total prompt tokens before KV cache trimming. It IS the full prompt size, not incremental.
  • /api/show response includes both model_info.{arch}.context_length (architecture max) and parameters string which may contain num_ctx (configured operational limit).
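Since the `parameters` field of `/api/show` is a plain-text blob of `key value` lines rather than structured JSON, extracting `num_ctx` takes a small line scan. A sketch, assuming the textual `key value` format described above (the function name is hypothetical):

```typescript
// Extract num_ctx from the plain-text `parameters` field of an Ollama
// /api/show response, e.g.:
//   num_ctx                        8192
//   stop                           "<|user|>"
// Returns undefined when num_ctx is not configured.
function parseNumCtx(parameters: string | undefined): number | undefined {
  if (!parameters) return undefined;
  for (const line of parameters.split("\n")) {
    const match = line.match(/^num_ctx\s+(\d+)\s*$/);
    if (match) return parseInt(match[1], 10);
  }
  return undefined;
}
```

The result would then feed `min(context_length, num_ctx)` as proposed in Solution item 4.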
