Problem
Token counting has known inaccuracies that affect the sidebar display and compaction trigger timing:
- **Heuristic overestimates by ~40%.** Our char-based heuristic (2.5 chars/token) estimated ~113k tokens for a conversation where `prompt_eval_count` reported ~80k. The real ratio for code-heavy conversations appears closer to 3.5-4 chars/token.
- **No heuristic calibration.** After each model call we have ground truth (`prompt_eval_count`) that could calibrate the heuristic for subsequent estimates, but we discard it.
- **Tool schema overhead is a magic number.** `OVERHEAD_AGENT_LOOP = 6,000` is a guess. The actual tool JSON schemas should be measured once at agent init.
- **Sidebar and agent loop use different overhead constants.** `OVERHEAD_SIDEBAR = 10,000` vs `OVERHEAD_AGENT_LOOP = 6,000` — these should be unified, or at least derived from the same base measurement.
- **Context limit from `/api/show` may not match the operational limit.** The `context_length` from `model_info` reports the model architecture's maximum, but the actual limit depends on the `num_ctx` configuration. We should parse `num_ctx` from the `parameters` field of `/api/show` responses and use it when present.
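To make the first bullet concrete, here is a minimal sketch of the char-based estimate at the current 2.5 chars/token divisor versus the observed ~3.5 chars/token ratio. The conversation size (280k chars) and the function name are illustrative, not taken from the actual `src/lib/tokenizer.ts`:

```typescript
// Illustrative only: shows how the divisor choice produces the ~40% gap.
const CHARS_PER_TOKEN_CURRENT = 2.5;  // current heuristic divisor
const CHARS_PER_TOKEN_OBSERVED = 3.5; // closer to reality for code-heavy chats

function estimateTokens(chars: number, charsPerToken: number): number {
  return Math.ceil(chars / charsPerToken);
}

// A hypothetical 280k-char conversation:
const chars = 280_000;
console.log(estimateTokens(chars, CHARS_PER_TOKEN_CURRENT));  // 112000 (overestimate)
console.log(estimateTokens(chars, CHARS_PER_TOKEN_OBSERVED)); // 80000 (near prompt_eval_count)
```

At 2.5 chars/token the same payload looks 40% larger than what `prompt_eval_count` later reports, which matches the ~113k vs ~80k discrepancy above.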
Solution
- **Calibrate the heuristic with real counts.** After each model call that returns `prompt_eval_count`, compute `correction = realTokens / estimatedTokens` and apply it to subsequent heuristic estimates (stored per session, reset on model change).
- **Measure tool schema overhead dynamically.** At agent init, serialize all tool schemas to JSON, count characters, and estimate tokens. Use this instead of a constant.
- **Unify the overhead calculation.** A single function computes overhead from the system prompt plus tool schemas; both the sidebar and the agent loop call it.
- **Parse `num_ctx` from `/api/show` parameters.** When present, use `min(context_length, num_ctx)` as the effective context limit.
- **Better debug logging.** Log the total message payload size (chars), estimated tokens, real tokens (when available), and the correction factor. This makes future debugging possible.
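The calibration step above can be sketched as a small per-session estimator. This is a proposal sketch, not the actual tokenizer API; the class and method names are hypothetical:

```typescript
// Sketch of per-session calibration against prompt_eval_count ground truth.
class CalibratedEstimator {
  private correction = 1.0; // no ground truth yet

  // Char-based first guess (2.5 chars/token), scaled by the learned correction.
  estimate(chars: number, charsPerToken = 2.5): number {
    return Math.ceil((chars / charsPerToken) * this.correction);
  }

  // Called after each model call that returns prompt_eval_count.
  calibrate(realTokens: number, estimatedTokens: number): void {
    if (estimatedTokens > 0) this.correction = realTokens / estimatedTokens;
  }

  // Model changed: the learned ratio no longer applies.
  reset(): void {
    this.correction = 1.0;
  }
}

const est = new CalibratedEstimator();
const first = est.estimate(280_000);  // 112000: uncalibrated, overestimates
est.calibrate(80_000, first);         // correction ~0.714 from ground truth
const second = est.estimate(280_000); // ~80000: subsequent estimates corrected
```

Storing the correction per session keeps a code-heavy conversation's ratio from leaking into an unrelated one, and resetting on model change avoids calibrating against a different tokenizer.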
Key files
- `src/lib/tokenizer.ts` — heuristic, overhead constants, `fetchModelInfo`
- `src/agent/index.ts` — `buildContextUsage()`, agent loop compaction check
- `src/tui/hooks/use-agent-context.ts` — sidebar stats, `OVERHEAD_SIDEBAR`
- `src/agent/stream-handler.ts` — captures `prompt_eval_count` / `eval_count`
Research context
- opencode uses provider-reported counts only (via the Vercel AI SDK `usage` response), with a 4 chars/token heuristic used only for pruning tool outputs. No local tokenizer.
- In the Ollama source (`runner/ollamarunner/runner.go`), `prompt_eval_count` = `seq.numPromptInputs`, the total prompt token count before KV cache trimming. It IS the full prompt size, not an incremental count.
- The `/api/show` response includes both `model_info.{arch}.context_length` (the architecture maximum) and a `parameters` string that may contain `num_ctx` (the configured operational limit).
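Given that `parameters` is a plain-text block of `key value` lines, extracting `num_ctx` and clamping the effective limit could look like the sketch below. The function names are illustrative, and the exact `parameters` formatting (whitespace-separated lines) is an assumption about Ollama's output:

```typescript
// Sketch: pull num_ctx out of /api/show's plain-text `parameters` field.
function parseNumCtx(parameters: string | undefined): number | undefined {
  if (!parameters) return undefined;
  // Assumed format: one "key value" pair per line, e.g. "num_ctx 8192".
  const match = parameters.match(/^num_ctx\s+(\d+)\s*$/m);
  return match ? Number(match[1]) : undefined;
}

function effectiveContextLimit(contextLength: number, parameters?: string): number {
  const numCtx = parseNumCtx(parameters);
  // Use the smaller of the architecture max and the configured num_ctx.
  return numCtx !== undefined ? Math.min(contextLength, numCtx) : contextLength;
}

console.log(effectiveContextLimit(131_072, 'num_ctx 8192\nstop "<|end|>"')); // 8192
console.log(effectiveContextLimit(131_072));                                 // 131072
```

Falling back to `context_length` when `num_ctx` is absent keeps today's behavior for models with no explicit configuration.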
Related
- Follows from #61 ("fix: redesign compaction — never alter chat history, fix token accuracy, use subagent for summarization"): these are the remaining Part B items of the compaction redesign.
- See `docs/context-compaction-research.md` for the cross-tool comparison.