
fix: improve token counting accuracy #64

@platypusrex

Description

Problem

Token counting has known inaccuracies that affect the sidebar display and compaction trigger timing:

  1. Heuristic overestimates by ~40%. Our char-based heuristic (2.5 chars/token) estimated ~113k tokens for a conversation where prompt_eval_count reported ~80k. The real ratio for code-heavy conversations appears closer to 3.5-4 chars/token.

  2. No heuristic calibration. After each model call, we have ground truth (prompt_eval_count) that could calibrate the heuristic for subsequent estimates, but we discard it.

  3. Tool schema overhead is a magic number. OVERHEAD_AGENT_LOOP = 6,000 is a guess. The actual tool JSON schemas should be measured once at agent init.

  4. Sidebar and agent loop use different overhead constants. OVERHEAD_SIDEBAR = 10,000 vs OVERHEAD_AGENT_LOOP = 6,000 — these should be unified or at least derived from the same base measurement.

  5. Context limit from /api/show may not match operational limit. The context_length from model_info reports the model architecture's maximum, but the actual limit depends on num_ctx configuration. We should parse num_ctx from the parameters field of /api/show responses and use it when present.
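The overestimation in point 1 is easy to reproduce with the numbers above. This is a hypothetical sketch of the char-based heuristic (the actual names in `src/lib/tokenizer.ts` may differ), showing how a 2.5 chars/token divisor inflates the estimate for a code-heavy conversation:

```typescript
// Hypothetical reconstruction of the char-based heuristic; real identifiers
// in src/lib/tokenizer.ts may differ.
const CHARS_PER_TOKEN = 2.5;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// A ~282.5k-char conversation reproduces the reported drift:
const chars = 282_500;
const estimated = Math.ceil(chars / CHARS_PER_TOKEN); // 113,000 (our estimate)
const real = 80_000;                                  // prompt_eval_count (ground truth)
const impliedRatio = chars / real;                    // 3.53125 chars/token
```

The implied ratio (~3.5 chars/token) is what motivates calibrating against `prompt_eval_count` rather than hardcoding a new divisor.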

Solution

  1. Calibrate heuristic with real counts. After each model call that returns prompt_eval_count, compute correction = realTokens / estimatedTokens. Apply this to subsequent heuristic estimates (stored per-session, reset on model change).

  2. Measure tool schema overhead dynamically. At agent init, serialize all tool schemas to JSON, count characters, estimate tokens. Use this instead of a constant.

  3. Unify overhead calculation. Single function that computes overhead from system prompt + tool schemas. Both sidebar and agent loop call it.

  4. Parse num_ctx from /api/show parameters. When present, use min(context_length, num_ctx) as the effective context limit.

  5. Better debug logging. Log total message payload size (chars), estimated tokens, real tokens (when available), and the correction factor, so future token-accounting drift is diagnosable from logs alone.
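Items 1-4 above could fit together along these lines. This is a sketch under assumed names (`estimateTokens`, `calibrate`, `measureOverhead`, the `Tool` shape), not the actual codebase API:

```typescript
// Sketch of the calibration + unified-overhead scheme; all identifiers here
// are assumptions for illustration, not the real exports.
const CHARS_PER_TOKEN = 2.5;

interface Tool { name: string; description: string; parameters: object }

let correction = 1.0; // per-session; reset to 1.0 on model change

function estimateTokens(text: string): number {
  return Math.ceil((text.length / CHARS_PER_TOKEN) * correction);
}

// (1) After each model call that reports prompt_eval_count:
function calibrate(realTokens: number, estimatedTokens: number): void {
  if (realTokens > 0 && estimatedTokens > 0) {
    correction = realTokens / estimatedTokens;
  }
}

// (2)+(3) Single overhead source, called by both sidebar and agent loop,
// replacing OVERHEAD_SIDEBAR / OVERHEAD_AGENT_LOOP:
function measureOverhead(systemPrompt: string, tools: Tool[]): number {
  const schemaJson = tools.map((t) => JSON.stringify(t)).join("");
  return estimateTokens(systemPrompt) + estimateTokens(schemaJson);
}

// (4) Effective context limit:
function effectiveContextLimit(contextLength: number, numCtx?: number): number {
  return numCtx ? Math.min(contextLength, numCtx) : contextLength;
}
```

Storing `correction` per session (rather than globally) keeps a code-heavy session from skewing estimates for a prose-heavy one.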

Key files

  • src/lib/tokenizer.ts — heuristic, overhead constants, fetchModelInfo
  • src/agent/index.ts — buildContextUsage(), agent loop compaction check
  • src/tui/hooks/use-agent-context.ts — sidebar stats, OVERHEAD_SIDEBAR
  • src/agent/stream-handler.ts — captures prompt_eval_count/eval_count

Research context

  • opencode uses provider-reported counts only (via Vercel AI SDK usage response), with a 4 chars/token heuristic only for pruning tool outputs. No local tokenizer.
  • Ollama source (runner/ollamarunner/runner.go): prompt_eval_count = seq.numPromptInputs = total prompt tokens before KV cache trimming. It IS the full prompt size, not incremental.
  • /api/show response includes both model_info.{arch}.context_length (architecture max) and parameters string which may contain num_ctx (configured operational limit).
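Since the `parameters` field of `/api/show` is a plain-text blob of `key value` lines rather than structured JSON, extracting `num_ctx` takes a small line scan. A sketch, assuming the textual `key value` format described above (the function name is hypothetical):

```typescript
// Extract num_ctx from the plain-text `parameters` field of an Ollama
// /api/show response, e.g.:
//   num_ctx                        8192
//   stop                           "<|user|>"
// Returns undefined when num_ctx is not configured.
function parseNumCtx(parameters: string | undefined): number | undefined {
  if (!parameters) return undefined;
  for (const line of parameters.split("\n")) {
    const match = line.match(/^num_ctx\s+(\d+)\s*$/);
    if (match) return parseInt(match[1], 10);
  }
  return undefined;
}
```

The result would then feed `min(context_length, num_ctx)` as proposed in Solution item 4.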
