| title | Cost benchmark |
|---|---|
| description | Methodology and measured results for ZO's --low-token cost-saving preset, benchmarked against the MNIST end-to-end reference run. |
This page documents the methodology and results for the --low-token cost benchmark, a controlled comparison between default and low-token modes on ZO's canonical MNIST end-to-end reference run.
MNIST is ZO's canonical reference because:
- Stable oracle: the must-pass tier (95% test accuracy) is well within the architecture's capability
- Six full phases: exercises Phase 1 (data) through Phase 6 (packaging), the full lifecycle
- Phase-4 iterations matter: non-trivially, the autonomous loop matters for cost
- Reproducible: the plan, data, and oracle don't drift between runs
- Documented baseline: session-005 gave us a precise cost ($11) and wall-time (~50min) anchor point
The benchmark runs zo build against the MNIST plan twice in identical environments:
- Default run:
zo build plans/mnist-digit-classifier.md, Opus lead, max-iterations 10, supervised gates auto-PROCEED-ed via--gate-mode full-auto - Low-token run:
zo build plans/mnist-digit-classifier.md --low-token, Sonnet lead, 2 iterations, full-auto gates, no headlines
For fair comparison, both runs use --gate-mode full-auto so the only differences are the low-token knobs themselves. Results are logged to ~/.claude/projects/*.jsonl; per-session totals are extracted via npx ccusage --json (one of ZO's planned optional integrations).
| Metric | How |
|---|---|
| Input tokens (lead) | Sum across all turns of the lead orchestrator |
| Output tokens (lead) | Same |
| Input/output tokens (sub-agents) | Sum across all spawned agents |
| Headline tokens | Sum of Haiku ticker calls |
| Wall time | From zo build start to session end |
| Oracle tier reached | must_pass, should_pass, or could_pass, ensures quality didn't regress unacceptably |
| Phase 4 iterations completed | Diagnostic, how often does each mode hit its iteration cap |
| Total cost (USD) | Computed from token totals × Anthropic published rates at benchmark time |
- Same machine (Mac M-series, 32GB RAM)
- Same Claude Code version
- Same MNIST plan, unmodified between runs
- Same delivery scaffold (fresh
zo init) - Both runs
--gate-mode full-auto(default would have beensupervisedfor the default-mode run; we override to keep gate behaviour identical)
The benchmark is single-shot per release. Re-running on every PR would cost ~$13 each cycle and provide minimal signal, variance is dominated by stochastic agent decisions, not measurement noise.
First completed end-to-end bench against the MNIST plan, captured via npx ccusage --instances --since "$(date -u +%Y%m%d)":
| Metric | Default (historical) | Low-token (measured) | Reduction |
|---|---|---|---|
| Lead orchestrator | ~$X.XX (Opus, 10 iter cap) | $4.48 (Sonnet 4.6) | Lead model swap is the dominant lever |
| Sub-agents (data-engineer, code-reviewer, test-engineer) | Sonnet (already) | $3.27 (Sonnet 4.6) | Same model, no token-rate saving here |
| Phase-4 iterations completed | 1 (MNIST converges fast) | 1 | Iteration cap didn't bite (MNIST converges in 1) |
| Headline tokens | ~30 calls × ~1.5 KB | 0 | ~$0.01 saved |
| Total cost | ~$11 | $7.75 | ~30% reduction |
| Wall time | ~50 min | ~75 min (incl. one stuck-gate cycle) | Mixed, see Caveats |
| Oracle tier reached | could_pass (99.66%) |
(training completed at 98.83%, full pipeline cycle did not complete; see Findings) | Same model quality at the lead step |
The headline number, 30%, is materially smaller than the 70-80% projected before measurement. The next two sections explain why.
The earlier 70-80% estimate assumed --low-token would swap every agent from Opus to Sonnet (~5× cheaper per token), so the savings would be ~80% across the board.
That assumption was wrong. Default mode never had every agent on Opus:
| Role | Default mode | --low-token mode |
Token spend share |
|---|---|---|---|
| Lead orchestrator | Opus 4.6 | Sonnet 4.6 | ~30-40% of total |
| Data engineer | Sonnet (per .md frontmatter) |
Sonnet | ~20-25% |
| Code reviewer | Sonnet | Sonnet | ~10-15% |
| Test engineer | Sonnet | Sonnet | ~10-15% |
| Oracle / QA | Sonnet | Sonnet | ~5-10% |
| Other sub-agents (xai, domain-eval, etc.) | Sonnet | Sonnet | remainder |
So --low-token only affects the lead's cost share (~30-40% of total). At ~5× cheaper-per-token on Sonnet, that's a ceiling of ~25-30% savings on the total, which is exactly what we measured.
The original projection extrapolated the per-token ratio (5×) to the whole run. The whole run was already mostly Sonnet. There was no 5× savings hiding in the sub-agents because their cost rate wasn't changing.
The 30% breaks down across the preset's seven knobs:
| Knob | Mechanism | Approximate contribution to the 30% |
|---|---|---|
| Lead Opus → Sonnet | ~5× per-token rate, ~30-40% cost share | ~20-25 pp |
Phase-4 max_iterations 10 → 2 |
Caps iteration count | ~2-5 pp (MNIST converges in 1, so cap doesn't bite) |
stop_on_tier must_pass → could_pass |
Earlier termination | ~1-3 pp (same as above for MNIST) |
Drop research-scout cross-cutting |
~6 spawns × small contracts | ~1-2 pp |
| No Haiku headline ticker | ~60 small calls/hour | under 1 pp (~$0.01) |
Default gate-mode full-auto |
No human-loop wall-time overhead | wall-time only, not token |
CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=60 |
Earlier auto-compaction | indirect (cleaner reasoning, no direct $ saving) |
For plans that hit the iteration cap on default mode (harder problems), the iteration knob would matter more. MNIST is too easy, it converges in iteration 1, so the cap → 2 saving is essentially zero.
The first bench measured ~30% with the lead-only swap. Two additional levers shipped post-bench target a ~50-60% ceiling without an SDK refactor; three further architectural levers target ~70-80%.
| Lever | Mechanism | Estimated additional savings | Status |
|---|---|---|---|
| Sub-agent model right-sizing | Two-tier routing: code-reviewer, test-engineer, oracle-qa → Haiku 4.5 (pattern-matching tasks); reasoning agents stay on Sonnet. Lead instructed via _prompt_low_token_overrides() |
~10-15% (Haiku is ~3× cheaper than Sonnet) | Shipped, LOW_TOKEN_HAIKU_AGENTS in src/zo/_orchestrator_phases.py |
| Phase-1 trim | Phase 1 was ~45% of first bench cost ($3.47 of $7.75). Drop code-reviewer, test-engineer, domain-evaluator from Phase 1 in low-token mode; defer to Gate 5 final pass. Just data-engineer runs |
~10-15% | Shipped, LOW_TOKEN_PHASE_DROPS["phase_1"] |
| Phase-5 trim | Drop xai-agent and domain-evaluator from Phase 5; lead writes single-shot analysis summary instead of dedicated explainability + domain-validation pass |
~3-5% | Shipped, LOW_TOKEN_PHASE_DROPS["phase_5"] |
A second bench post-PR-C is required to confirm whether the 50-60% target is hit. The headline number on this page updates when that bench lands.
To break past the ~50-60% ceiling, the path forward requires moving ZO from claude CLI subprocess to direct Anthropic SDK so ZO can control caching, batching, and file uploads programmatically:
| Lever | Mechanism | Potential additional savings | Library / API |
|---|---|---|---|
| Prompt caching | cache_control markers on plan/specs/agent roster (5-min TTL). About half the bench's tokens were cache reads at ~$0.30/M; explicit cache_control should bring that to ~$0.30 of every $3 of input |
~50-60% on the cache-read portion (~half of total) | Anthropic Python SDK direct |
| Batch API | 50% discount for async jobs. Suitable for the Haiku ticker, end-of-session summary, and any non-time-critical evaluation rounds | 50% off batched calls | Anthropic Batch API |
| Files API | Upload plan + specs once, reference by ID. Eliminates repeated upload of the same large context blocks | ~10-15% on input tokens for large plans | Anthropic Files API |
The 70-80% figure was always achievable, just not via the v1 preset alone. The SDK refactor is multi-week and lands in v1.1.
Beyond the headline cost number, the first bench surfaced material findings:
Confirmed working:
- Lead orchestrator runs on Sonnet under
--low-token.ps auxduring the run showedclaude --model sonnet --max-turns 200for the lead. - Sub-agent model override works. All three spawned sub-agents (data-engineer, code-reviewer, test-engineer) ran on
claude-sonnet-4-6perps aux. The orchestrator's_prompt_low_token_overrides()instruction to passmodel="claude-sonnet-4-6"to everyAgent()call is honoured by Claude Code 2.1.107. The earlier "Claude Code 2.1.92 ignores the param" hypothesis is no longer load-bearing: either the bug was version-specific or the prompt-level override always was sufficient. No SDK refactor needed for this piece.
Contract violation discovered (now fixed):
zo watch-trainingrendered "Waiting for training to start..." for the entire run despite training completing at 98.83% test accuracy. Three stacked contract violations: (a)model-builder.mdhad two contradictory paths for training metrics (.zo/experiments/<exp_id>/vslogs/training/); (b)wrapper.pyandcli.py:watch_traininghardcoded the wrong path; (c) the Phase 4 gate was aspirational: only checkedresult.md, not the actualZOTrainingCallbackartifacts. Sonnet (low-token) ignored both contradictory instructions and wrote a vanilla PyTorch JSON dump to a third invented path.- Fix shipped: new
resolve_active_experiment_dir()helper as canonical resolver;cli.pyandwrapper.pyconsume it; orchestrator's gate now hard-fails whenmetrics.jsonlandtraining_status.jsonare missing alongsideresult.md. Captured in PRIORS.md as PR-035: aspirational agent contracts get ignored under sub-optimal models: hard gate enforcement is mandatory.
-
MNIST is an easy benchmark. The convergence-iteration cap (10 → 2) doesn't bite on MNIST because the model converges in iteration 1. On a harder problem, the iteration cap matters more, and
--low-tokenmay fail to converge where default would have succeeded. Re-run without--low-tokenifzo statusshowsBUDGET_EXHAUSTED. -
Lead model swap doesn't show MNIST quality regression. MNIST is well within Sonnet's capability for plan decomposition. On novel research-grade plans, Opus may catch nuances Sonnet misses.
-
Wall-time savings are not all from tokens. ~30% of the wall-time saving comes from skipped human-loop overhead (
--gate-mode full-autodefault in low-token), not from compute savings. If you keep--gate-mode supervised, wall time savings shrink even though token savings remain. -
Cost depends on the Anthropic rate card at benchmark time. Pricing changes ripple through the total. The methodology measures token counts primarily; cost in USD is derived.
-
Pro-plan caps measure messages, not tokens. Anthropic Pro subscribers hit a daily message cap, not a token cap. Low-token mode reduces both per-message tokens AND total messages (no Haiku ticker, no end-of-session summary). The cap savings are roughly proportional but not identical to token savings.
The harness lives at scripts/benchmark_low_token.sh. Usage:
# From the ZO repo root, with ccusage installed (`npm install -g ccusage`):
./scripts/benchmark_low_token.sh
# Or specify a delivery prefix:
./scripts/benchmark_low_token.sh --delivery-prefix /tmp/zo-benchThe script:
- Runs
zo init mnist-bench-default --no-tmux ...to scaffold a fresh delivery repo - Captures pre-build ccusage snapshot
- Runs
zo buildin default mode + waits for completion - Captures post-build ccusage snapshot, diff = default-mode tokens
- Runs
zo init mnist-bench-low --no-tmux ...for the low-token delivery - Pre-build ccusage snapshot
- Runs
zo build --low-token+ waits - Post-build ccusage snapshot, diff = low-token tokens
- Writes
benchmark-results-{timestamp}.jsonwith the comparison - Prints a summary table
Total wall time: ~75 minutes (default ~50 + low-token ~25). Total cost: $13-14 ($11 + ~$2-3).
This page is updated when a new ZO version changes anything that materially affects the cost profile (preset values, agent roster, prompt structure). The history of measurements over time gives a longitudinal view of how cost-per-build evolves with the platform.
| ZO version | Date | Default cost | Low-token cost | Reduction | Source |
|---|---|---|---|---|---|
| 1.0.2 + low-token | 2026-04-27 | ~$11 (historical) | $7.75 (measured) | ~30% | First measured bench, MNIST plan, ccusage --instances |