zero-operators/docs/reference/cost-benchmark.mdx at main · SamPlvs/zero-operators

title	Cost benchmark
description	Methodology and measured results for ZO's --low-token cost-saving preset, benchmarked against the MNIST end-to-end reference run.

This page documents the methodology and results for the --low-token cost benchmark, a controlled comparison between default and low-token modes on ZO's canonical MNIST end-to-end reference run.

**Status:** measured on 2026-04-27 against the canonical MNIST plan. Result: low-token mode landed at **\$7.75** end-to-end (Sonnet lead \$4.48 + Sonnet sub-agents \$3.27, captured via `npx ccusage --instances`) vs. the historical default-mode reference of **~\$11**. That's a **~30% reduction**, not the 70-80% projected before measurement. The structural reason is below, see [Why the savings ceiling is structural](#why-the-savings-ceiling-is-structural). The earlier 70-80% projection assumed every agent would swap from Opus to Sonnet; in practice only the lead is Opus by default (sub-agents already run on Sonnet via their `.md` frontmatter), so `--low-token` only affects the lead's ~30-40% cost share.

Why benchmark MNIST

MNIST is ZO's canonical reference because:

Stable oracle: the must-pass tier (95% test accuracy) is well within the architecture's capability
Six full phases: exercises Phase 1 (data) through Phase 6 (packaging), the full lifecycle
Phase-4 iterations matter: non-trivially, the autonomous loop matters for cost
Reproducible: the plan, data, and oracle don't drift between runs
Documented baseline: session-005 gave us a precise cost ($11) and wall-time (~50min) anchor point

Methodology

The benchmark runs zo build against the MNIST plan twice in identical environments:

Default run: zo build plans/mnist-digit-classifier.md, Opus lead, max-iterations 10, supervised gates auto-PROCEED-ed via --gate-mode full-auto
Low-token run: zo build plans/mnist-digit-classifier.md --low-token, Sonnet lead, 2 iterations, full-auto gates, no headlines

For fair comparison, both runs use --gate-mode full-auto so the only differences are the low-token knobs themselves. Results are logged to ~/.claude/projects/*.jsonl; per-session totals are extracted via npx ccusage --json (one of ZO's planned optional integrations).

Measured dimensions

Metric	How
Input tokens (lead)	Sum across all turns of the lead orchestrator
Output tokens (lead)	Same
Input/output tokens (sub-agents)	Sum across all spawned agents
Headline tokens	Sum of Haiku ticker calls
Wall time	From `zo build` start to session end
Oracle tier reached	`must_pass`, `should_pass`, or `could_pass`, ensures quality didn't regress unacceptably
Phase 4 iterations completed	Diagnostic, how often does each mode hit its iteration cap
Total cost (USD)	Computed from token totals × Anthropic published rates at benchmark time

Controlled variables

Same machine (Mac M-series, 32GB RAM)
Same Claude Code version
Same MNIST plan, unmodified between runs
Same delivery scaffold (fresh zo init)
Both runs --gate-mode full-auto (default would have been supervised for the default-mode run; we override to keep gate behaviour identical)

Run frequency

The benchmark is single-shot per release. Re-running on every PR would cost ~$13 each cycle and provide minimal signal, variance is dominated by stochastic agent decisions, not measurement noise.

Measured results (2026-04-27)

First completed end-to-end bench against the MNIST plan, captured via npx ccusage --instances --since "$(date -u +%Y%m%d)":

Metric	Default (historical)	Low-token (measured)	Reduction
Lead orchestrator	~$X.XX (Opus, 10 iter cap)	$4.48 (Sonnet 4.6)	Lead model swap is the dominant lever
Sub-agents (data-engineer, code-reviewer, test-engineer)	Sonnet (already)	$3.27 (Sonnet 4.6)	Same model, no token-rate saving here
Phase-4 iterations completed	1 (MNIST converges fast)	1	Iteration cap didn't bite (MNIST converges in 1)
Headline tokens	~30 calls × ~1.5 KB	0	~$0.01 saved
Total cost	~$11	$7.75	~30% reduction
Wall time	~50 min	~75 min (incl. one stuck-gate cycle)	Mixed, see Caveats
Oracle tier reached	`could_pass` (99.66%)	(training completed at 98.83%, full pipeline cycle did not complete; see Findings)	Same model quality at the lead step

The headline number, 30%, is materially smaller than the 70-80% projected before measurement. The next two sections explain why.

Why the savings ceiling is structural

The earlier 70-80% estimate assumed --low-token would swap every agent from Opus to Sonnet (~5× cheaper per token), so the savings would be ~80% across the board.

That assumption was wrong. Default mode never had every agent on Opus:

Role	Default mode	`--low-token` mode	Token spend share
Lead orchestrator	Opus 4.6	Sonnet 4.6	~30-40% of total
Data engineer	Sonnet (per `.md` frontmatter)	Sonnet	~20-25%
Code reviewer	Sonnet	Sonnet	~10-15%
Test engineer	Sonnet	Sonnet	~10-15%
Oracle / QA	Sonnet	Sonnet	~5-10%
Other sub-agents (xai, domain-eval, etc.)	Sonnet	Sonnet	remainder

So --low-token only affects the lead's cost share (~30-40% of total). At ~5× cheaper-per-token on Sonnet, that's a ceiling of ~25-30% savings on the total, which is exactly what we measured.

The original projection extrapolated the per-token ratio (5×) to the whole run. The whole run was already mostly Sonnet. There was no 5× savings hiding in the sub-agents because their cost rate wasn't changing.

What `--low-token` actually moves

The 30% breaks down across the preset's seven knobs:

Knob	Mechanism	Approximate contribution to the 30%
Lead Opus → Sonnet	~5× per-token rate, ~30-40% cost share	~20-25 pp
Phase-4 `max_iterations` 10 → 2	Caps iteration count	~2-5 pp (MNIST converges in 1, so cap doesn't bite)
`stop_on_tier` `must_pass` → `could_pass`	Earlier termination	~1-3 pp (same as above for MNIST)
Drop `research-scout` cross-cutting	~6 spawns × small contracts	~1-2 pp
No Haiku headline ticker	~60 small calls/hour	under 1 pp (~$0.01)
Default gate-mode `full-auto`	No human-loop wall-time overhead	wall-time only, not token
`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=60`	Earlier auto-compaction	indirect (cleaner reasoning, no direct $ saving)

For plans that hit the iteration cap on default mode (harder problems), the iteration knob would matter more. MNIST is too easy, it converges in iteration 1, so the cap → 2 saving is essentially zero.

What would push savings higher

The first bench measured ~30% with the lead-only swap. Two additional levers shipped post-bench target a ~50-60% ceiling without an SDK refactor; three further architectural levers target ~70-80%.

Shipped post-first-bench (target: ~50-60%)

Lever	Mechanism	Estimated additional savings	Status
Sub-agent model right-sizing	Two-tier routing: `code-reviewer`, `test-engineer`, `oracle-qa` → Haiku 4.5 (pattern-matching tasks); reasoning agents stay on Sonnet. Lead instructed via `_prompt_low_token_overrides()`	~10-15% (Haiku is ~3× cheaper than Sonnet)	Shipped, `LOW_TOKEN_HAIKU_AGENTS` in `src/zo/_orchestrator_phases.py`
Phase-1 trim	Phase 1 was ~45% of first bench cost ($3.47 of $7.75). Drop `code-reviewer`, `test-engineer`, `domain-evaluator` from Phase 1 in low-token mode; defer to Gate 5 final pass. Just data-engineer runs	~10-15%	Shipped, `LOW_TOKEN_PHASE_DROPS["phase_1"]`
Phase-5 trim	Drop `xai-agent` and `domain-evaluator` from Phase 5; lead writes single-shot analysis summary instead of dedicated explainability + domain-validation pass	~3-5%	Shipped, `LOW_TOKEN_PHASE_DROPS["phase_5"]`

A second bench post-PR-C is required to confirm whether the 50-60% target is hit. The headline number on this page updates when that bench lands.

Architectural: not yet shipped (target: ~70-80%)

To break past the ~50-60% ceiling, the path forward requires moving ZO from claude CLI subprocess to direct Anthropic SDK so ZO can control caching, batching, and file uploads programmatically:

Lever	Mechanism	Potential additional savings	Library / API
Prompt caching	`cache_control` markers on plan/specs/agent roster (5-min TTL). About half the bench's tokens were cache reads at ~$0.30/M; explicit cache_control should bring that to ~$0.30 of every $3 of input	~50-60% on the cache-read portion (~half of total)	Anthropic Python SDK direct
Batch API	50% discount for async jobs. Suitable for the Haiku ticker, end-of-session summary, and any non-time-critical evaluation rounds	50% off batched calls	Anthropic Batch API
Files API	Upload plan + specs once, reference by ID. Eliminates repeated upload of the same large context blocks	~10-15% on input tokens for large plans	Anthropic Files API

The 70-80% figure was always achievable, just not via the v1 preset alone. The SDK refactor is multi-week and lands in v1.1.

Findings from the first measured run (2026-04-27)

Beyond the headline cost number, the first bench surfaced material findings:

Confirmed working:

Lead orchestrator runs on Sonnet under --low-token. ps aux during the run showed claude --model sonnet --max-turns 200 for the lead.
Sub-agent model override works. All three spawned sub-agents (data-engineer, code-reviewer, test-engineer) ran on claude-sonnet-4-6 per ps aux. The orchestrator's _prompt_low_token_overrides() instruction to pass model="claude-sonnet-4-6" to every Agent() call is honoured by Claude Code 2.1.107. The earlier "Claude Code 2.1.92 ignores the param" hypothesis is no longer load-bearing: either the bug was version-specific or the prompt-level override always was sufficient. No SDK refactor needed for this piece.

Contract violation discovered (now fixed):

zo watch-training rendered "Waiting for training to start..." for the entire run despite training completing at 98.83% test accuracy. Three stacked contract violations: (a) model-builder.md had two contradictory paths for training metrics (.zo/experiments/<exp_id>/ vs logs/training/); (b) wrapper.py and cli.py:watch_training hardcoded the wrong path; (c) the Phase 4 gate was aspirational: only checked result.md, not the actual ZOTrainingCallback artifacts. Sonnet (low-token) ignored both contradictory instructions and wrote a vanilla PyTorch JSON dump to a third invented path.
Fix shipped: new resolve_active_experiment_dir() helper as canonical resolver; cli.py and wrapper.py consume it; orchestrator's gate now hard-fails when metrics.jsonl and training_status.json are missing alongside result.md. Captured in PRIORS.md as PR-035: aspirational agent contracts get ignored under sub-optimal models: hard gate enforcement is mandatory.

Caveats and known limitations

MNIST is an easy benchmark. The convergence-iteration cap (10 → 2) doesn't bite on MNIST because the model converges in iteration 1. On a harder problem, the iteration cap matters more, and --low-token may fail to converge where default would have succeeded. Re-run without --low-token if zo status shows BUDGET_EXHAUSTED.
Lead model swap doesn't show MNIST quality regression. MNIST is well within Sonnet's capability for plan decomposition. On novel research-grade plans, Opus may catch nuances Sonnet misses.
Wall-time savings are not all from tokens. ~30% of the wall-time saving comes from skipped human-loop overhead (--gate-mode full-auto default in low-token), not from compute savings. If you keep --gate-mode supervised, wall time savings shrink even though token savings remain.
Cost depends on the Anthropic rate card at benchmark time. Pricing changes ripple through the total. The methodology measures token counts primarily; cost in USD is derived.
Pro-plan caps measure messages, not tokens. Anthropic Pro subscribers hit a daily message cap, not a token cap. Low-token mode reduces both per-message tokens AND total messages (no Haiku ticker, no end-of-session summary). The cap savings are roughly proportional but not identical to token savings.

Reproducing the benchmark

The harness lives at scripts/benchmark_low_token.sh. Usage:

# From the ZO repo root, with ccusage installed (`npm install -g ccusage`):
./scripts/benchmark_low_token.sh

# Or specify a delivery prefix:
./scripts/benchmark_low_token.sh --delivery-prefix /tmp/zo-bench

The script:

Runs zo init mnist-bench-default --no-tmux ... to scaffold a fresh delivery repo
Captures pre-build ccusage snapshot
Runs zo build in default mode + waits for completion
Captures post-build ccusage snapshot, diff = default-mode tokens
Runs zo init mnist-bench-low --no-tmux ... for the low-token delivery
Pre-build ccusage snapshot
Runs zo build --low-token + waits
Post-build ccusage snapshot, diff = low-token tokens
Writes benchmark-results-{timestamp}.json with the comparison
Prints a summary table

Total wall time: ~75 minutes (default ~50 + low-token ~25). Total cost: ~~$13-14 (~~$11 + ~$2-3).

**The benchmark spends real tokens.** Don't run it on a free tier. The methodology assumes API access (or a Max plan with comfortable headroom).

Updates

This page is updated when a new ZO version changes anything that materially affects the cost profile (preset values, agent roster, prompt structure). The history of measurements over time gives a longitudinal view of how cost-per-build evolves with the platform.

ZO version	Date	Default cost	Low-token cost	Reduction	Source
1.0.2 + low-token	2026-04-27	~$11 (historical)	$7.75 (measured)	~30%	First measured bench, MNIST plan, ccusage `--instances`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why benchmark MNIST

Methodology

Measured dimensions

Controlled variables

Run frequency

Measured results (2026-04-27)

Why the savings ceiling is structural

What `--low-token` actually moves

What would push savings higher

Shipped post-first-bench (target: ~50-60%)

Architectural: not yet shipped (target: ~70-80%)

Findings from the first measured run (2026-04-27)

Caveats and known limitations

Reproducing the benchmark

Updates

See also

FilesExpand file tree

cost-benchmark.mdx

Latest commit

History

cost-benchmark.mdx

File metadata and controls

Why benchmark MNIST

Methodology

Measured dimensions

Controlled variables

Run frequency

Measured results (2026-04-27)

Why the savings ceiling is structural

What --low-token actually moves

What would push savings higher

Shipped post-first-bench (target: ~50-60%)

Architectural: not yet shipped (target: ~70-80%)

Findings from the first measured run (2026-04-27)

Caveats and known limitations

Reproducing the benchmark

Updates

See also

What `--low-token` actually moves