Skip to content

bench: use BatchSize::PerIteration for cold-cache variants#2728

Merged
max-sixty merged 1 commit into
mainfrom
bench-cold-per-iteration
May 12, 2026
Merged

bench: use BatchSize::PerIteration for cold-cache variants#2728
max-sixty merged 1 commit into
mainfrom
bench-cold-per-iteration

Conversation

@max-sixty

Copy link
Copy Markdown
Owner

Summary

Cold benchmark variants in list.rs, remove.rs, and time_to_first_output.rs were passing BatchSize::SmallInput to iter_batched. Under SmallInput, criterion calls the setup closure once per batch up front and then runs the routines back-to-back inside a single timing window — so when setup is invalidate_caches_auto, only iter 1 per batch is actually cold and iters 2-N read from the cache the previous routine just populated. The reported "cold" medians were warm-biased averages.

BatchSize::PerIteration switches to setup → time(routine) per iter, so every measured iter is genuinely cold. The setup is far cheaper than a wt subprocess, so per-iter Instant::now overhead doesn't dominate. Codex flagged this on the sibling picker_preview bench (#2721); that PR landed PerIteration for its cold path, and this aligns the rest of the bench suite.

Measured corrections

Variant Before median After median Spread (before → after) Median Δ
list full/cold/8 113.5 ms 107.1 ms 16.0 → 3.9 ms −5.7% (n.s., p=0.14)
remove_e2e/first_output 48.2 ms 86.4 ms 2.6 → 17.9 ms +44.4% (p<0.001)
first_output/remove 50.5 ms 69.6 ms 2.4 → 0.65 ms +23.0% (p<0.001)

remove_e2e/first_output is the starkest correction — compute_integration_lazy writes is-ancestor / has-added-changes / merge-add-probe entries, so warm-bias was substantial. first_output/remove is the cleanest demonstration of variance tightening (3.7× tighter spread). list full/cold/8 doesn't move the median significantly but the upper-tail outliers that warm-bias was inflating disappear.

Scope

Only the four iter_batched(… invalidate_caches_auto, …, BatchSize::SmallInput) call sites are switched:

  • benches/list.rs run_benchmark (covers skeleton, full, worktree_scaling, many_branches, divergent_branches cold variants)
  • benches/list.rs bench_real_repo cold path
  • benches/remove.rs first_output
  • benches/time_to_first_output.rs remove

Other BatchSize::SmallInput sites in remove.rs (no_hooks, with_hooks) use recreate_worktree as setup — that doesn't invalidate a cache the routine repopulates, so they stay as-is.

benches/CLAUDE.md "Cache Handling" gets a paragraph so future bench authors reach for PerIteration when the setup invalidates a cache the routine repopulates.

Test plan

  • cargo run -- hook pre-merge --yes — 3667 tests pass, 0 failed
  • Each cold variant run before+after on this branch with --save-baseline / --baseline

Cold benchmark variants in `list.rs`, `remove.rs`, and
`time_to_first_output.rs` were passing `BatchSize::SmallInput` to
`iter_batched`. Under `SmallInput`, criterion (0.8) calls the setup
closure once per batch up front, then runs the routines back-to-back
inside a single timing window — so when the setup is
`invalidate_caches_auto`, only iter 1 per batch is actually cold and
iters 2-N read from the cache the previous routine just populated. The
reported "cold" medians were warm-biased averages.

`BatchSize::PerIteration` switches to `setup → time(routine)` per iter,
so every measurement is genuinely cold. Setup is far cheaper than a
`wt` subprocess, so per-iter `Instant::now` overhead doesn't dominate.

Measured corrections on this branch:

- `list full/cold/8`: spread 16.0 ms → 3.9 ms (4× tighter); median
  −5.7% (n.s., p=0.14) — warm-bias previously inflated upper outliers.
- `remove_e2e/first_output`: median 48 → 86 ms (+44%, p<0.001). The
  routine writes is-ancestor/has-added-changes/merge-add-probe entries
  via `compute_integration_lazy`, so warm-bias was substantial.
- `first_output/remove`: median 50 → 70 ms (+23%, p<0.001) with spread
  tightening 2.4 ms → 0.65 ms (3.7×).

`skeleton/*` and `worktree_scaling/*` cold variants share the
`run_benchmark` helper, so they pick up the fix too. The `real_repo`
group's cold branch is also switched (separate call site).

Other `BatchSize::SmallInput` sites in `remove.rs` (`no_hooks`,
`with_hooks`) use `recreate_worktree` as setup — that doesn't
invalidate a cache the routine repopulates, so they stay as-is.
Likewise `BatchSize::PerIteration` was already in use on the
`picker_preview` bench (#2721); this aligns the rest of the suite.

`benches/CLAUDE.md` "Cache Handling" now points future bench authors at
`PerIteration` so warm-bias doesn't silently recur on new cold variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@max-sixty max-sixty merged commit d7c8428 into main May 12, 2026
34 checks passed
@max-sixty max-sixty deleted the bench-cold-per-iteration branch May 12, 2026 00:50
max-sixty added a commit that referenced this pull request May 12, 2026
)

## Summary

Follow-up to #2728. `benches/alias.rs` was the last
`iter_batched(invalidate_caches_auto, …, BatchSize::SmallInput)` site in
the bench suite. Under `SmallInput`, criterion calls the setup closure
once per batch and times the routines back-to-back inside one timing
window — so only iter 1 per batch is actually cold, and iters 2-N hit
the `worktrunk.default-branch` git-config entry that
`build_hook_context` wrote on iter 1 (default-branch resolution is ~17ms
on a synthetic repo per `benches/CLAUDE.md`). `BatchSize::PerIteration`
runs `setup → time(routine)` per iter, so every measurement is genuinely
cold.

## Measurement

`dispatch/cold/1` (stub alias, 1 worktree):

| | Before | After |
|---|---|---|
| Median | 23.14 ms | 23.94 ms |
| Range | [22.34, 24.01] | [22.91, 25.03] |

Median +6.9% (p=0.05, "Change within noise threshold"). The correction
is small — alias dispatch is cheap (~23ms) and only the default-branch
cache is being repopulated, so each warm iter saved ~1ms — but the
direction is right and it brings the bench in line with the rest of the
suite (#2728) and the `benches/CLAUDE.md` "Cache Handling" guidance.

## Test plan

- [x] `cargo run -- hook pre-merge --yes` — 3684 tests pass, 0 failed
- [x] `dispatch/cold/1` run before+after on this branch with
`--save-baseline` / `--baseline`

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants