bench: use BatchSize::PerIteration for cold-cache variants#2728
Merged
Conversation
Cold benchmark variants in `list.rs`, `remove.rs`, and `time_to_first_output.rs` were passing `BatchSize::SmallInput` to `iter_batched`. Under `SmallInput`, criterion (0.8) calls the setup closure once per batch up front, then runs the routines back-to-back inside a single timing window — so when the setup is `invalidate_caches_auto`, only iter 1 per batch is actually cold and iters 2-N read from the cache the previous routine just populated. The reported "cold" medians were warm-biased averages. `BatchSize::PerIteration` switches to `setup → time(routine)` per iter, so every measurement is genuinely cold. Setup is far cheaper than a `wt` subprocess, so per-iter `Instant::now` overhead doesn't dominate. Measured corrections on this branch: - `list full/cold/8`: spread 16.0 ms → 3.9 ms (4× tighter); median −5.7% (n.s., p=0.14) — warm-bias previously inflated upper outliers. - `remove_e2e/first_output`: median 48 → 86 ms (+44%, p<0.001). The routine writes is-ancestor/has-added-changes/merge-add-probe entries via `compute_integration_lazy`, so warm-bias was substantial. - `first_output/remove`: median 50 → 70 ms (+23%, p<0.001) with spread tightening 2.4 ms → 0.65 ms (3.7×). `skeleton/*` and `worktree_scaling/*` cold variants share the `run_benchmark` helper, so they pick up the fix too. The `real_repo` group's cold branch is also switched (separate call site). Other `BatchSize::SmallInput` sites in `remove.rs` (`no_hooks`, `with_hooks`) use `recreate_worktree` as setup — that doesn't invalidate a cache the routine repopulates, so they stay as-is. Likewise `BatchSize::PerIteration` was already in use on the `picker_preview` bench (#2721); this aligns the rest of the suite. `benches/CLAUDE.md` "Cache Handling" now points future bench authors at `PerIteration` so warm-bias doesn't silently recur on new cold variants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
worktrunk-bot
approved these changes
May 12, 2026
2 tasks
max-sixty
added a commit
that referenced
this pull request
May 12, 2026
) ## Summary Follow-up to #2728. `benches/alias.rs` was the last `iter_batched(invalidate_caches_auto, …, BatchSize::SmallInput)` site in the bench suite. Under `SmallInput`, criterion calls the setup closure once per batch and times the routines back-to-back inside one timing window — so only iter 1 per batch is actually cold, and iters 2-N hit the `worktrunk.default-branch` git-config entry that `build_hook_context` wrote on iter 1 (default-branch resolution is ~17ms on a synthetic repo per `benches/CLAUDE.md`). `BatchSize::PerIteration` runs `setup → time(routine)` per iter, so every measurement is genuinely cold. ## Measurement `dispatch/cold/1` (stub alias, 1 worktree): | | Before | After | |---|---|---| | Median | 23.14 ms | 23.94 ms | | Range | [22.34, 24.01] | [22.91, 25.03] | Median +6.9% (p=0.05, "Change within noise threshold"). The correction is small — alias dispatch is cheap (~23ms) and only the default-branch cache is being repopulated, so each warm iter saved ~1ms — but the direction is right and it brings the bench in line with the rest of the suite (#2728) and the `benches/CLAUDE.md` "Cache Handling" guidance. ## Test plan - [x] `cargo run -- hook pre-merge --yes` — 3684 tests pass, 0 failed - [x] `dispatch/cold/1` run before+after on this branch with `--save-baseline` / `--baseline` Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cold benchmark variants in
list.rs,remove.rs, andtime_to_first_output.rswere passingBatchSize::SmallInputtoiter_batched. UnderSmallInput, criterion calls the setup closure once per batch up front and then runs the routines back-to-back inside a single timing window — so when setup isinvalidate_caches_auto, only iter 1 per batch is actually cold and iters 2-N read from the cache the previous routine just populated. The reported "cold" medians were warm-biased averages.BatchSize::PerIterationswitches tosetup → time(routine)per iter, so every measured iter is genuinely cold. The setup is far cheaper than awtsubprocess, so per-iterInstant::nowoverhead doesn't dominate. Codex flagged this on the siblingpicker_previewbench (#2721); that PR landedPerIterationfor its cold path, and this aligns the rest of the bench suite.Measured corrections
list full/cold/8remove_e2e/first_outputfirst_output/removeremove_e2e/first_outputis the starkest correction —compute_integration_lazywritesis-ancestor/has-added-changes/merge-add-probeentries, so warm-bias was substantial.first_output/removeis the cleanest demonstration of variance tightening (3.7× tighter spread).list full/cold/8doesn't move the median significantly but the upper-tail outliers that warm-bias was inflating disappear.Scope
Only the four
iter_batched(… invalidate_caches_auto, …, BatchSize::SmallInput)call sites are switched:benches/list.rsrun_benchmark(coversskeleton,full,worktree_scaling,many_branches,divergent_branchescold variants)benches/list.rsbench_real_repocold pathbenches/remove.rsfirst_outputbenches/time_to_first_output.rsremoveOther
BatchSize::SmallInputsites inremove.rs(no_hooks,with_hooks) userecreate_worktreeas setup — that doesn't invalidate a cache the routine repopulates, so they stay as-is.benches/CLAUDE.md"Cache Handling" gets a paragraph so future bench authors reach forPerIterationwhen the setup invalidates a cache the routine repopulates.Test plan
cargo run -- hook pre-merge --yes— 3667 tests pass, 0 failed--save-baseline/--baseline