bench: measure wt switch picker preview pre-compute workload #2721
Merged
Add `picker_preview` benchmark group plus a `WORKTRUNK_PREVIEW_BENCH=1` env var that runs the picker prelude (collect, speculative spawn, skeleton, initial + deferred precompute) and exits after `orchestrator.wait_for_idle()`, before skim launches and before any JSON serialization or stderr drain. This gives a headless measurement of "spawn → all preview tasks drained" without a PTY.

The recent unified rayon pool / SHA cache / bench-harness work (#2662 / #2683 / #2685 / #2704) was tuned against `wt list` as a proxy because no direct picker measurement existed; this benchmark is that measurement.

Variants: `warm/typical-8` and `cold/typical-8`. Cold uses `BatchSize::PerIteration` (not `SmallInput`) so every iteration genuinely invalidates first. `SmallInput` batches setup up front and runs the timed routines back-to-back, which biases the cold measurement warm: once the first iter in each batch repopulates the cache, the remaining iters read from it.

`cfg(unix)`-gated with a no-op `main` on Windows; the picker is Unix-only and `wt switch` with no args hits the unavailable path before the env var is consulted.

> _This was written by Claude Code on behalf of Maximilian Roos_

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
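The "spawn → all preview tasks drained" measurement can be pictured with a minimal, std-only sketch. Everything here is an assumption for illustration: `Orchestrator`, `submit`, and `drain_previews` are hypothetical names, plain threads stand in for the global rayon pool, and a short sleep stands in for preview computation. It is not the picker's actual implementation.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the picker's orchestrator: counts in-flight tasks.
struct Orchestrator {
    pending: Mutex<usize>,
    idle: Condvar,
}

impl Orchestrator {
    fn new() -> Arc<Self> {
        Arc::new(Self { pending: Mutex::new(0), idle: Condvar::new() })
    }

    /// Submit one task. A plain thread stands in for the rayon pool (assumption).
    fn submit<F: FnOnce() + Send + 'static>(this: &Arc<Self>, task: F) {
        // Count the task before spawning so `wait_for_idle` cannot miss it.
        *this.pending.lock().unwrap() += 1;
        let orchestrator = Arc::clone(this);
        thread::spawn(move || {
            task();
            let mut pending = orchestrator.pending.lock().unwrap();
            *pending -= 1;
            if *pending == 0 {
                orchestrator.idle.notify_all();
            }
        });
    }

    /// Block until every submitted task has finished.
    fn wait_for_idle(&self) {
        let mut pending = self.pending.lock().unwrap();
        while *pending > 0 {
            pending = self.idle.wait(pending).unwrap();
        }
    }
}

/// Submit one fake preview task per row, wait for the drain, and report
/// how many tasks completed plus the wall-clock time to drain.
fn drain_previews(rows: usize) -> (usize, Duration) {
    let orchestrator = Orchestrator::new();
    let completed = Arc::new(Mutex::new(0usize));
    let start = Instant::now();
    for _ in 0..rows {
        let completed = Arc::clone(&completed);
        Orchestrator::submit(&orchestrator, move || {
            thread::sleep(Duration::from_millis(5)); // pretend preview work
            *completed.lock().unwrap() += 1;
        });
    }
    orchestrator.wait_for_idle();
    let count = *completed.lock().unwrap();
    (count, start.elapsed())
}

fn main() {
    let (completed, elapsed) = drain_previews(8);
    println!("{completed} previews drained in {elapsed:?}");
}
```

The quantity of interest is the elapsed time between the first submit and the return of `wait_for_idle`; exiting at that point is what lets a benchmark observe the drain without a terminal attached.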
worktrunk-bot approved these changes May 11, 2026
2 tasks
max-sixty added a commit that referenced this pull request May 12, 2026
## Summary

Cold benchmark variants in `list.rs`, `remove.rs`, and `time_to_first_output.rs` were passing `BatchSize::SmallInput` to `iter_batched`. Under `SmallInput`, criterion runs all of a batch's setup calls up front and then runs the routines back-to-back inside a single timing window. When setup is `invalidate_caches_auto`, only iter 1 per batch is actually cold, and iters 2-N read from the cache the previous routine just populated. The reported "cold" medians were warm-biased averages.

`BatchSize::PerIteration` switches to `setup → time(routine)` per iter, so every measured iter is genuinely cold. The setup is far cheaper than a `wt` subprocess, so per-iter `Instant::now` overhead doesn't dominate.

Codex flagged this on the sibling `picker_preview` bench (#2721); that PR landed `PerIteration` for its cold path, and this aligns the rest of the bench suite.

## Measured corrections

| Variant | Before median | After median | Spread (before → after) | Median Δ |
|---|---|---|---|---|
| `list full/cold/8` | 113.5 ms | 107.1 ms | 16.0 → 3.9 ms | −5.7% (n.s., p=0.14) |
| `remove_e2e/first_output` | 48.2 ms | 86.4 ms | 2.6 → 17.9 ms | **+44.4%** (p<0.001) |
| `first_output/remove` | 50.5 ms | 69.6 ms | 2.4 → 0.65 ms | **+23.0%** (p<0.001) |

`remove_e2e/first_output` is the starkest correction: `compute_integration_lazy` writes `is-ancestor` / `has-added-changes` / `merge-add-probe` entries, so warm-bias was substantial. `first_output/remove` is the cleanest demonstration of variance tightening (3.7× tighter spread). `list full/cold/8` doesn't move the median significantly, but the upper-tail outliers that warm-bias was inflating disappear.
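The warm-bias mechanism described in the Summary can be illustrated with a toy model. This is a sketch under stated assumptions, not criterion's code: `Cache`, `invalidate`, and `routine` are hypothetical stand-ins for the on-disk cache, `invalidate_caches_auto`, and the benchmarked `wt` invocation, and the two loops only mimic the ordering of setup and routine calls under each batching strategy.

```rust
/// Toy cache: `warm` is true once a routine has populated it.
struct Cache {
    warm: bool,
}

/// Stand-in for `invalidate_caches_auto` (hypothetical model).
fn invalidate(cache: &mut Cache) {
    cache.warm = false;
}

/// Stand-in for the benchmarked routine: reports whether it ran cold,
/// then repopulates the cache as a side effect.
fn routine(cache: &mut Cache) -> bool {
    let was_cold = !cache.warm;
    cache.warm = true;
    was_cold
}

/// Models batched setup: every setup call in the batch runs first,
/// then the routines run back-to-back.
fn cold_iters_batched(batch_len: usize) -> usize {
    let mut cache = Cache { warm: true };
    for _ in 0..batch_len {
        invalidate(&mut cache);
    }
    let mut cold = 0;
    for _ in 0..batch_len {
        if routine(&mut cache) {
            cold += 1;
        }
    }
    cold
}

/// Models per-iteration setup: invalidate immediately before each routine.
fn cold_iters_per_iteration(iters: usize) -> usize {
    let mut cache = Cache { warm: true };
    let mut cold = 0;
    for _ in 0..iters {
        invalidate(&mut cache);
        if routine(&mut cache) {
            cold += 1;
        }
    }
    cold
}

fn main() {
    println!("batched setup:       {} of 10 iters cold", cold_iters_batched(10));
    println!("per-iteration setup: {} of 10 iters cold", cold_iters_per_iteration(10));
}
```

In this model the batched ordering yields exactly one cold iteration per batch regardless of batch length, while the per-iteration ordering keeps every iteration cold, which is the property the switch to `BatchSize::PerIteration` is after.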
## Scope

Only the four `iter_batched(… invalidate_caches_auto, …, BatchSize::SmallInput)` call sites are switched:

- `benches/list.rs` `run_benchmark` (covers `skeleton`, `full`, `worktree_scaling`, `many_branches`, `divergent_branches` cold variants)
- `benches/list.rs` `bench_real_repo` cold path
- `benches/remove.rs` `first_output`
- `benches/time_to_first_output.rs` `remove`

Other `BatchSize::SmallInput` sites in `remove.rs` (`no_hooks`, `with_hooks`) use `recreate_worktree` as setup. That setup doesn't invalidate a cache the routine repopulates, so they stay as-is.

`benches/CLAUDE.md` "Cache Handling" gets a paragraph so future bench authors reach for `PerIteration` when the setup invalidates a cache the routine repopulates.

## Test plan

- [x] `cargo run -- hook pre-merge --yes`: 3667 tests pass, 0 failed
- [x] Each cold variant run before+after on this branch with `--save-baseline` / `--baseline`

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

- `picker_preview` benchmark group measuring "process spawn → all preview tasks drained" for `wt switch`'s interactive picker.
- `WORKTRUNK_PREVIEW_BENCH=1`, an early-exit gate inside `handle_picker` that runs the full prelude (collect, speculative spawn, skeleton, initial + deferred precompute, `orchestrator.wait_for_idle()`) and returns before skim launches or any JSON / stderr I/O. Shares the dry-run path; behavior with the env var unset is unchanged.
- Direct measurement for the recent work in `cargo bench --bench list` (lazy fixtures + lighter real-repo sampling) #2685 / perf(list): cache ahead/behind counts SHA-keyed; skip the for-each-ref walk on warm runs #2704, which were tuned against `wt list` as a proxy because no direct picker measurement existed.

## Why this measurement
The picker submits one preview-compute task per row to the global rayon pool. The user-visible quantity to optimize is the responsiveness window between picker launch and "all previews ready" (the point at which j/k navigation hits cached content). Option 1 from the task, headless wall clock to drain, is the cleanest measurable proxy and avoids the PTY route, which hits the documented nextest/SIGTTOU pain on shell-integration-tests. PTY-driven first-interactive-ready can be a follow-up.

## Variants
- `picker_preview/warm/typical-8`
- `picker_preview/cold/typical-8`

Cold uses `BatchSize::PerIteration` (not `SmallInput`): `SmallInput` calls `setup` for an entire batch up front and then runs timed routines back-to-back, so only the first iter in each batch is genuinely cold; the rest hit a freshly populated `.git/wt/cache/`. `PerIteration` invalidates immediately before every measured iteration; setup is far cheaper than `wt switch`, so per-iter `Instant::now` doesn't dominate.

`sample_size(10)` + `measurement_time(35s)` per #2685's lead; slow benches don't benefit from the default 30 samples.

`cfg(unix)`-gated with a no-op `main` on Windows; the picker is Unix-only and `wt switch` (no args) hits the unavailable path before the env var is consulted.

## Sample run
## Test plan

- `cargo bench --bench picker_preview` runs cleanly on both variants
- `cargo run -- hook pre-merge --yes`: 3667 tests pass
- `test_picker_preview_bench_produces_no_output` asserts `WORKTRUNK_PREVIEW_BENCH=1` keeps stdout/stderr empty (covers the env-gated branch, locks the no-I/O contract)
- `wt switch` with `WORKTRUNK_PREVIEW_BENCH` unset still hits the TTY error path (user-visible behavior unchanged)
- `WORKTRUNK_PICKER_DRY_RUN=1` still emits the cache JSON dump (regression check)
- `/review-codex` pass clean after iterating on three findings (packed-refs fix already on `main` via fix(bench): stop deleting packed-refs in invalidate_caches_auto #2697 once branch was rebased; `BatchSize::PerIteration` for true per-iter invalidation; `cfg(unix)` gate for Windows)

🤖 Generated with Claude Code
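The no-I/O contract that the test plan locks down can be sketched as an early return placed ahead of any write. This is a hedged illustration only: `preview_bench_enabled` and `handle_picker_sketch` are hypothetical names, the exact-match check against `"1"` is an assumption about how the gate reads the variable, and the output line stands in for whatever the real picker would emit after the prelude.

```rust
use std::io::Write;

/// Hypothetical gate check. Assumption: only the exact value "1" enables
/// bench mode; the real check may differ.
fn preview_bench_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Hypothetical picker entry point: runs a pretend prelude, then either
/// returns early (bench mode) or continues and writes output.
fn handle_picker_sketch(
    env_value: Option<&str>,
    out: &mut impl Write,
) -> std::io::Result<&'static str> {
    // Pretend prelude: collect rows and wait for previews to drain.
    let previews_ready = 8;

    if preview_bench_enabled(env_value) {
        // Early exit before anything is written to `out`.
        return Ok("bench-exit");
    }

    writeln!(out, "{previews_ready} previews ready")?;
    Ok("continued")
}

fn main() {
    let value = std::env::var("WORKTRUNK_PREVIEW_BENCH").ok();
    let mut stdout = std::io::stdout();
    handle_picker_sketch(value.as_deref(), &mut stdout).expect("write to stdout failed");
}
```

Passing the writer in as a parameter is a design choice for the sketch: it lets a test hand in an in-memory buffer and assert that it stays empty when the gate is set.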