test(perf): gate the durable per-parent outbox drain (WARM walltime + Tier 2 fsync count)#1235
Conversation
… Tier 2 fsync count) The outbox drain (DrainInboxForParent, run on every conductor Stop hook and heartbeat — asheshgoplani#1226/asheshgoplani#1227) was the highest-ROI ungated hot path from the deferred list in asheshgoplani#790/asheshgoplani#1047. This adds the first TestPerf_* gates for it, per the mandate in docs/perf-budget-suite.md. - internal/session/outbox_drain_perf_test.go: - TestPerf_OutboxDrain_Drain25 (Tier 1, WARM): TrimmedMeanWithSetup over a representative 25-child inbox (N from the live drain test's `const n = 25`). fsync is stubbed to a no-op via the new seam so the timed window is pure in-process Go cost — the WARM/tmpfs contract holds on any filesystem, not just CI tmpfs. base = 2.1ms observed local median (-race, mult 1.0); budget = 6.3ms local / 12.6ms CI. - TestPerf_OutboxDrain_FsyncCount (Tier 2, count assertion): the backing is *os.File (writeFileDurable), so the doc's count-assertion tier applies. The drain issues exactly 2 file + 2 dir fsyncs for ANY non-empty drain (WAL stage + consumed-ledger persist, each batched), proven identical for N=1 and N=25 — a fsync-per-record regression would scale the file count with N. - internal/session/inbox.go: add the `fsyncFile` seam (default (*os.File).Sync, zero prod overhead) routing writeFileDurable + fsyncDir, plus SetFsyncHookForTest. This is the seam docs/perf-budget-suite.md anticipates for Tier 2 count gates. - internal/testutil/fsynccounter.go: reusable FsyncCounter wrapper (buckets file vs directory syncs) — the shared Tier 2 instrumentation the doc frames as infrastructure for every future persistence-touching gate. - Makefile + docs/perf-budget-suite.md: align `make test-perf` scope to ./... (matching .github/workflows/perf-smoke.yml) so a TestPerf_* in any package — e.g. this internal/session gate — is exercised, not just cmd/agent-deck. No production behavior change: the default fsync seam is (*os.File).Sync. Verification: - make test-perf @1.0 and @2.0: both new gates green (drain ~2ms; fsync 2+2). - go test -race -count=1 ./internal/session/ ./internal/testutil/: green (inbox/drain/outbox/stophook + perf). Pre-existing sleep-timing flakes in the status/lifecycle tests reproduce identically on the clean tree and are unrelated to this change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR adds performance and I/O regression tests for the outbox drain path by introducing fsync instrumentation, a test hook seam in the inbox durability layer, and two perf tests that verify both walltime bounds and fsync count invariants. The test-perf Makefile target is expanded to discover these tests repository-wide. ChangesOutbox drain perf instrumentation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 7✅ Passed checks (7 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/session/inbox.go`:
- Around line 57-61: The current SetFsyncHookForTest swaps the global fsyncFile
seam unsafely which races with concurrent durable-write code like
writeFileDurable; update the comment to state the real constraint (durable-write
code must not run concurrently with the seam swap) and make the swap
concurrency-safe by protecting fsyncFile access: either change
SetFsyncHookForTest to swap via a package-level sync.Mutex (or use
atomic.Value/unsafe.Pointer) and ensure writeFileDurable reads the seam under
the same synchronization mechanism, or document that tests must synchronize with
that mutex before invoking durable-write paths; reference SetFsyncHookForTest,
fsyncFile and writeFileDurable when applying the change.
In `@internal/testutil/fsynccounter.go`:
- Around line 60-67: The current check conflates Stat errors with regular files;
update the fsync counting in the method that calls f.Stat() to treat a Stat
error separately by adding a new counter (e.g., c.unknownSyncs) to the
fsynccounter struct and incrementing it when f.Stat() returns an error, while
keeping the existing increments of c.dirSyncs and c.fileSyncs only when err ==
nil (dir vs non-dir); ensure you guard unknownSyncs with c.mu like the others
and add a short comment documenting that unknownSyncs represents cases where
Stat failed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: bfaf94cf-8382-43f6-b6fa-9ca7f789850e
⛔ Files ignored due to path filters (1)
docs/perf-budget-suite.mdis excluded by!**/*.md
📒 Files selected for processing (4)
Makefileinternal/session/inbox.gointernal/session/outbox_drain_perf_test.gointernal/testutil/fsynccounter.go
| func SetFsyncHookForTest(fn func(*os.File) error) func() { | ||
| prev := fsyncFile | ||
| fsyncFile = fn | ||
| return func() { fsyncFile = prev } | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify no tests in internal/session call t.Parallel, which would
# race with the global fsyncFile swap.
rg -nP --type=go -C2 '\bt\.Parallel\s*\(' internal/session/Repository: asheshgoplani/agent-deck
Length of output: 2890
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Find fsyncFile declaration/uses =="
rg -n --type=go "var\s+fsyncFile|fsyncFile\s*=" internal/session/
echo
echo "== Find SetFsyncHookForTest call sites in internal/session tests =="
rg -n --type=go "SetFsyncHookForTest\s*\(" internal/session/
echo
echo "== Find where fsyncFile is passed/called =="
rg -n --type=go "\bfsyncFile\b" internal/session/Repository: asheshgoplani/agent-deck
Length of output: 1117
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Narrow to likely relevant files to avoid noise
echo "== List go files under internal/session =="
fd --type f --extension go internal/session/ | sort
echo
echo "== Inspect internal/session/inbox.go around fsyncFile + SetFsyncHookForTest =="
# show around line ~40-90; adjust if file shorter
sed -n '1,140p' internal/session/inbox.go | cat -nRepository: asheshgoplani/agent-deck
Length of output: 503
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find test files under internal/session and show t.Parallel presence
echo "== Go test files under internal/session =="
fd --type f --extension go internal/session/ -a | rg "_test\.go$" | sort
echo
echo "== Search for SetFsyncHookForTest in *_test.go =="
rg -n --type=go "SetFsyncHookForTest\s*\(" internal/session/ -g "*_test.go"
echo
echo "== Search for durable-write / fsync hooks usage hints =="
rg -n --type=go "fsync|Fsync|durable|durability" internal/session/Repository: asheshgoplani/agent-deck
Length of output: 503
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect tests that have t.Parallel and see if they touch inbox or durable-write paths.
# We'll just output surrounding blocks for t.Parallel occurrences in instance_test.go and other tests.
for f in $(fd --type f -g "*_test.go" internal/session/ | sort); do
echo "---- $f ----"
rg -n --type=go "\bt\.Parallel\s*\(" "$f" -n || true
echo
doneRepository: asheshgoplani/agent-deck
Length of output: 10525
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Specifically inspect internal/session/instance_test.go around the t.Parallel sections near the earlier lines.
sed -n '2920,3100p' internal/session/instance_test.go | cat -nRepository: asheshgoplani/agent-deck
Length of output: 6774
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== List go files under internal/session =="
fd --type f --extension go internal/session/ | sort
echo
echo "== Find fsyncFile declaration/uses =="
rg -n --type=go "var\s+fsyncFile|fsyncFile\s*=" internal/session/
echo
echo "== Find SetFsyncHookForTest call sites in internal/session (all go files) =="
rg -n --type=go "SetFsyncHookForTest\s*\(" internal/session/
echo
echo "== Find where fsyncFile is referenced (all go files) =="
rg -n --type=go "\bfsyncFile\b" internal/session/
echo
echo "== Read internal/session/inbox.go (first 160 lines) =="
sed -n '1,160p' internal/session/inbox.go | cat -n
echo
echo "== Search for fsync/durable hints across internal/session =="
rg -n --type=go "fsync|Fsync|durable|durability" internal/session/
echo
echo "== t.Parallel occurrences in *_test.go under internal/session =="
for f in $(fd --type f -g "*_test.go" internal/session/ | sort); do
echo "---- $f ----"
rg -n --type=go "\bt\.Parallel\s*\(" "$f" || true
done
echo
echo "== Inspect internal/session/instance_test.go around earlier t.Parallel lines =="
sed -n '2930,3105p' internal/session/instance_test.go | cat -nRepository: asheshgoplani/agent-deck
Length of output: 503
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show additional context around SetFsyncHookForTest definition in inbox.go
echo "== Context around SetFsyncHookForTest in inbox.go =="
rg -n --type=go "func\s+SetFsyncHookForTest" internal/session/inbox.go
line=$(rg -n --type=go "func\s+SetFsyncHookForTest" internal/session/inbox.go | head -n1 | cut -d: -f1)
start=$((line-20))
end=$((line+40))
sed -n "${start},${end}p" internal/session/inbox.go | cat -nRepository: asheshgoplani/agent-deck
Length of output: 3568
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect where fsyncFile is called in codebase for durability path.
# Show around each match of fsyncFile usage with context lines.
rg -n --type=go "\bfsyncFile\b" internal/session/ | awk -F: '{print $1}' | sort -u | while read -r f; do
echo "---- file: $f ----"
done
rg -n --type=go "\bfsyncFile\b" internal/session/ -C2Repository: asheshgoplani/agent-deck
Length of output: 1821
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find any tests that call SetFsyncHookForTest and show surrounding code.
# We'll print the enclosing function for each call by using rg with context.
rg -n --type=go "SetFsyncHookForTest\s*\(" internal/session -C3 -g "*_test.go"Repository: asheshgoplani/agent-deck
Length of output: 1227
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Also check if any test files are in package session vs session_test, which affects access to SetFsyncHookForTest.
for f in $(fd --type f -g "*_test.go" internal/session/ | sort); do
pkg=$(head -n 5 "$f" | rg -n "package " | head -n1 || true)
echo "$f: ${pkg:-package <unknown>}"
doneRepository: asheshgoplani/agent-deck
Length of output: 11744
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Find fsyncFile declaration/uses =="
rg -n --type=go "var\s+fsyncFile|fsyncFile\s*=" internal/session/ || true
echo
echo "== Find SetFsyncHookForTest call sites (all go files) =="
rg -n --type=go "SetFsyncHookForTest\s*\(" internal/session/ || true
echo
echo "== Find where fsyncFile is referenced (all go files) =="
rg -n --type=go "\bfsyncFile\b" internal/session/ -C2 || true
echo
echo "== Read internal/session/inbox.go (first 200 lines) =="
sed -n '1,200p' internal/session/inbox.go | cat -n
echo
echo "== Search for fsync/durable hints across internal/session =="
rg -n --type=go "fsync|Fsync|durable|durability" internal/session/ || true
echo
echo "== t.Parallel occurrences in internal/session tests =="
rg -n --type=go "\bt\.Parallel\s*\(" internal/session/ -g "*_test.go" -C1 || true
echo
echo "== Inspect internal/session/instance_test.go around t.Parallel lines =="
sed -n '2920,3105p' internal/session/instance_test.go | cat -nRepository: asheshgoplani/agent-deck
Length of output: 36696
Fix the “package tests run serially” assumption in SetFsyncHookForTest
internal/session/inbox.goclaims tests are serial because perf tests don’t uset.Parallel, butinternal/session/instance_test.gohas multiplet.Parallel()tests/subtests (e.g.,TestWrapIgnoreSuspend,TestNewSandboxConfig), so the package isn’t fully serial.- Since
SetFsyncHookForTestswaps the globalfsyncFileseam andwriteFileDurableuses it, any futuret.Paralleltest that exercises durable-write paths would introduce a race with the global swap. - Update the comment to reflect the actual constraint (durable-write code must not run concurrently with the perf gate’s seam swap), and/or make the seam swap safe under concurrency (e.g., atomic/synchronized seam access).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/session/inbox.go` around lines 57 - 61, The current
SetFsyncHookForTest swaps the global fsyncFile seam unsafely which races with
concurrent durable-write code like writeFileDurable; update the comment to state
the real constraint (durable-write code must not run concurrently with the seam
swap) and make the swap concurrency-safe by protecting fsyncFile access: either
change SetFsyncHookForTest to swap via a package-level sync.Mutex (or use
atomic.Value/unsafe.Pointer) and ensure writeFileDurable reads the seam under
the same synchronization mechanism, or document that tests must synchronize with
that mutex before invoking durable-write paths; reference SetFsyncHookForTest,
fsyncFile and writeFileDurable when applying the change.
| if info, err := f.Stat(); err == nil && info.IsDir() { | ||
| c.mu.Lock() | ||
| c.dirSyncs++ | ||
| c.mu.Unlock() | ||
| } else { | ||
| c.mu.Lock() | ||
| c.fileSyncs++ | ||
| c.mu.Unlock() |
There was a problem hiding this comment.
Clarify or handle Stat error separately from file-type bucketing.
The current logic defaults to fileSyncs when f.Stat() fails. A Stat error (e.g., EBADF) doesn't indicate the descriptor's file-vs-directory type, so counting it as a file sync may misattribute the operation. Consider either:
- Skipping the count when
Statfails (document that as "unknown type"), or - Adding a third counter for "unknownSyncs", or
- Documenting explicitly that Stat failures default to file syncs as a benign simplification.
The current implementation works in practice (Stat on valid open descriptors won't fail in normal test scenarios), but the conflation of "not a directory" and "Stat error" is subtle.
🛡️ Proposed fix to handle Stat errors explicitly
- if info, err := f.Stat(); err == nil && info.IsDir() {
+ info, err := f.Stat()
+ if err != nil {
+ // Stat failed; can't determine file vs dir. Skip counting, or
+ // default to fileSyncs as a simplification (document the choice).
+ // For now, skip to avoid misattribution:
+ return real(f)
+ }
+ if info.IsDir() {
c.mu.Lock()
c.dirSyncs++
c.mu.Unlock()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/testutil/fsynccounter.go` around lines 60 - 67, The current check
conflates Stat errors with regular files; update the fsync counting in the
method that calls f.Stat() to treat a Stat error separately by adding a new
counter (e.g., c.unknownSyncs) to the fsynccounter struct and incrementing it
when f.Stat() returns an error, while keeping the existing increments of
c.dirSyncs and c.fileSyncs only when err == nil (dir vs non-dir); ensure you
guard unknownSyncs with c.mu like the others and add a short comment documenting
that unknownSyncs represents cases where Stat failed.
|
I'm ignoring the playwright test breakage, as an investigation shows they were broken for a while. |
Summary
Lands the first hard-gated
TestPerf_*for the durable per-parent outbox DRAIN —DrainInboxForParent, which runs on every conductor Stop hook and every heartbeat (DrainForStopHookandagent-deck inbox drain self), introduced in #1226 and activated in #1227. It was the highest-ROI item on the deferred list in #790/#1047: an unconditionally-hot lifecycle path with zero walltime gating.Authority for everything below is
docs/perf-budget-suite.md(the relocated #790/#1047 mandate): cold/warm classification, theWarmBudgetformula, the n=11 trimmed mean, the 1ms floor, and the Tier 1 / Tier 2 split for I/O-touching tests.What the drain does (path under test)
DrainInboxForParentis purely in-process and tmux-free. Two-phase, crash-safe:writeFileDurable), then truncate the inbox.writeFileDurable), drop the WAL.The durable backing is
*os.File(writeFileDurable: fsync data file → atomic rename → best-effort fsync parent dir), so both perf-budget tiers apply.The two gates (one per tier)
Tier 1 —
TestPerf_OutboxDrain_Drain25(WARM walltime)TrimmedMeanWithSetup:setup()rebuilds a fresh inbox of N=25 distinct-child records and wipes the WAL + consumed ledger each iteration (so every timed drain hits the identical steady-state path);op()timesDrainInboxForParentonly.TestIssue1225_DrainInbox_ConcurrentDrainNoDoubleNoLossfixes atconst n = 25; this reuseschildIDForN, the same id generator.t.TempDirbeing tmpfs, which it isn't on e.g. WSL2/ext4). WARM GC discipline is composed ontoTrimmedMeanWithSetup(auto-GC off + forced collection insidesetup()), exactly asperfbudget.goinstructs.t.TempDirviainboxTestHome(HOME override).Tier 2 —
TestPerf_OutboxDrain_FsyncCount(count assertion)Disk fsync latency is un-normalizable (~100× across SSD/HDD/cloud), so the I/O-pattern regression is gated on counts, per the doc. Derived by reading the drain code, not recording observed values: a non-empty drain of distinct-child, not-yet-consumed records issues exactly 2 file fsyncs + 2 directory fsyncs (WAL stage + ledger persist, each batched into one
writeFileDurable), independent of N. The test asserts the count is identical for N=1 and N=25 — a fsync-per-record regression would scale the file count with N. (The inbox truncate and WAL drop areos.Remove, no fsync.)Supporting infrastructure
internal/session/inbox.go— adds thefsyncFileseam (default(*os.File).Sync, zero production overhead) routingwriteFileDurable+fsyncDir, plusSetFsyncHookForTest. This is the*os.Fileseam the doc explicitly anticipates for Tier 2 gates. No production behavior change.internal/testutil/fsynccounter.go— reusableFsyncCounterwrapper bucketing file vs directory syncs. Landed as shared testutil (not a one-off) because the doc frames the wrapper as infrastructure for every future persistence-touching Tier 2 gate.Makefile+docs/perf-budget-suite.md— alignmake test-perfscope to./...(matching.github/workflows/perf-smoke.yml, which already uses./...) so aTestPerf_*in any package — like thisinternal/sessiongate — is exercised, not justcmd/agent-deck.Observed local median & budgets
Measured under
-race,PERF_BUDGET_MULTIPLIER=1.0, Linux WSL2 container, N=25, fsync no-op'd (disk-independent):baseconstant2.1msbase × 3= 6.3msbase × 3 × 2= 12.6msThe
basecites the observed median directly (per the mandate), with the WARM 3× factor for residual warm-host jitter.Verification (Linux WSL2; template per #1047)
make test-perf(mult 1.0) — outbox gatesPERF_BUDGET_MULTIPLIER=2.0 make test-perf— outbox gatesgo test -race -count=1 ./internal/session/(inbox/drain/outbox/stophook + perf)go test -race -count=1 ./internal/testutil/go build ./...gofmt -l(changed files)Known caveats (both pre-existing, both proven on the clean tree with this PR stashed — not introduced here):
TestPerf_ColdStart_{Help,Version}fail locally on this WSL2 box (cold start ~300–330ms vs the 40ms budget) — a slow-process-spawn environment artifact; the cold-start budget holds on CI'subuntu-latest. Untouched per the non-goals.internal/sessionstatus/lifecyclesleep-timing tests (TestStatusCycle_ShellSessionWithCommand,TestLifecycle_StoppedRestartedRunningError) flake under the loaded full-package-racerun; they pass in isolation and fail identically on the clean tree. Unrelated to the fsync seam.Non-goals honored
No cold-start changes; no existing budget widened; no benchstat/trending infra; no real tmux in any
TestPerf_*; the 1ms floor and n=11 convention are untouched.🤖 Generated with Claude Code
Summary by CodeRabbit
Tests
Chores