Skip to content

test(perf): Tier 1 WARM suite — statedb CRUD + storage-mediated group lifecycle#1234

Open
JMBattista wants to merge 2 commits into
asheshgoplani:mainfrom
JMBattista:feature/tier1-warm-suite
Open

test(perf): Tier 1 WARM suite — statedb CRUD + storage-mediated group lifecycle#1234
JMBattista wants to merge 2 commits into
asheshgoplani:mainfrom
JMBattista:feature/tier1-warm-suite

Conversation

@JMBattista
Copy link
Copy Markdown
Contributor

@JMBattista JMBattista commented May 31, 2026

Tier 1 WARM perf suite — deferred items from #790 / #1047

Builds out a basic suite of Tier 1 (tmpfs walltime) TestPerf_* gates for the deferred persistence-touching lifecycle paths called out in PR #790 and #1047, per the mandate in docs/perf-budget-suite.md — specifically the "Tier 1 vs Tier 2" section.

All new tests:

  • Run against t.TempDir() (tmpfs on CI Linux) — fsync is effectively free, so the measured cost is CPU + Go-runtime cost only. No mkdir under repo paths.
  • Classify as WARM (pure in-process Go against an SQLite handle on tmpfs — no process boundary, no real-disk fsync, no child spawn, no network), with a one-line "why WARM not COLD" justification in each file header.
  • Use testutil.WarmBudget + testutil.TrimmedMeanWarm (or TrimmedMeanWithSetup where a per-iteration fixture rebuild is excluded from timing).
  • Call testutil.SkipIfShort(t).
  • Cite the local median measured under -race at PERF_BUDGET_MULTIPLIER=1.0 next to the base constant.

No real tmux is touched. No network. Pure-Go in-process SQLite only. (internal/session Storage is constructed directly against a tmpfs DB; seeded rows carry TmuxSession="" so the lazy-reconnect branch in LoadWithGroups never executes. internal/statedb opens the DB directly.)

New TestPerf_* tests

Local medians measured under -race on a WSL2 dev host (the CI Xeon container will differ; budgets carry the ×3 WARM / ×6 CI headroom that absorbs this).

Test Path exercised Local median (-race, ×1.0) Base Budget @1.0 Budget @2.0 (CI)
TestPerf_StateDB_InsertN 50× SaveInstance into empty table (per-session insert) ~206 ms 210 ms 630 ms 1.26 s
TestPerf_StateDB_ReadByID 50× InstanceExists (rm-verify point query, #909) ~9.3 ms 10 ms 30 ms 60 ms
TestPerf_StateDB_List one LoadInstances over 50 rows (reload scan) ~6.9 ms 7 ms 21 ms 42 ms
TestPerf_StateDB_DeleteN 50× DeleteInstance (agent-deck rm xargs path, #909) ~166 ms 170 ms 510 ms 1.02 s
TestPerf_StateDB_WatcherEventIngest ingest 50 events at 500-event steady state (SaveWatcherEvent insert + prune DELETE … NOT IN) ~1.42 s 1.45 s 4.35 s 8.7 s
TestPerf_Group_CreateDelete real handleGroupCreate/handleGroupDelete choreography: LoadWithGroupsNewGroupTreeWithGroups → create/delete → SaveWithGroups (×2), over a 50-instance deck ~67 ms 70 ms 210 ms 420 ms

WarmBudget = max(base × 3, 1 ms) × PERF_BUDGET_MULTIPLIER. None of the bases hit the 1 ms floor.

Coverage rationale

  1. StateDB CRUD scaling (internal/statedb/statedb_perf_test.go) — insert / read-by-id / list / delete, each its own test, at N = 50. These deliberately gate the storage primitives (the task framed this as "StateDB CRUD scaling"), not an assembled flow. 50 is the representative upper-end deck size cited in code (internal/web/handlers_costs.go:529: the sidebar "grows past ~50 sessions"; llms-full.txt cites 30-session workspaces) — not an invented number. DB path is t.TempDir().

  2. Group lifecycle, storage-mediated (internal/session/group_perf_test.go) — test(perf): add walltime regression suite for cold start and group lifecycle #790 explicitly carved group create/delete out because doing it in-memory "exercises the wrong layer" (map ops are nanoseconds). This test gates it by reconstructing the exact storage choreography the real handlers perform (cmd/agent-deck/group_cmd.go handleGroupCreate ~L396-442 / handleGroupDelete ~L689-705):

    storage.LoadWithGroups()                     // full read of all instances + groups
    session.NewGroupTreeWithGroups(insts, grps)  // rebuild the tree
    groupTree.CreateGroup / DeleteGroup          // the mutation
    storage.SaveWithGroups(insts, groupTree)     // rewrite the WHOLE instances table + groups + Touch() + dedup
    

    Deliberately not Storage.SaveGroupsOnly — that is the lightweight expand/collapse visual-state path (its own doc comment says so) and skips both the instances round-trip and Touch(). The dominant cost in the real path is the instances-table read+rewrite, which is why this runs against a populated 50-instance deck rather than an empty DB.

    Out of scope (follow-up): the literal CLI shell — flag parsing, NewStorageWithProfile, out.Success/os.Exit. Driving that in-process needs the same injection seam deferred for session create/delete (below). The seam-free remainder carries no meaningful, tmux-touching cost, so the storage choreography above is the faithful Tier-1 target.

  3. Watcher-event ingestion (internal/statedb) — a distinct persistence lifecycle: SaveWatcherEvent does INSERT OR IGNORE then a prune DELETE … WHERE id NOT IN (SELECT … LIMIT N). The prune subquery cost scales with steady-state table size, so the table is pre-filled to the default 500-event bound (max_events_per_watcher default 500, internal/session/userconfig.go:3419) and a 50-event batch is timed on top.

Tooling

  • Broadens make test-perf from ./cmd/agent-deck/... to ./... so these internal-package gates run locally. This matches .github/workflows/perf-smoke.yml, whose perf-tests job already runs ./..., so no CI workflow change is needed — the new tests are picked up automatically.

Verification

  • make test-perf at PERF_BUDGET_MULTIPLIER=1.0 and 2.0: all six new tests green (verified at both multipliers).
  • go test -race -count=1 ./internal/statedb/: green.
  • The six new tests pass under -race at both multipliers with comfortable margins.

Note: this dev host (WSL2) cannot pass the pre-existing cold-start gates (TestPerf_ColdStart_*, ~300 ms process spawn vs a 40 ms budget calibrated on the CI Xeon) or three pre-existing internal/session shell/lifecycle timing tests (TestStatusCycle_ShellSessionWithCommand, TestLifecycle_StoppedRestartedRunningError, TestBuildCodexCommand_CustomWrapperPreservesToolIdentity). All of these fail identically with this PR's files removed — they are host-specific and untouched here. No cold-start test or existing budget was modified.

Non-goals (per task scope)

  • No Tier 2 count assertions. A sibling session owns the outbox/drain Tier 2 work. Where a count gate clearly belongs (the watcher-event insert+prune path), a // TODO(tier2) with a one-line rationale is left in place.
  • Session create/delete Tier 1 gating is intentionally not included: it requires an injection seam to avoid real tmux/process spawn, which is a separate, larger piece of work. (The group test's CLI-shell wrapper shares this seam dependency.) Left as a follow-up.
  • No cold-start changes, no budget widening, no benchstat/trending infra.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Performance testing expanded to run locally across the entire module, matching CI behavior
    • Added performance benchmarks for group lifecycle operations (create, persist, delete cycles)
    • Added performance benchmarks for state database operations (insert, read by ID, list, delete, and event ingestion)
  • Chores
    • Adjusted local perf test target to cover all packages (broader scope)

…ycle

Adds the deferred Tier 1 (tmpfs walltime) perf gates from PR asheshgoplani#790/asheshgoplani#1047
that suit CPU/Go-runtime cost measurement, per docs/perf-budget-suite.md
("Tier 1 vs Tier 2"). All run against t.TempDir() (tmpfs on CI Linux),
classify as WARM, and use WarmBudget + TrimmedMeanWarm / TrimmedMeanWithSetup.

New tests (local medians under -race @ multiplier=1.0):
- internal/statedb (StateDB CRUD scaling, N=50 cited deck size):
  - TestPerf_StateDB_InsertN          ~206ms
  - TestPerf_StateDB_ReadByID         ~9.3ms
  - TestPerf_StateDB_List             ~6.9ms
  - TestPerf_StateDB_DeleteN          ~166ms
  - TestPerf_StateDB_WatcherEventIngest ~1.42s (insert+prune at 500-event steady state)
- internal/session (storage-mediated group lifecycle, asheshgoplani#790's carve-out):
  - TestPerf_Group_CreateDelete       ~8.5ms (SaveGroupsOnly round trip)

Each cites its measured median next to the base constant and a file-level
WARM-not-COLD justification. No real tmux, no network, pure-Go in-process
SQLite on tmpfs.

Also broadens `make test-perf` from ./cmd/agent-deck/... to ./... so these
internal-package gates run locally, matching the perf-smoke.yml CI gate
which already runs the whole module.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 31, 2026 16:59
@JMBattista JMBattista requested a review from asheshgoplani as a code owner May 31, 2026 16:59
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 31, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 1b7f80ce-a58a-4288-8f7e-834d436fb8e7

📥 Commits

Reviewing files that changed from the base of the PR and between c00eb8b and c1d0f15.

📒 Files selected for processing (1)
  • internal/session/group_perf_test.go

📝 Walkthrough

Walkthrough

Adds Tier-1 warm performance tests: five StateDB perf gates (insert, read-by-ID, list, delete, watcher-event ingest), one Session group create-delete perf test, and expands Makefile test-perf to run perf tests across the whole module (./...).

Changes

Performance Test Suite Expansion

Layer / File(s) Summary
Makefile test-perf target expansion
Makefile
Makefile test-perf target scope expanded from ./cmd/agent-deck/... to ./... with comments clarifying Tier-1 test coverage alignment with CI perf-smoke gate.
StateDB perf test infrastructure and CRUD tests
internal/statedb/statedb_perf_test.go
Defines perf-gate constants, implements newPerfDB, perfRows, and clearInstances helpers for isolated SQLite-backed fixtures, then adds CRUD perf gates: TestPerf_StateDB_InsertN, TestPerf_StateDB_ReadByID, TestPerf_StateDB_List, TestPerf_StateDB_DeleteN measured via testutil trimmed-mean warm budgets.
StateDB watcher event steady-state performance test
internal/statedb/statedb_perf_test.go
Defines watcher steady-state and batch constants and implements TestPerf_StateDB_WatcherEventIngest, which pre-fills watcher_events then measures batched SaveWatcherEvent insert+prune behavior against a warm budget.
Session group create-delete performance test
internal/session/group_perf_test.go
Adds TestPerf_Group_CreateDelete that seeds a deck, then times repeated LoadWithGroups → CreateGroup/DeleteGroup → SaveWithGroups cycles on an isolated migrated SQLite Storage, enforcing a trimmed-mean warm budget.

Sequence Diagram(s)

sequenceDiagram
  participant Test as TestPerf_StateDB_*
  participant StateDB as StateDB
  participant Testutil as testutil.TrimmedMean*
  Test->>StateDB: newPerfDB (migrate, open temp DB)
  Test->>StateDB: perfRows / SaveInstance / LoadInstances / DeleteInstance (setup)
  Test->>Testutil: TrimmedMean* (measure)
  Testutil->>StateDB: invoke operation (timed call)
  Testutil-->>Test: trimmed mean result
  Test->>Test: compare against WarmBudget
Loading

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 7
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title follows the Conventional Commits format with the 'test:' prefix and clearly summarizes the changes: adding Tier 1 WARM performance tests for statedb CRUD operations and storage-mediated group lifecycle.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 70.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Remote_parity ✅ Passed PR does not modify TUI code in internal/ui/ or cmd/agent-deck/; it only adds test files (group_perf_test.go, statedb_perf_test.go) and updates Makefile test scope. Check not applicable.
Test_coverage_per_surface ✅ Passed PR adds only performance test infrastructure for existing internal components; no new user-facing features introduced, so surface coverage requirement is not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds Tier-1 WARM performance gate tests for SQLite-backed persistence paths (StateDB CRUD, watcher event ingestion, and storage-mediated group lifecycle), and updates the local perf test make target to execute perf gates across all packages.

Changes:

  • Added StateDB Tier-1 perf tests covering insert/read/list/delete and watcher event ingest+prune paths.
  • Added Session/Storage Tier-1 perf test covering group create→persist→delete→persist lifecycle.
  • Updated make test-perf to run TestPerf_* across ./... so new perf tests in any package run locally like CI.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
internal/statedb/statedb_perf_test.go Introduces Tier-1 WARM perf gates for StateDB CRUD and watcher event ingest+prune.
internal/session/group_perf_test.go Adds Tier-1 WARM perf gate for storage-mediated group create/delete persistence round trip.
Makefile Expands test-perf scope to run perf tests across the whole module.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +269 to +286
tx, err := db.DB().Begin()
if err != nil {
t.Fatalf("begin prefill: %v", err)
}
stmt, err := tx.Prepare(`INSERT INTO watcher_events (watcher_id, dedup_key, created_at) VALUES (?, ?, ?)`)
if err != nil {
t.Fatalf("prepare prefill: %v", err)
}
now := time.Now().Unix()
for i := 0; i < perfWatcherSteadyState; i++ {
if _, err := stmt.Exec(watcherID, fmt.Sprintf("base-%05d", i), now+int64(i)); err != nil {
t.Fatalf("prefill exec: %v", err)
}
}
_ = stmt.Close()
if err := tx.Commit(); err != nil {
t.Fatalf("commit prefill: %v", err)
}
…aphy

The first version called Storage.SaveGroupsOnly — but that is the lightweight
expand/collapse visual-state path (skips the instances round-trip and Touch()).
The actual user-facing handlers (cmd/agent-deck/group_cmd.go handleGroupCreate /
handleGroupDelete) instead do:

  LoadWithGroups → NewGroupTreeWithGroups → CreateGroup/DeleteGroup → SaveWithGroups

where the dominant cost is the full instances-table read + rewrite (+ Touch +
dedup). The old test gated a cheaper sibling op and would miss regressions in
the real path.

Rewrites TestPerf_Group_CreateDelete to reconstruct that exact choreography
against a populated 50-instance deck (the cited upper-end size). Seeded rows
carry TmuxSession="" so LoadWithGroups' lazy tmux.ReconnectSessionLazy branch
never runs — still zero tmux code. Net-zero per iteration (create then delete
the same ephemeral group), so TrimmedMeanWarm still applies.

Local median under -race jumps ~8.5ms → ~67ms, confirming the old test
measured a fraction of the real cost. Base 70ms → WarmBudget 210ms local /
420ms CI.

The CLI shell itself (flag parse, NewStorageWithProfile, out.Success/os.Exit)
needs the deferred injection seam and is out of scope; the seam-free remainder
carries no meaningful, tmux-touching cost.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants