feat(agent-dashboard): persist hook status across Orca restart #1480

Merged
brennanb2025 merged 9 commits into main from brennanb2025/agent-status-preserve-restart
May 11, 2026

Conversation

Contributor

@brennanb2025 brennanb2025 commented May 6, 2026

Summary

  • Persists the hook server's per-pane lastStatusByPaneKey to userData/agent-hooks/last-status.json (atomic write, 250ms trailing debounce, sync flush on stop()). Fixes done, blocked, and quiet working rows winking out after restart.
  • Adds an agentStatus:drop IPC so renderer dismissals (dropAgentStatus, dismissRetainedAgentsByWorktree) propagate to the main-process cache and the on-disk file, preventing dismissed rows from resurrecting on relaunch.
  • Adds a bounded bootstrap queue in useIpcEvents so setListener() replay during window creation isn't dropped while App.tsx is still hydrating tabsByWorktree; drains on the workspaceSessionReady false→true transition.
  • The hydrate path now cleans the stale on-disk file synchronously when sanitize drops entries (drift, TTL, schema mismatch), and logs a single [agent-hooks] last-status hydrate dropped N entries (kept M) warning for visibility. Before this fix, corrupt or stale entries stayed on disk until a fresh hook event triggered a debounced write.
  • Post-merge with main: persistence and hydration are unconditional. The old experimentalAgentDashboard runtime gate is gone and survives only in persistence migration code. agentStatus:set is forwarded from main unconditionally with receivedAt and stateStartedAt.
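
For readers unfamiliar with the "250ms trailing debounce, sync flush on stop()" pattern named in the first bullet, here is a minimal sketch. It is not the code from this PR: `TrailingDebouncedWriter`, `TimerApi`, and `realTimers` are hypothetical names, and the timer is injected only so the behavior can be exercised without waiting on a real clock.

```typescript
// Hypothetical timer interface (an assumption of this sketch) so that real or
// fake timers can drive the debouncer.
interface TimerApi {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const realTimers: TimerApi = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

class TrailingDebouncedWriter {
  private handle: unknown = null;
  private pending = false;
  private readonly write: () => void;
  private readonly delayMs: number;
  private readonly timers: TimerApi;

  constructor(write: () => void, delayMs: number, timers: TimerApi = realTimers) {
    this.write = write;
    this.delayMs = delayMs;
    this.timers = timers;
  }

  // Every call restarts the timer, so the write runs delayMs after the LAST
  // mutation (trailing edge) rather than after the first one.
  schedule(): void {
    if (this.handle !== null) this.timers.clear(this.handle);
    this.pending = true;
    this.handle = this.timers.set(() => {
      this.handle = null;
      this.runNow();
    }, this.delayMs);
  }

  // Synchronous flush for shutdown: cancel the timer, then write if dirty.
  flush(): void {
    if (this.handle !== null) {
      this.timers.clear(this.handle);
      this.handle = null;
    }
    this.runNow();
  }

  private runNow(): void {
    if (!this.pending) return;
    this.pending = false;
    this.write();
  }
}
```

A caller would construct it as `new TrailingDebouncedWriter(writeFileAtomically, 250)`, call `schedule()` after each cache mutation, and call `flush()` from the shutdown path; `writeFileAtomically` is likewise a placeholder name.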

Test plan

  • pnpm typecheck (pnpm run tc:node, pnpm run tc:web)
  • pnpm test src/main/agent-hooks/server.test.ts (persistence/hydration tests)
  • pnpm test src/main/ipc/agent-hooks.test.ts (IPC handler tests)
  • pnpm test src/renderer/src/store/slices/agent-status-drop-ipc.test.ts (slice IPC fan-out)
  • pnpm test src/renderer/src/hooks/agent-status-bootstrap-queue.test.ts (queue/drain/cap)
  • Existing useIpcEvents.test.ts mocks updated for useAppStore.subscribe
  • Focused agent-status test suite: 114 tests pass post-merge
  • server.test.ts + agent-hooks.test.ts: 80/80 pass after hydrate-cleanup fix

Manual — verified via synthetic hook POSTs + CDP introspection on a dev build

Driven by sending real Claude-shaped payloads to the loopback hook server, then reading last-status.json and window.api.agentStatus.getSnapshot() to confirm behavior. Each quit/relaunch was an actual quit and relaunch of an Electron dev instance.

  • Synthetic event roundtrip: POST /hook/claude with a Stop payload writes a v2 envelope at last-status.json (mode 0o600); getSnapshot() returns the entry with receivedAt/stateStartedAt.
  • agentStatus:drop IPC: renderer call removes the entry from both the snapshot and the on-disk file within ~250ms.
  • Live overwrite of hydrated row: hydrated done entry + live UserPromptSubmit POST for the same paneKey → entry transitions to working with the new prompt and fresh receivedAt; on-disk file follows; no duplicate.
  • Corrupt JSON: truncated file → single [agent-hooks] last-status file is not valid JSON; ignoring warn; empty snapshot; dashboard renders normally.
  • Missing file: cold start with no file → no warns; empty snapshot.
  • Old version field: "version": 1 envelope → version mismatch (1 != 2); ignoring; empty snapshot.
  • tabId/paneKey drift rejected (with disk cleanup): file with one good entry + one entry whose tabId disagrees with the paneKey prefix → only the good one hydrates; drift entry purged from last-status.json synchronously during hydrate; single dropped 1 entries (kept 1) warn.
  • TTL cap (with disk cleanup): 5 fresh entries + 5 entries with receivedAt 10 days old → only the 5 fresh entries hydrate (HYDRATE_MAX_AGE_MS = 7d); 5 stale entries purged from last-status.json synchronously during hydrate; single dropped 5 entries (kept 5) warn.
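
The version, drift, and TTL checks exercised above can be sketched roughly as follows. This is an illustration, not the code in server.ts: the envelope layout (`{ version, entries }`), the entry shape, and the function names are assumptions, and the paneKey is assumed to have the form `<tabId>:<paneIndex>`, as the `tab-test-1:0` style keys elsewhere in this description suggest. Only `LAST_STATUS_FILE_VERSION = 2`, `HYDRATE_MAX_AGE_MS = 7d`, and the warn text come from the PR.

```typescript
const LAST_STATUS_FILE_VERSION = 2;
const HYDRATE_MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical entry shape; the PR text only names tabId and receivedAt.
interface PersistedStatusEntry {
  tabId: string;
  receivedAt: number;
}

interface HydrateResult {
  kept: Map<string, PersistedStatusEntry>;
  dropped: number;
}

function isFreshEntry(paneKey: string, value: unknown, now: number): boolean {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Partial<PersistedStatusEntry>;
  const sep = paneKey.lastIndexOf(":");
  if (sep <= 0) return false;
  // Drift check: the stored tabId must match the paneKey's tab segment.
  if (candidate.tabId !== paneKey.slice(0, sep)) return false;
  const receivedAt = candidate.receivedAt;
  if (typeof receivedAt !== "number" || !Number.isFinite(receivedAt)) return false;
  // TTL check: drop anything older than the hydrate cutoff.
  return now - receivedAt <= HYDRATE_MAX_AGE_MS;
}

function sanitizeLastStatusFile(raw: string, now: number): HydrateResult {
  const kept = new Map<string, PersistedStatusEntry>();
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { kept, dropped: 0 }; // corrupt JSON: ignore the whole file
  }
  if (typeof parsed !== "object" || parsed === null) return { kept, dropped: 0 };
  const envelope = parsed as { version?: unknown; entries?: unknown };
  if (envelope.version !== LAST_STATUS_FILE_VERSION) return { kept, dropped: 0 };
  const entries = envelope.entries;
  if (typeof entries !== "object" || entries === null || Array.isArray(entries)) {
    return { kept, dropped: 0 };
  }
  let dropped = 0;
  for (const [paneKey, value] of Object.entries(entries as Record<string, unknown>)) {
    if (isFreshEntry(paneKey, value, now)) {
      kept.set(paneKey, value as PersistedStatusEntry);
    } else {
      dropped++;
    }
  }
  return { kept, dropped };
}

// Returns the single warn line when anything was dropped, otherwise null.
function describeHydrate(result: HydrateResult): string | null {
  if (result.dropped === 0) return null;
  return `[agent-hooks] last-status hydrate dropped ${result.dropped} entries (kept ${result.kept.size})`;
}
```

When `dropped` is non-zero, the caller would rewrite the file from `kept` right away instead of waiting for the next debounced write, which is the disk-cleanup behavior the last two bullets verify.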

Manual — acknowledgedAgentsByPaneKey persistence + hydrate sanitizer (this branch's last two commits)

Verified the latest changes (d6827bd3 + bff99897) on a dev build with an isolated ORCA_DEV_USER_DATA_PATH=/tmp/orca-ack-restart-test, driving acknowledgeAgents and the persistence pipeline through CDP and inspecting orca-data.json directly.

  • Fresh cold start: getDefaultUIState() writes ui.acknowledgedAgentsByPaneKey: {} to orca-data.json; in-memory map matches.
  • Disk write through the existing debounced save: acknowledgeAgents(['tab-test-1:0', 'tab-test-2:0', 'tab-test-3:1']) → after the 150ms App.tsx debounce + 300ms persistence debounce, all 3 keys land under ui.acknowledgedAgentsByPaneKey with valid timestamps. Confirms the new field is wired into the existing window.api.ui.set effect.
  • Round-trip restore across quit/relaunch: quit the dev process, relaunch with the same userData → hydratePersistedUI restores the exact same 3 keys with their pre-quit timestamps.
  • Hydrate sanitizer drops bad/stale shapes: hand-edited orca-data.json to inject a mix of valid + malicious entries (TTL-expired @ 8d, negative, zero, non-numeric, __proto__ / constructor / prototype keys, plus 2 valid entries within the 7d TTL). After relaunch, only the 2 valid entries hydrated. No Object.prototype pollution from the __proto__ key.
  • On-disk cleanup follows hydrate: triggering one new ack post-hydrate caused the next ui:set write to overwrite orca-data.json with only the sanitized + new entries — all injected garbage was gone from disk.
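
A field-by-field sanitizer of the kind these checks exercise might look like the sketch below. The function name, the `ACK_MAX_AGE_MS` constant, and the choice of a null-prototype result object are assumptions for illustration; the rejection rules (null, non-object, array, prototype-pollution keys, non-finite or non-positive values, 7-day TTL) are the ones listed in this PR.

```typescript
// 7-day TTL, paralleling HYDRATE_MAX_AGE_MS; the constant name is hypothetical.
const ACK_MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000;
const FORBIDDEN_KEYS = new Set(["__proto__", "constructor", "prototype"]);

function sanitizeAcknowledgedAgents(input: unknown, now: number): Record<string, number> {
  // A null-prototype object means no inherited keys can leak into lookups.
  const out: Record<string, number> = Object.create(null);
  if (typeof input !== "object" || input === null || Array.isArray(input)) return out;
  for (const key of Object.keys(input)) {
    if (FORBIDDEN_KEYS.has(key)) continue;
    const value = (input as Record<string, unknown>)[key];
    // Timestamps must be finite, positive numbers.
    if (typeof value !== "number" || !Number.isFinite(value) || value <= 0) continue;
    // Drop acknowledgements older than the TTL.
    if (now - value > ACK_MAX_AGE_MS) continue;
    out[key] = value;
  }
  return out;
}
```

Because only own enumerable keys are copied into a fresh object, a hand-edited `__proto__` entry in orca-data.json is skipped rather than assigned.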

Manual — still requires a real agent to verify end-to-end (not yet done)

  • Real done agent restart: row reappears within first dashboard frame.
  • Real blocked Claude restart: row reappears with prompt + tool name.
  • Real quiet working restart: last-known row reappears; updates naturally on next event.
  • Dismiss stickiness via the X button across restart.
  • Worktree archive purges entries from last-status.json across restart.
  • Startup timing across many restored tabs/worktrees: no orphan rows for missing tabs.
  • Cursor/opencode sidebar title/status indicators still animate from hook state.

Made with Orca 🐋

brennanb2025 and others added 9 commits May 5, 2026 17:45
Hydrates the hook server's per-pane lastStatusByPaneKey from
userData/agent-hooks/last-status.json before binding the HTTP listener,
mirrors mutations to disk via a 250ms trailing debounce, and flushes
synchronously on stop(). Renderer dismissals fan out a new
agentStatus:drop IPC so the on-disk file evicts the entry and a
relaunch cannot resurrect it. Adds a bounded bootstrap queue in
useIpcEvents so events replayed by setListener() during window creation
are not dropped while App.tsx is still hydrating tabsByWorktree.

Gated on settings.experimentalAgentDashboard. Done, blocked, and quiet
working rows now all survive across restart.

Co-authored-by: Orca <help@stably.ai>
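
The bounded bootstrap queue described in this commit could be sketched as below. The class name, the cap value, and the drop-oldest overflow policy are assumptions; the PR only states that the queue is bounded and that it drains on the workspaceSessionReady false→true transition.

```typescript
class BootstrapQueue<T> {
  private buffered: T[] = [];
  private ready = false;
  private readonly deliver: (item: T) => void;
  private readonly cap: number;

  constructor(deliver: (item: T) => void, cap: number) {
    this.deliver = deliver;
    this.cap = cap;
  }

  // Before the store is ready, hold events; afterwards pass them straight through.
  push(item: T): void {
    if (this.ready) {
      this.deliver(item);
      return;
    }
    // Assumed overflow policy: discard the oldest buffered event.
    if (this.buffered.length >= this.cap) this.buffered.shift();
    this.buffered.push(item);
  }

  // Drain exactly once per false -> true transition, preserving arrival order.
  setReady(next: boolean): void {
    const wasReady = this.ready;
    this.ready = next;
    if (!wasReady && next) {
      const drained = this.buffered;
      this.buffered = [];
      for (const item of drained) this.deliver(item);
    }
  }
}
```

In a renderer hook, `push` would be called from the IPC listener and `setReady` from a store subscription that watches the ready flag.
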
Address review findings on the retention-restart branch:

- Wrap agentStatus:getSnapshot and agentStatus:drop IPC handlers in
  try/catch so a throw cannot surface as an unhandled invoke rejection
  (silent startup-hydration failure) or crash main from a fire-and-
  forget listener.
- runStatusPersist no longer permanently suppresses gate-off deletion
  retries on transient unlink errors (e.g. EPERM); deletedOnDisable
  now flips only on success or ENOENT.
- Tighten tests: stale-version-hydrate now asserts the warn message
  content; getSnapshot test uses toEqual; drop-handler test rejects
  null/{}/[] in addition to the prior bad inputs.

Co-authored-by: Orca <help@stably.ai>
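
The deletedOnDisable fix in this commit amounts to treating only success and ENOENT as "done". A minimal sketch of that rule, as of this commit (the gate itself was removed later in the branch), with a hypothetical `GateOffCleaner` wrapper and an injected `unlink` callback standing in for the real file deletion:

```typescript
interface ErrnoLike {
  code?: string;
}

class GateOffCleaner {
  deletedOnDisable = false;
  private readonly unlink: () => void;

  constructor(unlink: () => void) {
    this.unlink = unlink;
  }

  run(): void {
    if (this.deletedOnDisable) return; // already cleaned up, nothing to retry
    try {
      this.unlink();
      this.deletedOnDisable = true;
    } catch (err) {
      // A missing file is as good as a deleted one.
      if ((err as ErrnoLike | null)?.code === "ENOENT") {
        this.deletedOnDisable = true;
      }
      // Any other error (e.g. EPERM) leaves the flag false so the next run retries.
    }
  }
}
```
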
…aneKey drift

- Drop hydrate entries older than 7 days (HYDRATE_MAX_AGE_MS) so stale
  rows from worktrees archived weeks ago do not pile up forever. PTY-
  teardown eviction handles closed panes; the TTL covers daemon-restored
  PTYs that never re-attach and crash-recovery paths.
- Reject hydrate entries whose `tabId` field diverges from the paneKey's
  tab segment. Cheap defensive add against future renamer/shape drift.

Doc updated to move TTL out of the follow-ups list (now in scope).
Tests: new "drops hydrate entries older than the TTL cutoff" and "drops
a hydrate entry whose tabId disagrees with the paneKey prefix"; existing
hydrate fixtures now use a `recentTs()` helper instead of fixed 2023
timestamps.

Co-authored-by: Orca <help@stably.ai>
Apply review-fix corrections on the agent-dashboard restart-persistence
work:

- Split dropStatusEntry from clearPaneState so renderer-driven dismiss
  IPC no longer wipes lastPromptByPaneKey/lastToolByPaneKey for a
  still-alive pane.
- Validate paneKey shape at the IPC boundary (isValidPaneKey).
- Let getSnapshot errors propagate instead of silently returning [] —
  matches the renderer's existing .catch and avoids masking a broken
  persistence path.
- Trust main's authoritative timing.stateStartedAt unconditionally on
  same-state pings; fall back to existing only when timing is absent.
- Use strict < on the snapshot/live updatedAt guard so two events in
  the same millisecond don't drop the second one (a <= guard regressed
  two existing slice tests).
- Don't reset snapshotRequestedForReadyWindow in the catch handler;
  combined with the per-store-update subscriber it would retry-storm
  on persistent IPC failure.
- scheduleStatusPersist now resets the timer on each call (true
  trailing-edge debounce) instead of leading-edge throttle.
- Fix doc references that named clearPaneState in dismiss/IPC context
  where the implementation uses dropStatusEntry; add type-level JSDoc
  on AgentStatusIpcPayload.

109/109 in-scope tests pass.

Co-authored-by: Orca <help@stably.ai>
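
The strict `<` guard mentioned in this commit can be illustrated with a small sketch. `TimedStatus` and `shouldApplyIncoming` are hypothetical names; the point is only that an incoming entry is rejected when it is strictly older than the existing one, so an event with an equal timestamp still wins.

```typescript
// Hypothetical minimal shape; the real slice entries carry more fields.
interface TimedStatus {
  paneKey: string;
  state: string;
  updatedAt: number;
}

function shouldApplyIncoming(existing: TimedStatus | undefined, incoming: TimedStatus): boolean {
  if (existing === undefined) return true;
  // Strict <: only a strictly older entry is stale. With <=, the second of two
  // events landing in the same millisecond would be dropped.
  const isStale = incoming.updatedAt < existing.updatedAt;
  return !isStale;
}
```
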
…atus-preserve-restart

# Conflicts:
#	src/main/index.ts
#	src/renderer/src/hooks/useIpcEvents.test.ts
#	src/renderer/src/hooks/useIpcEvents.ts
- Defensive `lastStatusByPaneKey.clear()` at top of `hydrateLastStatusFromDisk` keeps repeat-start() calls from silently merging prior-session state.
- When sanitize drops entries (drift, TTL, schema), log a single `[agent-hooks] last-status hydrate dropped N entries (kept M)` warn and synchronously rewrite the file. Pre-fix, stale entries stayed on disk until a fresh hook event triggered a debounced write — users who hadn't run an agent in 8+ days would re-drop the same entries every cold boot.
- Prime `lastWrittenJson` from the raw on-disk bytes (instead of re-serializing) when hydration is lossless — robust against future shape drift in `serializeStatusFile`.
- `LAST_STATUS_FILE_VERSION = 2` comment now records why v1 was skipped (in-flight branch shape).
- IPC test mock uses `vi.importActual` for `isValidPaneKey` so it stays in sync with the real validator.

Co-authored-by: Orca <help@stably.ai>
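
Why priming `lastWrittenJson` from the raw bytes matters is easier to see in code. The sketch below assumes (this is not stated in the PR) that `lastWrittenJson` is used to skip writes whose serialized form matches what is already on disk; `DedupedStatusFile` and its injected `writeFile` callback are hypothetical.

```typescript
class DedupedStatusFile {
  private lastWrittenJson: string | null = null;
  private readonly writeFile: (json: string) => void;

  constructor(writeFile: (json: string) => void) {
    this.writeFile = writeFile;
  }

  // Prime with the exact bytes read from disk, not a re-serialization of them.
  primeFromDisk(rawBytes: string): void {
    this.lastWrittenJson = rawBytes;
  }

  // Returns true when a write happened, false when it was skipped as a no-op.
  persist(snapshot: unknown): boolean {
    const json = JSON.stringify(snapshot);
    if (json === this.lastWrittenJson) return false;
    this.writeFile(json);
    this.lastWrittenJson = json;
    return true;
  }
}
```

If the serializer's output ever changes shape, the first `persist` after hydrate no longer matches the primed bytes, so the file is rewritten in the new shape. Priming from a re-serialized copy would instead match and skip that write.
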
Without this, agent rows the user already visited come back bold every relaunch now that the rows themselves survive restart (per docs/agent-dashboard-retention-restart.md). Hydrate sanitizes input field-by-field (rejects null/non-object/array, prototype-pollution keys, non-finite/non-positive values) and applies a 7-day TTL paralleling HYDRATE_MAX_AGE_MS in agent-hooks/server.ts so hard-quit/crash paths can't grow the persisted map forever.

Co-authored-by: Orca <help@stably.ai>
Doc was a working artifact for this branch; the rationale lives in commit
history and the comments next to the persistence/hydrate code. Scrubs the
three call-site references that named it.

Co-authored-by: Orca <help@stably.ai>
Resolves the agent-hooks refactor collision from #1678 (shared listener +
relay adapter). The refactor extracted listener internals into
`src/shared/agent-hook-listener.ts` and replaced server.ts's module-level
`lastStatusByPaneKey` Map with `state.lastStatusByPaneKey` on a shared
`HookListenerState`. This branch's persistence layer is re-anchored on the
new shape:

- `AgentStatusIpcPayload` now carries `connectionId: string | null` (from
  main) alongside `receivedAt` / `stateStartedAt` (from this branch).
- `src/main/agent-hooks/server.ts` rewritten as the slim adapter
  (~720 LoC) over the shared listener: defines a server-process-only
  `EnrichedAgentHookEventPayload = AgentHookEventPayload & {receivedAt,
  stateStartedAt}` stored in `state.lastStatusByPaneKey` (the shared module
  never reads this map, so the extra fields ride along untouched), keeps
  `last-status.json` v2 hydrate / sanitize / TTL / atomic-write / drop
  semantics. The new HTTP and `ingestRemote` paths both run through
  `attachStatusTiming` before caching.
- `src/main/index.ts` IPC fanout forwards the union: connectionId +
  receivedAt + stateStartedAt + ...payload.
- `src/preload/{api-types,index}.ts` keep the typed `AgentStatusIpcPayload`
  surface (which now subsumes both branches' fields).
- `src/main/agent-hooks/server.test.ts`: persistence tests preserved as-is;
  ingestRemote tests from main relaxed from `toHaveBeenCalledWith({...})` to
  `expect.objectContaining({...})` so the listener's enriched payload doesn't
  fail strict equality.

Verified: pnpm tc:node + tc:web clean; 776/776 vitest tests pass across
agent-hooks, shared listener, relay, IPC handler, and renderer agent-status
slice + ui slice. tc:cli has pre-existing TS6307 errors on origin/main that
are not introduced by this merge.

Co-authored-by: Orca <help@stably.ai>
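
The enriched-payload idea from this merge commit can be sketched as an intersection type plus a timing helper. The real `attachStatusTiming` signature is not shown in this PR, so `attachStatusTimingSketch`, `HookEventSketch`, and the rule "keep stateStartedAt while the state is unchanged" are assumptions made for illustration.

```typescript
// Stand-in for AgentHookEventPayload; the real type has more fields.
interface HookEventSketch {
  paneKey: string;
  state: string;
}

// Mirrors the shape `AgentHookEventPayload & {receivedAt, stateStartedAt}`.
type EnrichedHookEventSketch = HookEventSketch & {
  receivedAt: number;
  stateStartedAt: number;
};

function attachStatusTimingSketch(
  cache: Map<string, EnrichedHookEventSketch>,
  payload: HookEventSketch,
  now: number,
): EnrichedHookEventSketch {
  const previous = cache.get(payload.paneKey);
  // Assumed rule: a same-state ping keeps the original start time, while a
  // state change restarts the clock.
  const stateStartedAt =
    previous !== undefined && previous.state === payload.state
      ? previous.stateStartedAt
      : now;
  const enriched: EnrichedHookEventSketch = { ...payload, receivedAt: now, stateStartedAt };
  cache.set(payload.paneKey, enriched);
  return enriched;
}
```

Because the extra fields are added by spreading the original payload, a consumer that only reads the base fields is unaffected, which is also why strict-equality test assertions on the payload had to be loosened.
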
@brennanb2025 brennanb2025 merged commit 9b8324e into main May 11, 2026
2 checks passed

1 participant