feat(agent-dashboard): persist hook status across Orca restart #1480
Merged
Hydrates the hook server's per-pane lastStatusByPaneKey from userData/agent-hooks/last-status.json before binding the HTTP listener, mirrors mutations to disk via a 250ms trailing debounce, and flushes synchronously on stop(). Renderer dismissals fan out a new agentStatus:drop IPC so the on-disk file evicts the entry and a relaunch cannot resurrect it. Adds a bounded bootstrap queue in useIpcEvents so events replayed by setListener() during window creation are not dropped while App.tsx is still hydrating tabsByWorktree. Gated on settings.experimentalAgentDashboard. Done, blocked, and quiet working rows now all survive across restart.
Co-authored-by: Orca <help@stably.ai>
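The bounded bootstrap queue mentioned above can be pictured roughly as follows. This is a minimal TypeScript sketch, not the code in `useIpcEvents`: the `BootstrapQueue` class, its `setReady()` method, the default cap of 256, and the drop-oldest overflow policy are all assumptions made for illustration.

```typescript
// Hypothetical sketch: buffers events until the renderer store is ready,
// then drains them in arrival order. Not the actual useIpcEvents code.
type Handler<E> = (event: E) => void;

class BootstrapQueue<E> {
  private readonly buffer: E[] = [];
  private ready = false;
  private readonly cap: number;
  private readonly handle: Handler<E>;
  // Number of events discarded because the buffer was full (assumed policy).
  dropped = 0;

  constructor(handle: Handler<E>, cap = 256) {
    this.handle = handle;
    this.cap = cap;
  }

  push(event: E): void {
    if (this.ready) {
      this.handle(event);
      return;
    }
    // Bounded: once full, evict the oldest buffered event (assumption).
    if (this.buffer.length >= this.cap) {
      this.buffer.shift();
      this.dropped++;
    }
    this.buffer.push(event);
  }

  // Pass the current readiness flag (e.g. a workspaceSessionReady-like value);
  // the buffer drains only on the false -> true edge.
  setReady(next: boolean): void {
    const wasReady = this.ready;
    this.ready = next;
    if (!wasReady && next) {
      for (const event of this.buffer.splice(0)) this.handle(event);
    }
  }
}
```

Evicting the oldest event favors the most recent status per pane, which suits a dashboard; the real cap and overflow policy may differ.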
Address review findings on the retention-restart branch:
- Wrap agentStatus:getSnapshot and agentStatus:drop IPC handlers in
try/catch so a throw cannot surface as an unhandled invoke rejection
(silent startup-hydration failure) or crash main from a fire-and-
forget listener.
- runStatusPersist no longer permanently suppresses gate-off deletion
retries on transient unlink errors (e.g. EPERM); deletedOnDisable
now flips only on success or ENOENT.
- Tighten tests: stale-version-hydrate now asserts the warn message
content; getSnapshot test uses toEqual; drop-handler test rejects
null/{}/[] in addition to the prior bad inputs.
Co-authored-by: Orca <help@stably.ai>
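The `deletedOnDisable` fix in the commit above boils down to treating `ENOENT` as success and every other unlink error as retryable. A minimal sketch, assuming synchronous `fs` calls; `tryDeleteStatusFile` and `runGateOffDeletion` are hypothetical names (not the functions inside `runStatusPersist`), and the warn text is illustrative.

```typescript
import { unlinkSync } from "node:fs";

// Hypothetical helper: true when the file is confirmed gone (deleted now or
// already absent); false on any other error (e.g. EPERM) so the caller retries.
function tryDeleteStatusFile(path: string): boolean {
  try {
    unlinkSync(path);
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ENOENT") return true;
    console.warn(`[agent-hooks] could not delete ${path}; will retry`, err);
    return false;
  }
}

let deletedOnDisable = false;

// Called on each persist pass while the feature gate is off.
function runGateOffDeletion(path: string): void {
  if (deletedOnDisable) return;
  // Flip the latch only on success or ENOENT, never on a transient failure.
  deletedOnDisable = tryDeleteStatusFile(path);
}
```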
…aneKey drift
- Drop hydrate entries older than 7 days (HYDRATE_MAX_AGE_MS) so stale rows from worktrees archived weeks ago do not pile up forever. PTY-teardown eviction handles closed panes; the TTL covers daemon-restored PTYs that never re-attach and crash-recovery paths.
- Reject hydrate entries whose `tabId` field diverges from the paneKey's tab segment. Cheap defensive add against future renamer/shape drift.
Doc updated to move TTL out of the follow-ups list (now in scope). Tests: new "drops hydrate entries older than the TTL cutoff" and "drops a hydrate entry whose tabId disagrees with the paneKey prefix"; existing hydrate fixtures now use a `recentTs()` helper instead of fixed 2023 timestamps.
Co-authored-by: Orca <help@stably.ai>
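The two hydrate filters above (7-day TTL and `tabId` drift) can be sketched as a single pass over the parsed entries. Assumptions: the entry shape is reduced to `tabId` and `receivedAt`, and a paneKey is `${tabId}:${paneIndex}` (consistent with keys like `tab-test-1:0` in the test plan, but not confirmed); `sanitizeHydrateEntries` and `tabSegment` are hypothetical names.

```typescript
// Reduced, hypothetical entry shape; the real payload has more fields.
interface HydrateEntry {
  tabId: string;
  receivedAt: number;
}

const HYDRATE_MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

// Assumes paneKey = `${tabId}:${paneIndex}`; the tab segment is everything
// before the last ':'.
function tabSegment(paneKey: string): string | null {
  const i = paneKey.lastIndexOf(":");
  return i > 0 ? paneKey.slice(0, i) : null;
}

function sanitizeHydrateEntries(
  entries: Record<string, HydrateEntry>,
  now: number,
): { kept: Map<string, HydrateEntry>; dropped: number } {
  const cutoff = now - HYDRATE_MAX_AGE_MS;
  const kept = new Map<string, HydrateEntry>();
  let dropped = 0;
  for (const [paneKey, entry] of Object.entries(entries)) {
    // TTL: reject non-finite or older-than-cutoff timestamps.
    const fresh = Number.isFinite(entry.receivedAt) && entry.receivedAt >= cutoff;
    // Drift: the stored tabId must match the paneKey's tab segment.
    const consistent = tabSegment(paneKey) === entry.tabId;
    if (fresh && consistent) {
      kept.set(paneKey, entry);
    } else {
      dropped++;
    }
  }
  return { kept, dropped };
}
```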
Apply review-fix corrections on the agent-dashboard restart-persistence work:
- Split dropStatusEntry from clearPaneState so renderer-driven dismiss IPC no longer wipes lastPromptByPaneKey/lastToolByPaneKey for a still-alive pane.
- Validate paneKey shape at the IPC boundary (isValidPaneKey).
- Let getSnapshot errors propagate instead of silently returning []; this matches the renderer's existing .catch and avoids masking a broken persistence path.
- Trust main's authoritative timing.stateStartedAt unconditionally on same-state pings; fall back to the existing value only when timing is absent.
- Use strict < on the snapshot/live updatedAt guard so two events in the same millisecond don't drop the second one (a <= guard regressed two existing slice tests).
- Don't reset snapshotRequestedForReadyWindow in the catch handler; combined with the per-store-update subscriber it would retry-storm on persistent IPC failure.
- scheduleStatusPersist now resets the timer on each call (true trailing-edge debounce) instead of acting as a leading-edge throttle.
- Fix doc references that named clearPaneState in dismiss/IPC context where the implementation uses dropStatusEntry; add type-level JSDoc on AgentStatusIpcPayload.
109/109 in-scope tests pass.
Co-authored-by: Orca <help@stably.ai>
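The switch from a leading-edge throttle to a true trailing-edge debounce, together with the atomic write and synchronous `stop()` flush this PR describes, might look like the sketch below. It is illustrative only: `DebouncedJsonWriter`, `writeCount`, and the `.tmp` naming are hypothetical; the 250ms delay and `0o600` mode come from the PR text.

```typescript
import { mkdirSync, renameSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

// Hypothetical trailing-edge debounced writer with a synchronous flush.
class DebouncedJsonWriter {
  private timer: ReturnType<typeof setTimeout> | null = null;
  private pending: unknown = undefined;
  private hasPending = false;
  private readonly filePath: string;
  private readonly delayMs: number;
  lastWrittenJson: string | null = null;
  writeCount = 0;

  constructor(filePath: string, delayMs = 250) {
    this.filePath = filePath;
    this.delayMs = delayMs;
  }

  // Each call restarts the timer, so a burst of mutations yields one write
  // delayMs after the *last* call (trailing edge), not the first.
  schedule(value: unknown): void {
    this.pending = value;
    this.hasPending = true;
    if (this.timer !== null) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flushSync(), this.delayMs);
  }

  // Safe to call from stop(): cancels the timer and writes synchronously.
  flushSync(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (!this.hasPending) return;
    this.hasPending = false;
    const json = JSON.stringify(this.pending);
    if (json === this.lastWrittenJson) return; // skip no-op writes
    mkdirSync(dirname(this.filePath), { recursive: true });
    // Write a temp file, then rename over the target so readers never see a
    // half-written file (rename is atomic on POSIX within one filesystem).
    const tmp = `${this.filePath}.tmp`;
    writeFileSync(tmp, json, { mode: 0o600 });
    renameSync(tmp, this.filePath);
    this.lastWrittenJson = json;
    this.writeCount++;
  }
}
```

Resetting the timer on every call is what distinguishes a debounce from a throttle: a throttle would write on the first mutation of a burst and could persist a stale intermediate state.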
…atus-preserve-restart
# Conflicts:
# src/main/index.ts
# src/renderer/src/hooks/useIpcEvents.test.ts
# src/renderer/src/hooks/useIpcEvents.ts
- Defensive `lastStatusByPaneKey.clear()` at the top of `hydrateLastStatusFromDisk` keeps repeat `start()` calls from silently merging prior-session state.
- When sanitize drops entries (drift, TTL, schema), log a single `[agent-hooks] last-status hydrate dropped N entries (kept M)` warn and synchronously rewrite the file. Pre-fix, stale entries stayed on disk until a fresh hook event triggered a debounced write, so users who hadn't run an agent in 8+ days would re-drop the same entries on every cold boot.
- Prime `lastWrittenJson` from the raw on-disk bytes (instead of re-serializing) when hydration is lossless; this is robust against future shape drift in `serializeStatusFile`.
- The `LAST_STATUS_FILE_VERSION = 2` comment now records why v1 was skipped (in-flight branch shape).
- IPC test mock uses `vi.importActual` for `isValidPaneKey` so it stays in sync with the real validator.
Co-authored-by: Orca <help@stably.ai>
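The hydrate flow in the list above (clear first; on a lossy sanitize, warn once and rewrite synchronously; on a lossless one, prime `lastWrittenJson` from the raw bytes) could be sketched like this. Assumptions: the v2 envelope is `{ version, entries }`, `sanitize` stands in for the drift/TTL/schema filters, `hydrateLastStatusSketch` is a hypothetical name, the warn strings approximate those quoted in the test plan, and the rewrite uses a plain `writeFileSync` rather than the atomic write.

```typescript
import { readFileSync, writeFileSync } from "node:fs";

const LAST_STATUS_FILE_VERSION = 2;

// Hypothetical: returns the JSON string to prime the debounced writer's
// lastWrittenJson with, or null when nothing usable was on disk.
function hydrateLastStatusSketch(
  path: string,
  target: Map<string, unknown>,
  sanitize: (entries: Record<string, unknown>) => Record<string, unknown>,
): string | null {
  target.clear(); // a repeat start() must not merge prior-session state
  let raw: string;
  try {
    raw = readFileSync(path, "utf8");
  } catch {
    return null; // no file yet
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    console.warn("[agent-hooks] last-status file is not valid JSON; ignoring");
    return null;
  }
  const envelope = (parsed ?? {}) as { version?: unknown; entries?: unknown };
  if (envelope.version !== LAST_STATUS_FILE_VERSION) {
    console.warn(
      `[agent-hooks] version mismatch (${String(envelope.version)} != ${LAST_STATUS_FILE_VERSION}); ignoring`,
    );
    return null;
  }
  const entries = envelope.entries;
  if (entries === null || typeof entries !== "object" || Array.isArray(entries)) {
    return null;
  }
  const input = entries as Record<string, unknown>;
  const kept = sanitize(input);
  for (const [paneKey, entry] of Object.entries(kept)) target.set(paneKey, entry);
  const keptCount = Object.keys(kept).length;
  const dropped = Object.keys(input).length - keptCount;
  if (dropped === 0) return raw; // lossless: prime from the exact on-disk bytes
  console.warn(`[agent-hooks] last-status hydrate dropped ${dropped} entries (kept ${keptCount})`);
  const json = JSON.stringify({ version: LAST_STATUS_FILE_VERSION, entries: kept });
  writeFileSync(path, json, { mode: 0o600 }); // purge dropped entries now
  return json;
}
```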
Without this, agent rows the user already visited come back bold on every relaunch now that the rows themselves survive restart (per docs/agent-dashboard-retention-restart.md). Hydrate sanitizes input field-by-field (rejects null/non-object/array, prototype-pollution keys, non-finite/non-positive values) and applies a 7-day TTL paralleling HYDRATE_MAX_AGE_MS in agent-hooks/server.ts so hard-quit/crash paths can't grow the persisted map forever.
Co-authored-by: Orca <help@stably.ai>
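A field-by-field sanitizer matching that description might look like the following. It is a sketch, not the shipped hydrate code: `sanitizeAcknowledged` and `ACK_MAX_AGE_MS` are hypothetical names, and values are assumed to be epoch-millisecond timestamps. Building a fresh object and skipping `__proto__`/`constructor`/`prototype` keeps a hostile `orca-data.json` from touching `Object.prototype`.

```typescript
const ACK_MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000; // parallels HYDRATE_MAX_AGE_MS
const FORBIDDEN_KEYS = new Set(["__proto__", "constructor", "prototype"]);

function sanitizeAcknowledged(raw: unknown, now: number): Record<string, number> {
  const out: Record<string, number> = {};
  // Reject null, primitives, and arrays outright.
  if (raw === null || typeof raw !== "object" || Array.isArray(raw)) return out;
  const cutoff = now - ACK_MAX_AGE_MS;
  for (const [paneKey, value] of Object.entries(raw)) {
    if (FORBIDDEN_KEYS.has(paneKey)) continue; // prototype-pollution keys
    // Only finite, positive numbers are plausible timestamps.
    if (typeof value !== "number" || !Number.isFinite(value) || value <= 0) continue;
    if (value < cutoff) continue; // older than the 7-day TTL
    out[paneKey] = value;
  }
  return out;
}
```

Note that `JSON.parse` creates `"__proto__"` as an ordinary own property, so the explicit key check is what prevents `out["__proto__"] = …` from rewriting the result's prototype.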
Doc was a working artifact for this branch; the rationale lives in commit history and the comments next to the persistence/hydrate code. Scrubs the three call-site references that named it.
Co-authored-by: Orca <help@stably.ai>
Resolves the agent-hooks refactor collision from #1678 (shared listener + relay adapter). The refactor extracted listener internals into `src/shared/agent-hook-listener.ts` and replaced server.ts's module-level `lastStatusByPaneKey` Map with `state.lastStatusByPaneKey` on a shared `HookListenerState`. This branch's persistence layer is re-anchored on the new shape:
- `AgentStatusIpcPayload` now carries `connectionId: string | null` (from main) alongside `receivedAt` / `stateStartedAt` (from this branch).
- `src/main/agent-hooks/server.ts` is rewritten as the slim adapter (~720 LoC) over the shared listener. It defines a server-process-only `EnrichedAgentHookEventPayload = AgentHookEventPayload & {receivedAt, stateStartedAt}` stored in `state.lastStatusByPaneKey` (the shared module never reads this map, so the extra fields ride along untouched) and keeps the `last-status.json` v2 hydrate / sanitize / TTL / atomic-write / drop semantics. The new HTTP and `ingestRemote` paths both run through `attachStatusTiming` before caching.
- `src/main/index.ts` IPC fanout forwards the union: connectionId + receivedAt + stateStartedAt + ...payload.
- `src/preload/{api-types,index}.ts` keep the typed `AgentStatusIpcPayload` surface (which now subsumes both branches' fields).
- `src/main/agent-hooks/server.test.ts`: persistence tests preserved as-is; ingestRemote tests from main relaxed from `toHaveBeenCalledWith({...})` to `expect.objectContaining({...})` so the listener's enriched payload doesn't fail strict equality.
Verified: pnpm tc:node + tc:web clean; 776/776 vitest tests pass across agent-hooks, shared listener, relay, IPC handler, and renderer agent-status slice + ui slice. tc:cli has pre-existing TS6307 errors on origin/main that are not introduced by this merge.
Co-authored-by: Orca <help@stably.ai>
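The enriched payload type and the `attachStatusTiming` step can be illustrated with a reduced payload. Everything here beyond the type's two timing fields is an assumption: the payload is cut down to hypothetical `paneKey` and `state` fields, `cacheEvent` is a hypothetical helper, and `stateStartedAt` is assumed to carry over on same-state events and reset on a state change (the PR does not spell out the exact rule).

```typescript
// Reduced, hypothetical payload; the real AgentHookEventPayload has more fields.
interface AgentHookEventPayload {
  paneKey: string;
  state: "working" | "blocked" | "done";
}

// Server-process-only enrichment, stored in the lastStatusByPaneKey cache.
type EnrichedAgentHookEventPayload = AgentHookEventPayload & {
  receivedAt: number;
  stateStartedAt: number;
};

function attachStatusTiming(
  payload: AgentHookEventPayload,
  previous: EnrichedAgentHookEventPayload | undefined,
  now: number,
): EnrichedAgentHookEventPayload {
  // Keep the original start time while the pane stays in the same state.
  const stateStartedAt =
    previous !== undefined && previous.state === payload.state
      ? previous.stateStartedAt
      : now;
  return { ...payload, receivedAt: now, stateStartedAt };
}

// Both the HTTP path and ingestRemote would funnel through something like this.
const lastStatusByPaneKey = new Map<string, EnrichedAgentHookEventPayload>();

function cacheEvent(
  payload: AgentHookEventPayload,
  now: number = Date.now(),
): EnrichedAgentHookEventPayload {
  const enriched = attachStatusTiming(payload, lastStatusByPaneKey.get(payload.paneKey), now);
  lastStatusByPaneKey.set(payload.paneKey, enriched);
  return enriched;
}
```

Because the intersection type only adds fields, code typed against `AgentHookEventPayload` (such as a shared listener that never reads the cache) keeps compiling unchanged.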
Summary
- Persists `lastStatusByPaneKey` to `userData/agent-hooks/last-status.json` (atomic write, 250ms trailing debounce, sync flush on `stop()`). Fixes `done`, `blocked`, and quiet `working` rows winking out after restart.
- Adds an `agentStatus:drop` IPC so renderer dismissals (`dropAgentStatus`, `dismissRetainedAgentsByWorktree`) propagate to the main-process cache and the on-disk file, preventing dismissed rows from resurrecting on relaunch.
- Adds a bounded bootstrap queue in `useIpcEvents` so `setListener()` replay during window creation isn't dropped while `App.tsx` is still hydrating `tabsByWorktree`; the queue drains on the `workspaceSessionReady` false→true transition.
- When hydrate drops entries, synchronously rewrites the file and logs a single `[agent-hooks] last-status hydrate dropped N entries (kept M)` warn for visibility. Pre-fix, corrupt/stale entries stayed on disk until a fresh hook event triggered a debounced write.
- `main`: persistence/hydration are unconditional (the old `experimentalAgentDashboard` runtime gate is gone and only remains in persistence migration code). `agentStatus:set` is forwarded from main unconditionally with `receivedAt` and `stateStartedAt`.
Test plan
- `pnpm typecheck` (`pnpm run tc:node`, `pnpm run tc:web`)
- `pnpm test src/main/agent-hooks/server.test.ts` (persistence/hydration tests)
- `pnpm test src/main/ipc/agent-hooks.test.ts` (IPC handler tests)
- `pnpm test src/renderer/src/store/slices/agent-status-drop-ipc.test.ts` (slice IPC fan-out)
- `pnpm test src/renderer/src/hooks/agent-status-bootstrap-queue.test.ts` (queue/drain/cap)
- `useIpcEvents.test.ts` mocks updated for `useAppStore.subscribe`
- `server.test.ts` + `agent-hooks.test.ts`: 80/80 pass after hydrate-cleanup fix
Manual: verified via synthetic hook POSTs + CDP introspection on a dev build
Driven by sending real Claude-shaped payloads to the loopback hook server, then reading `last-status.json` and `window.api.agentStatus.getSnapshot()` to confirm behavior. Quit/relaunch was the actual quit/relaunch of an Electron dev instance.
- `POST /hook/claude` with a `Stop` payload writes a v2 envelope at `last-status.json` (mode `0o600`); `getSnapshot()` returns the entry with `receivedAt`/`stateStartedAt`.
- `agentStatus:drop` IPC: renderer call removes the entry from both the snapshot and the on-disk file within ~250ms.
- `done` entry + live `UserPromptSubmit` POST for the same paneKey → entry transitions to `working` with the new prompt and fresh `receivedAt`; on-disk file follows; no duplicate.
- Invalid JSON in `last-status.json` → `[agent-hooks] last-status file is not valid JSON; ignoring` warn; empty snapshot; dashboard renders normally.
- `"version": 1` envelope → `version mismatch (1 != 2); ignoring`; empty snapshot.
- `tabId`/`paneKey` drift rejected (with disk cleanup): file with one good entry + one entry whose `tabId` disagrees with the paneKey prefix → only the good one hydrates; drift entry purged from `last-status.json` synchronously during hydrate; single `dropped 1 entries (kept 1)` warn.
- 5 entries with `receivedAt` 10 days old alongside 5 fresh ones → only the 5 fresh entries hydrate (`HYDRATE_MAX_AGE_MS = 7d`); 5 stale entries purged from `last-status.json` synchronously during hydrate; single `dropped 5 entries (kept 5)` warn.
Manual: `acknowledgedAgentsByPaneKey` persistence + hydrate sanitizer (this branch's last two commits)
Verified the latest changes (`d6827bd3` + `bff99897`) on a dev build with an isolated `ORCA_DEV_USER_DATA_PATH=/tmp/orca-ack-restart-test`, driving `acknowledgeAgents` and the persistence pipeline through CDP and inspecting `orca-data.json` directly.
- `getDefaultUIState()` writes `ui.acknowledgedAgentsByPaneKey: {}` to `orca-data.json`; in-memory map matches.
- `acknowledgeAgents(['tab-test-1:0', 'tab-test-2:0', 'tab-test-3:1'])` → after the 150ms `App.tsx` debounce + 300ms persistence debounce, all 3 keys land under `ui.acknowledgedAgentsByPaneKey` with valid timestamps. Confirms the new field is wired into the existing `window.api.ui.set` effect.
- `hydratePersistedUI` restores the exact same 3 keys with their pre-quit timestamps.
- Edited `orca-data.json` to inject a mix of valid + malicious entries (TTL-expired @ 8d, negative, zero, non-numeric, `__proto__`/`constructor`/`prototype` keys, plus 2 valid entries within the 7d TTL). After relaunch, only the 2 valid entries hydrated. No `Object.prototype` pollution from the `__proto__` key.
- Forced a `ui:set` write to overwrite `orca-data.json` with only the sanitized + new entries; all injected garbage was gone from disk.
Manual: still requires a real agent to verify end-to-end (not yet done)
- `done` agent restart: row reappears within the first dashboard frame.
- `blocked` Claude restart: row reappears with prompt + tool name.
- `working` restart: last-known row reappears; updates naturally on the next event.
- `last-status.json` across restart.
Made with Orca 🐋