Parent: #204 | Phase 1: Single Rig, Single Polecat
Goal
Fix three reliability issues discovered during local dev testing that cause the agent dispatch loop to retry indefinitely, waste resources, and prevent agents from completing real work.
Context
During end-to-end testing of the Mayor chat flow (PR 7 → PR 6 → PR 5 → PR 4), we found that while the full pipeline works (alarm fires → container dispatch → kilo serve → session creation → prompt delivery), agents exit immediately and the system enters an infinite retry loop. Three issues need to be addressed before PR 8 (Manual Merge Flow) can work correctly.
Issues
1. No API credentials passed to kilo serve
The container starts kilo serve without KILO_API_URL or any API key, so the kilo session has no way to call an LLM. The agent starts a session, sends the prompt, but the session completes instantly with no useful work because there are no model credentials.
Fix: The startAgentInContainer flow needs to pass KILO_API_URL (and any required auth) through to the container's buildAgentEnv(). The Rig DO config or worker environment should supply these values so kilo serve can route LLM calls through the Kilo gateway.
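A minimal sketch of the fix, assuming the Rig DO config carries the gateway values (the `RigConfig` field names here are illustrative, not the real schema; only `buildAgentEnv` and `KILO_API_URL` come from the issue):

```typescript
// Hypothetical shape of the credentials the Rig DO or worker env supplies.
interface RigConfig {
  kiloApiUrl: string;
  kiloApiKey: string;
}

// Merge Kilo gateway credentials into the env handed to the kilo serve
// process, failing fast rather than starting a session with no model access.
function buildAgentEnv(
  base: Record<string, string>,
  config: RigConfig,
): Record<string, string> {
  if (!config.kiloApiUrl || !config.kiloApiKey) {
    throw new Error("missing Kilo gateway credentials in rig config");
  }
  return {
    ...base,
    KILO_API_URL: config.kiloApiUrl,
    KILO_API_KEY: config.kiloApiKey,
  };
}
```

Throwing here surfaces the misconfiguration at dispatch time instead of as an instantly-exiting session, which is what made issue 1 hard to spot.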
2. No retry limit / circuit breaker for agent dispatch
When an agent exits immediately (e.g., due to missing credentials or a crash), the alarm loop runs indefinitely:
- `witnessPatrol` sees agent container status = `exited` → resets agent to `idle`
- `schedulePendingWork` finds idle agent with hooked bead → re-dispatches → 201 success
- 30s later, agent has exited again → repeat forever
There is no max retry count, backoff, or circuit breaker. The system creates dozens of kilo sessions per minute in the container, all exiting immediately.
Fix: Track dispatch attempts per agent (or per bead). After N consecutive failed dispatches (e.g., agent exits within a short window), mark the bead as failed and stop retrying. Optionally create an escalation. Consider exponential backoff before the hard limit.
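One way to sketch the attempt tracking, assuming the counter lives in the Rig DO's storage keyed by bead (the constants and function name are assumptions, not existing config):

```typescript
// Assumed limits; the issue suggests making these configurable.
const MAX_ATTEMPTS = 5;
const BASE_DELAY_MS = 30_000;

interface DispatchState {
  attempts: number; // consecutive failed dispatches for this bead
}

// What the alarm loop should do on the next tick for this bead:
// a backoff delay in ms, or null meaning the circuit is open —
// mark the bead failed and stop retrying (optionally escalate).
function nextDispatchDelay(state: DispatchState): number | null {
  if (state.attempts >= MAX_ATTEMPTS) {
    return null;
  }
  // Exponential backoff: 30s, 60s, 120s, 240s, 480s.
  return BASE_DELAY_MS * 2 ** state.attempts;
}
```

The counter would reset to zero whenever an agent survives past the "short window" mentioned above, so only consecutive immediate exits trip the breaker.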
3. Agent completion does not close the bead
When an agent's session completes (detected via SSE isCompletionEvent), the process-manager.ts sets agent.status = 'exited' and agent.exitReason = 'completed', but nothing calls back to the Rig DO to transition the bead from in_progress to closed. The agent exits, witnessPatrol finds it, resets the agent to idle — but the bead stays in_progress with the agent still hooked, so schedulePendingWork re-dispatches.
In the normal flow (PR 8), gt_done handles this via agentDone(). But for the Mayor agent (which may complete without calling gt_done), there needs to be a mechanism to detect completion and close the bead. Options:
- The container's process manager could call a Rig DO endpoint on agent completion (e.g., `POST /api/rigs/:rigId/agents/:agentId/done`)
- `witnessPatrol` could detect `exitReason = 'completed'` and auto-close the bead
- The heartbeat mechanism could report completion state back to the DO
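The first option could look roughly like this. The endpoint path comes from the issue text; the payload shape and helper name are assumptions:

```typescript
// Hypothetical payload reported by the container's process manager.
interface CompletionReport {
  exitReason: "completed" | "error";
}

// Build the callback request to the Rig DO's done endpoint.
function buildDoneRequest(
  baseUrl: string,
  rigId: string,
  agentId: string,
  report: CompletionReport,
): Request {
  return new Request(`${baseUrl}/api/rigs/${rigId}/agents/${agentId}/done`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(report),
  });
}

// On isCompletionEvent, process-manager.ts would then do something like:
//   await fetch(buildDoneRequest(rigDoUrl, rigId, agentId, { exitReason: "completed" }));
// and the DO handler would transition the bead in_progress → closed.
```

This keeps the bead transition in the Rig DO (the same place `agentDone()` lands in the PR 8 flow), so the Mayor path and the `gt_done` path converge on one handler.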
Dependencies
- PR 5 (Rig DO Alarm — `schedulePendingWork`, `witnessPatrol`) — [Gastown] PR 5: Rig DO Alarm — Work Scheduler #212
- PR 5.5 (Container — kilo serve adoption) — [Gastown] PR 5.5: Container — Adopt `kilo serve` for Agent Management #305
- PR 6 (tRPC Routes — `sendMessage`) — [Gastown] PR 6: tRPC Routes — Town & Rig Management #268
- PR 7 (Dashboard UI — Mayor chat) — [Gastown] PR 7: Basic Dashboard UI #213
Acceptance Criteria
- `KILO_API_URL` and auth credentials are passed through to kilo serve processes in the container
- Agent dispatch has a retry limit (configurable, e.g., 5 attempts). After exceeding the limit, the bead is marked as `failed`
- When an agent session completes, the bead is transitioned to `closed` (or `failed` if the session errored)
- The infinite alarm retry loop no longer occurs for agents that consistently fail to start or complete immediately
- `witnessPatrol` or the container reports agent completion back to the Rig DO