[Gastown] PR 7.5: Agent Lifecycle & Container Reliability Fixes #335

@jrf0110

Description

Parent: #204 | Phase 1: Single Rig, Single Polecat

Goal

Fix three reliability issues discovered during local dev testing that cause the agent dispatch loop to retry indefinitely, waste resources, and prevent agents from completing real work.

Context

During end-to-end testing of the Mayor chat flow (PR 7 → PR 6 → PR 5 → PR 4), we found that while the full pipeline works (alarm fires → container dispatch → kilo serve → session creation → prompt delivery), agents exit immediately and the system enters an infinite retry loop. Three issues need to be addressed before PR 8 (Manual Merge Flow) can work correctly.

Issues

1. No API credentials passed to kilo serve

The container starts kilo serve without KILO_API_URL or any API key, so the kilo session has no way to call an LLM. The agent starts a session and sends the prompt, but the session completes instantly with no useful work because there are no model credentials.

Fix: The startAgentInContainer flow needs to pass KILO_API_URL (and any required auth) through to the container's buildAgentEnv(). The Rig DO config or worker environment should supply these values so kilo serve can route LLM calls through the Kilo gateway.
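A minimal sketch of what the credential pass-through could look like. The WorkerEnv shape, the KILO_API_KEY name, and the fail-fast check are assumptions for illustration; only KILO_API_URL and buildAgentEnv come from the issue text.

```typescript
// Hypothetical env-merging step for buildAgentEnv(). The worker/Rig DO
// config supplies the Kilo gateway values; failing fast at dispatch time
// avoids starting kilo serve with no credentials and exiting immediately.
interface WorkerEnv {
  KILO_API_URL?: string;
  KILO_API_KEY?: string; // illustrative name; actual auth var may differ
}

function buildAgentEnv(
  base: Record<string, string>,
  env: WorkerEnv
): Record<string, string> {
  if (!env.KILO_API_URL) {
    // Surface the misconfiguration here rather than as an instant agent exit.
    throw new Error('KILO_API_URL is not configured for this rig');
  }
  return {
    ...base,
    KILO_API_URL: env.KILO_API_URL,
    ...(env.KILO_API_KEY ? { KILO_API_KEY: env.KILO_API_KEY } : {}),
  };
}
```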

2. No retry limit / circuit breaker for agent dispatch

When an agent exits immediately (e.g., due to missing credentials or a crash), the alarm loop runs indefinitely:

  1. witnessPatrol sees agent container status = exited → resets agent to idle
  2. schedulePendingWork finds idle agent with hooked bead → re-dispatches → 201 success
  3. 30s later, agent has exited again → repeat forever

There is no max retry count, backoff, or circuit breaker. The system creates dozens of kilo sessions per minute in the container, all exiting immediately.

Fix: Track dispatch attempts per agent (or per bead). After N consecutive failed dispatches (e.g., agent exits within a short window), mark the bead as failed and stop retrying. Optionally create an escalation. Consider exponential backoff before the hard limit.
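The retry-limit-plus-backoff idea above can be sketched as a small state machine. The constants, field names, and storage location (per-bead state in the Rig DO) are assumptions, not the existing Gastown code.

```typescript
// Hypothetical dispatch circuit breaker. Attempt counts would live in the
// Rig DO's storage keyed by bead (or agent) id.
const MAX_ATTEMPTS = 5;        // configurable hard limit from the criteria
const BASE_BACKOFF_MS = 30_000; // matches the 30s alarm cadence

interface DispatchState {
  attempts: number;
  nextEligibleAt: number; // epoch ms
}

// Record a failed dispatch (agent exited within a short window) and compute
// exponential backoff: 30s, 60s, 120s, ... until the hard limit trips.
function recordFailedDispatch(
  state: DispatchState,
  now: number
): DispatchState & { giveUp: boolean } {
  const attempts = state.attempts + 1;
  const backoff = BASE_BACKOFF_MS * 2 ** (attempts - 1);
  return {
    attempts,
    nextEligibleAt: now + backoff,
    giveUp: attempts >= MAX_ATTEMPTS, // mark the bead failed, stop retrying
  };
}

// schedulePendingWork would consult this before re-dispatching.
function canDispatch(state: DispatchState, now: number): boolean {
  return state.attempts < MAX_ATTEMPTS && now >= state.nextEligibleAt;
}
```

On `giveUp`, the Rig DO would mark the bead failed and optionally create an escalation, as described above.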

3. Agent completion does not close the bead

When an agent's session completes (detected via the SSE isCompletionEvent), process-manager.ts sets agent.status = 'exited' and agent.exitReason = 'completed', but nothing calls back to the Rig DO to transition the bead from in_progress to closed. The agent exits, witnessPatrol finds it and resets the agent to idle, yet the bead stays in_progress with the agent still hooked, so schedulePendingWork re-dispatches.

In the normal flow (PR 8), gt_done handles this via agentDone(). But for the Mayor agent (which may complete without calling gt_done), there needs to be a mechanism to detect completion and close the bead. Options:

  • The container's process manager could call a Rig DO endpoint on agent completion (e.g., POST /api/rigs/:rigId/agents/:agentId/done)
  • witnessPatrol could detect exitReason = 'completed' and auto-close the bead
  • The heartbeat mechanism could report completion state back to the DO
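Whichever of the options above is chosen, the mapping from exit state to bead transition is the same. A sketch of that mapping as a pure function; the ExitReason/BeadStatus unions reuse values from the issue text, while 'errored' and 'killed' are assumed additional reasons, and the DO endpoint call itself is elided.

```typescript
// Hypothetical decision function the Rig DO (or witnessPatrol) could apply
// when it learns an agent has exited.
type ExitReason = 'completed' | 'errored' | 'killed';
type BeadStatus = 'in_progress' | 'closed' | 'failed';

function beadTransitionOnExit(exitReason: ExitReason): BeadStatus {
  switch (exitReason) {
    case 'completed':
      return 'closed'; // session finished normally -> close the bead
    case 'errored':
      return 'failed'; // session errored -> mark failed, stop re-dispatching
    default:
      return 'in_progress'; // e.g. killed: leave for the retry/circuit-breaker path
  }
}
```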

Dependencies

Acceptance Criteria

  • KILO_API_URL and auth credentials are passed through to kilo serve processes in the container
  • Agent dispatch has a retry limit (configurable, e.g., 5 attempts). After exceeding the limit, the bead is marked as failed
  • When an agent session completes, the bead is transitioned to closed (or failed if the session errored)
  • The infinite alarm retry loop no longer occurs for agents that consistently fail to start or complete immediately
  • witnessPatrol or the container reports agent completion back to the Rig DO
