Description
Parent: #204 | Phase 4: Hardening
Revised: Edge cases updated for container-per-town model (container OOM, ephemeral disk, process-level isolation).
Goal
Handle edge cases and failure modes gracefully.
Edge Cases
- Split-brain: Two processes for the same agent (race on restart) → Rig DO enforces single-writer per agent; container checks DO state before starting
- Concurrent writes to same bead: SQLite serialization in DO handles this, but add optimistic locking for cross-DO operations
- DO eviction during alarm: Alarms are durable and will re-fire
- Container OOM: Kills all agents. DO alarms detect dead agents, new container starts, agents re-dispatched from DO state
- Container sleep during active work: Agents must have pushed to remote. DO re-dispatches on wake. Checkpoint data in DO enables resumption
- Gateway outage: Agent retries built into Kilo CLI; escalation if persistent
- Partial agentDone: What if the polecat pushed the branch but the gt_done call failed? Checkpoint-based recovery
- Duplicate mail delivery: Idempotency on mail delivery marking
- Convoy with failed beads: Policy for partial convoy completion
- Git worktree conflicts: Two agents accidentally assigned same branch → Rig DO enforces unique branch per agent
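The split-brain and worktree-conflict cases both reduce to a uniqueness check in the Rig DO before dispatch. A minimal sketch of that check follows, assuming an in-memory table; `DispatchRegistry`, `DispatchRecord`, and the reason strings are hypothetical names for illustration, not the actual Rig DO API. In a real DO the same check-then-write would presumably run inside a single request handler against DO storage.

```typescript
// Hypothetical model of the Rig DO's dispatch table (names are illustrative).
interface DispatchRecord {
  agentId: string;
  branch: string;
  processId: string; // identifies the container process that owns the agent
}

type DispatchResult =
  | { ok: true; record: DispatchRecord }
  | { ok: false; reason: "agent_already_running" | "branch_in_use" };

class DispatchRegistry {
  private byAgent = new Map<string, DispatchRecord>();
  private byBranch = new Map<string, string>(); // branch -> agentId

  // Reject the dispatch if the agent already has a writer or the branch is taken.
  dispatch(agentId: string, branch: string, processId: string): DispatchResult {
    if (this.byAgent.has(agentId)) {
      return { ok: false, reason: "agent_already_running" };
    }
    if (this.byBranch.has(branch)) {
      return { ok: false, reason: "branch_in_use" };
    }
    const record: DispatchRecord = { agentId, branch, processId };
    this.byAgent.set(agentId, record);
    this.byBranch.set(branch, agentId);
    return { ok: true, record };
  }

  // Called when an agent finishes or is detected dead, freeing it for re-dispatch.
  release(agentId: string): void {
    const record = this.byAgent.get(agentId);
    if (!record) return;
    this.byAgent.delete(agentId);
    this.byBranch.delete(record.branch);
  }
}
```

Under this sketch, a container that restarts and races to start the same agent gets `agent_already_running` and should not spawn the process; after an OOM, the alarm path would call `release` for each dead agent before re-dispatching.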
Dependencies
- PR 5 (Rig DO Alarm — witness patrol)
- PR 10 (Multiple Polecats)
Acceptance Criteria
- Single-writer enforcement per agent (reject duplicate dispatch)
- Container OOM recovery flow tested (DO re-dispatches all agents)
- Optimistic locking for cross-DO operations
- Checkpoint-based recovery for partial done flows
- Idempotent mail delivery
- Convoy partial completion policy implemented
- All edge cases documented with test coverage
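Two of the criteria above, optimistic locking and idempotent mail delivery, can be sketched with a version counter and a delivered-ID set. This is an illustration under stated assumptions: `BeadStore`, `Mailbox`, the `version` field, and the method names are hypothetical, and real state would live in DO SQLite storage instead of in-memory maps.

```typescript
// Hypothetical bead shape; the version field is an assumption for this sketch.
interface Bead {
  id: string;
  status: string;
  version: number;
}

class BeadStore {
  private beads = new Map<string, Bead>();

  put(bead: Bead): void {
    this.beads.set(bead.id, { ...bead });
  }

  get(id: string): Bead | undefined {
    const bead = this.beads.get(id);
    return bead ? { ...bead } : undefined;
  }

  // Compare-and-set: apply the update only if the caller read the latest version.
  updateIfVersion(id: string, expectedVersion: number, status: string): boolean {
    const current = this.beads.get(id);
    if (!current || current.version !== expectedVersion) return false;
    this.beads.set(id, { ...current, status, version: current.version + 1 });
    return true;
  }
}

class Mailbox {
  private delivered = new Set<string>();

  // Returns true only the first time a message ID is marked delivered.
  markDelivered(messageId: string): boolean {
    if (this.delivered.has(messageId)) return false;
    this.delivered.add(messageId);
    return true;
  }
}
```

A cross-DO caller would read a bead, do its work, then call `updateIfVersion` with the version it read; a `false` result means another writer got there first and the caller should re-read and retry. Repeated `markDelivered` calls for the same message are harmless, which makes retries after a failed acknowledgement safe.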