Description
Parent: #204 | Phase 3: Multi-Rig + Scaling
Goal
Handle the ephemeral disk problem. When a container sleeps or dies, in-flight state must be recoverable from DO state and remote git branches.
Background
Cloudflare Containers have ephemeral disk — when a container sleeps or restarts, all filesystem state (git repos, worktrees, node_modules) is lost. Since all coordination state lives in DOs, the main recovery concern is git state.
Strategy
1. Git State Recovery
On container start, the control server reads Rig DO state to determine which rigs need repos cloned and which agents need worktrees:
Container starts → control server boots
  → Reads rig registry from Town DO
  → For each rig with active agents:
      → Clone repo (or pull if warm)
      → Create worktrees for active agent branches (branches exist on remote)
  → Report ready to DO
  → DO alarm dispatches pending agents
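The startup sequence above could be planned as a pure function before any git command runs. This is a minimal sketch, not the actual control server: the `RigRecord` and `AgentRecord` shapes, the `warmRigIds` set, and the `planRecovery` name are all assumptions made for illustration, since this issue does not specify the DO schemas.

```typescript
// Hypothetical record shapes; the real Town DO / Rig DO schemas are not defined in this issue.
interface AgentRecord {
  id: string;
  branch: string; // branch assumed to already exist on the remote
  active: boolean;
}

interface RigRecord {
  id: string;
  repoUrl: string;
  agents: AgentRecord[];
}

type RecoveryStep =
  | { kind: "clone"; rigId: string; repoUrl: string }
  | { kind: "pull"; rigId: string }
  | { kind: "worktree"; rigId: string; agentId: string; branch: string }
  | { kind: "report_ready" };

// Turns the rig registry into an ordered list of recovery steps.
// warmRigIds: rigs whose repo is still on disk (the container did not lose it).
function planRecovery(
  rigs: RigRecord[],
  warmRigIds: ReadonlySet<string>,
): RecoveryStep[] {
  const steps: RecoveryStep[] = [];
  for (const rig of rigs) {
    const active = rig.agents.filter((agent) => agent.active);
    if (active.length === 0) continue; // rigs without active agents are skipped

    // Clone on a cold disk, pull if the repo survived.
    steps.push(
      warmRigIds.has(rig.id)
        ? { kind: "pull", rigId: rig.id }
        : { kind: "clone", rigId: rig.id, repoUrl: rig.repoUrl },
    );

    // One worktree per active agent branch.
    for (const agent of active) {
      steps.push({
        kind: "worktree",
        rigId: rig.id,
        agentId: agent.id,
        branch: agent.branch,
      });
    }
  }
  // Reporting ready is what lets the DO alarm dispatch pending agents.
  steps.push({ kind: "report_ready" });
  return steps;
}
```

Keeping the plan separate from execution would make the sequence testable without a container, and the same step list could feed a progress report.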
2. Uncommitted Work Protection
Agents should commit and push frequently. The polecat system prompt instructs:
- Commit after meaningful progress (not just at gt_done)
- Push branch to remote after each commit
- Use gt_checkpoint to write recovery metadata to the DO
3. Checkpoint/Restore via DO
The gt_checkpoint tool writes JSON to the DO's agent record. On restart, gt_prime includes the checkpoint in the agent's context so it can resume from where it left off.
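The checkpoint round trip might look roughly like the sketch below. It uses an in-memory `Map` as a stand-in for DO storage, and the `StoredAgent` shape, the function names, and the wording of the context text are assumptions for illustration, not the actual gt_checkpoint or gt_prime implementations.

```typescript
// Hypothetical agent record; the real DO schema is not specified in this issue.
interface StoredAgent {
  id: string;
  checkpoint: string | null; // JSON-encoded recovery metadata
  checkpointedAt: number | null; // epoch milliseconds
}

// Stand-in for what gt_checkpoint does: serialize the payload onto the agent record.
function writeCheckpoint(
  store: Map<string, StoredAgent>,
  agentId: string,
  data: unknown,
  now: number = Date.now(),
): void {
  const existing: StoredAgent =
    store.get(agentId) ?? { id: agentId, checkpoint: null, checkpointedAt: null };
  // JSON.stringify returns undefined for some inputs (e.g. undefined); store "null" instead.
  const json: string = JSON.stringify(data) ?? "null";
  store.set(agentId, { ...existing, checkpoint: json, checkpointedAt: now });
}

// Stand-in for the part of gt_prime that surfaces the checkpoint after a restart.
function buildPrimeContext(
  store: Map<string, StoredAgent>,
  agentId: string,
): string {
  const agent = store.get(agentId);
  if (!agent || agent.checkpoint === null) {
    return "No checkpoint found; start from the task description.";
  }
  return ["Resuming from checkpoint:", agent.checkpoint].join("\n");
}
```

A possible usage: an agent calls `writeCheckpoint(store, "polecat-1", { step: 3 })` before a risky operation, and after a restart `buildPrimeContext(store, "polecat-1")` returns text that includes the stored JSON.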
4. Proactive Git Push
The polecat system prompt instructs agents to push their branch after meaningful progress, not just at gt_done. This ensures the remote always has the latest committed state, which is the only git state that survives a container restart.
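At the git level, "commit and push after meaningful progress" could reduce to three commands. The helper below is a hypothetical illustration that only builds the argument lists; the function name and the default remote `origin` are assumptions, and nothing in this issue says agents run git through such a helper.

```typescript
// Hypothetical helper: returns argv arrays instead of executing git,
// so the sequence can be inspected or logged before it runs.
function checkpointPushCommands(
  branch: string,
  message: string,
  remote: string = "origin", // assumed remote name
): string[][] {
  if (branch.trim() === "" || message.trim() === "") {
    throw new Error("branch and message are required");
  }
  return [
    ["git", "add", "--all"],
    ["git", "commit", "--message", message],
    // --set-upstream makes the first push create the remote tracking branch.
    ["git", "push", "--set-upstream", remote, branch],
  ];
}
```

Note that `git commit` exits non-zero when there is nothing to commit, so a caller would need to treat that case as a no-op rather than a failure.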
Dependencies
- PR 4 (Town Container)
- PR 5 (Rig DO Alarm)
- PR 9 (Town DO — rig registry)
Acceptance Criteria
- Container startup sequence reads DO state and restores git environment
- Active agent worktrees re-created from remote branches on restart
- gt_checkpoint data included in gt_prime context for recovery
- System prompt updates instructing frequent commit/push
- Container health endpoint reports recovery progress
- Integration test: container sleep → wake → agents resume work
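For the health endpoint criterion, one possible response shape is sketched below. The per-rig state names, the `RecoveryHealth` fields, and the `summarizeRecovery` function are assumptions for illustration; this issue does not define the endpoint's payload.

```typescript
// Hypothetical per-rig recovery states tracked by the control server.
type RigRecoveryState = "pending" | "restoring" | "ready" | "failed";

// Hypothetical payload for the container health endpoint.
interface RecoveryHealth {
  status: "recovering" | "ready" | "degraded";
  total: number;
  ready: number;
  failed: string[]; // ids of rigs whose restore failed
}

// Collapses per-rig states into one summary the health endpoint could return.
function summarizeRecovery(
  rigs: Record<string, RigRecoveryState>,
): RecoveryHealth {
  const entries = Object.entries(rigs);
  const failed = entries
    .filter(([, state]) => state === "failed")
    .map(([id]) => id);
  const ready = entries.filter(([, state]) => state === "ready").length;
  const inProgress = entries.length - ready - failed.length;

  // Still recovering while any rig is pending or restoring;
  // otherwise degraded if anything failed, else ready.
  const status =
    inProgress > 0 ? "recovering" : failed.length > 0 ? "degraded" : "ready";

  return { status, total: entries.length, ready, failed };
}
```

An integration test for sleep and wake could poll such a summary until `status` leaves `"recovering"` before asserting that agents resumed.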