[Gastown] PR 18: Container Resilience — Checkpoint/Restore #269

@jrf0110

Parent: #204 | Phase 3: Multi-Rig + Scaling

Goal

Handle the ephemeral disk problem. When a container sleeps or dies, in-flight state must be recoverable from DO state and remote git branches.

Background

Cloudflare Containers have ephemeral disk — when a container sleeps or restarts, all filesystem state (git repos, worktrees, node_modules) is lost. Since all coordination state lives in DOs, the main recovery concern is git state.

Strategy

1. Git State Recovery

On container start, the control server reads DO state (the rig registry in the Town DO) to determine which rigs need repos cloned and which agents need worktrees:

Container starts → control server boots
→ Reads rig registry from Town DO
→ For each rig with active agents:
  → Clone repo (or pull if warm)
  → Create worktrees for active agent branches (branches exist on remote)
→ Report ready to DO
→ DO alarm dispatches pending agents
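
The startup sequence above can be sketched as a pure planning function that turns the rig registry into an ordered list of recovery steps. This is only an illustration: the `RigRecord`, `AgentRecord`, and `RecoveryStep` shapes and the `planRecovery` name are hypothetical, since the actual Town DO schema is not defined in this issue.

```typescript
// Hypothetical shapes; the real rig registry schema is not specified here.
interface AgentRecord {
  id: string;
  branch: string; // branch that already exists on the remote
  active: boolean;
}

interface RigRecord {
  id: string;
  repoUrl: string;
  agents: AgentRecord[];
}

type RecoveryStep =
  | { kind: "clone"; rigId: string; repoUrl: string }
  | { kind: "pull"; rigId: string }
  | { kind: "worktree"; rigId: string; agentId: string; branch: string }
  | { kind: "report_ready" };

// Builds the ordered recovery steps. `warmRigs` holds the ids of rigs whose
// repo is already present on disk, so a pull is enough.
function planRecovery(rigs: RigRecord[], warmRigs: Set<string>): RecoveryStep[] {
  const steps: RecoveryStep[] = [];
  for (const rig of rigs) {
    const active = rig.agents.filter((a) => a.active);
    if (active.length === 0) continue; // rigs without active agents need no repo
    steps.push(
      warmRigs.has(rig.id)
        ? { kind: "pull", rigId: rig.id }
        : { kind: "clone", rigId: rig.id, repoUrl: rig.repoUrl },
    );
    for (const agent of active) {
      steps.push({ kind: "worktree", rigId: rig.id, agentId: agent.id, branch: agent.branch });
    }
  }
  // Reporting ready is always the last step; the DO alarm takes over from there.
  steps.push({ kind: "report_ready" });
  return steps;
}
```

Keeping the plan separate from its execution would let the control server log or expose the remaining steps while recovery is running.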

2. Uncommitted Work Protection

Agents should commit and push frequently. The polecat system prompt instructs them to:

  • Commit after meaningful progress (not just at gt_done)
  • Push branch to remote after each commit
  • Use gt_checkpoint to write recovery metadata to the DO
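
As a rough illustration of the commit-and-push cycle described in the list above, the sketch below builds the git invocations for one cycle as argv arrays. The function name `commitAndPushCommands` is hypothetical, and the sketch assumes the remote is named `origin`.

```typescript
// Returns the git invocations (as argv arrays) for one commit-and-push cycle.
// Assumption: the remote is called "origin"; branch and message come from the caller.
function commitAndPushCommands(branch: string, message: string): string[][] {
  return [
    ["git", "add", "--all"], // stage every change in the worktree
    ["git", "commit", "--message", message], // exits non-zero if nothing is staged
    ["git", "push", "--set-upstream", "origin", branch], // publish the branch
  ];
}
```

Argv arrays avoid shell quoting problems with commit messages. Each array could be run from the agent's worktree with something like Node's `execFileSync(cmd[0], cmd.slice(1), { cwd })`.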

3. Checkpoint/Restore via DO

The gt_checkpoint tool writes JSON to the DO's agent record. On restart, gt_prime includes the checkpoint in the agent's context so it can resume from where it left off.
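
A minimal sketch of this round trip, using an in-memory map as a stand-in for the DO's agent records. The `AgentState` shape, the `AgentStore` class, and the wording of the generated context are all assumptions made for illustration; the real gt_checkpoint and gt_prime behavior is not specified in this issue.

```typescript
// Hypothetical agent record; only the `checkpoint` field matters here.
interface AgentState {
  id: string;
  branch: string;
  checkpoint?: { savedAt: string; data: unknown };
}

// In-memory stand-in for the DO's agent records.
class AgentStore {
  private agents = new Map<string, AgentState>();

  upsert(agent: AgentState): void {
    this.agents.set(agent.id, agent);
  }

  // Roughly what a gt_checkpoint call might do: store JSON on the agent record.
  checkpoint(agentId: string, data: unknown, now: Date = new Date()): void {
    const agent = this.agents.get(agentId);
    if (!agent) throw new Error(`unknown agent: ${agentId}`);
    // JSON.stringify yields undefined for values such as functions.
    const json: string | undefined = JSON.stringify(data);
    if (json === undefined) throw new Error("checkpoint data is not JSON-serializable");
    agent.checkpoint = { savedAt: now.toISOString(), data: JSON.parse(json) };
  }

  // Roughly what gt_prime might add to the agent's context on restart.
  primeContext(agentId: string): string {
    const agent = this.agents.get(agentId);
    if (!agent) throw new Error(`unknown agent: ${agentId}`);
    const lines = [`Agent ${agent.id} on branch ${agent.branch}.`];
    if (agent.checkpoint) {
      lines.push(
        `Last checkpoint (${agent.checkpoint.savedAt}):`,
        JSON.stringify(agent.checkpoint.data, null, 2),
      );
    } else {
      lines.push("No checkpoint recorded; resume from the latest commit on the branch.");
    }
    return lines.join("\n");
  }
}
```

Storing a timestamp alongside the data lets the resumed agent judge how stale the checkpoint is relative to the last pushed commit.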

4. Proactive Git Push

The polecat system prompt instructs agents to push their branch after meaningful progress, not just at gt_done, so the remote always holds the latest state for recovery.

Dependencies

  • PR 4 (Town Container)
  • PR 5 (Rig DO Alarm)
  • PR 9 (Town DO — rig registry)

Acceptance Criteria

  • Container startup sequence reads DO state and restores git environment
  • Active agent worktrees re-created from remote branches on restart
  • gt_checkpoint data included in gt_prime context for recovery
  • System prompt updates instructing frequent commit/push
  • Container health endpoint reports recovery progress
  • Integration test: container sleep → wake → agents resume work
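
For the health endpoint criterion, one possible payload shape is sketched below. The phase names, the field names, and the `recoveryHealth` function are hypothetical; the issue does not define what the endpoint returns.

```typescript
// Hypothetical recovery phases for the container health endpoint.
type RecoveryPhase = "booting" | "recovering" | "ready";

interface RecoveryHealth {
  phase: RecoveryPhase;
  worktreesRestored: number;
  worktreesTotal: number;
}

// Derives the health payload from simple counters kept by the control server.
function recoveryHealth(restored: number, total: number, started: boolean): RecoveryHealth {
  let phase: RecoveryPhase;
  if (!started) {
    phase = "booting"; // DO state has not been read yet
  } else if (restored < total) {
    phase = "recovering"; // some worktrees are still being re-created
  } else {
    phase = "ready"; // safe for the DO alarm to dispatch pending agents
  }
  return { phase, worktreesRestored: restored, worktreesTotal: total };
}
```

The integration test could poll this payload after waking the container and wait for `ready` before asserting that agents resume work.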
