feat: distributed agent cluster for massively parallel task execution #5

@sadnow

Context

oh-im-broke philosophy: Use what you have. Turn spare laptops into an agent swarm.

Current Limitations

  • Default concurrency: 5 agents per model
  • Single-machine bottleneck
  • Risk of out-of-memory (OOM) failures when many parallel agents run on one machine
  • Underutilized spare hardware (old laptops, extra desktops)

Vision

Transform heterogeneous machines (desktop + spare laptops with different specs) into a distributed agent cluster:

  • Master node: Main desktop running OpenCode
  • Worker nodes: Spare laptops/machines running agent workers
  • Simple setup: Docker-based, no complex Kubernetes
  • Resource-aware: Tasks distributed based on machine capabilities
  • Fault-tolerant: Worker failures don't crash entire system
  • Monitored: Real-time visibility into cluster health/utilization
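To make the "resource-aware" bullet concrete, here is a minimal sketch of how a master could rank workers by spare capacity. Everything in it is an assumption for illustration: the WorkerCapabilities shape, the per-agent memory figure, and the agents-per-core ratio are hypothetical and would need to be replaced by measured values.

```typescript
// Hypothetical shape of a worker's self-reported capabilities.
interface WorkerCapabilities {
  id: string;
  totalMemoryMb: number;
  cpuCores: number;
  activeAgents: number;
}

// Assumed costs per agent; real values would have to come from measurement.
const ASSUMED_MEMORY_PER_AGENT_MB = 1024;
const ASSUMED_AGENTS_PER_CORE = 2;

// Number of additional agents a worker could accept under the assumptions above.
function freeSlots(w: WorkerCapabilities): number {
  const byMemory = Math.floor(w.totalMemoryMb / ASSUMED_MEMORY_PER_AGENT_MB);
  const byCpu = w.cpuCores * ASSUMED_AGENTS_PER_CORE;
  return Math.max(0, Math.min(byMemory, byCpu) - w.activeAgents);
}

// Pick the worker with the most free slots, or undefined if every worker is full.
function pickWorker(workers: WorkerCapabilities[]): WorkerCapabilities | undefined {
  let best: WorkerCapabilities | undefined;
  for (const w of workers) {
    const slots = freeSlots(w);
    if (slots > 0 && (best === undefined || slots > freeSlots(best))) {
      best = w;
    }
  }
  return best;
}
```

With these assumed constants, a 32 GB / 16-core desktop would be capped at 32 agents and an 8 GB / 4-core laptop at 8, so mixed hardware is weighted by what it can actually hold rather than treated equally.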

Goals

  1. Massive Parallelism: Run 20+ concurrent agents across cluster
  2. Cost Efficiency: Use existing hardware instead of cloud
  3. Simple Setup: one-command startup, on the level of running docker-compose up
  4. Resource Monitoring: Track memory/CPU/agent utilization per node
  5. Graceful Degradation: Continue working if workers go offline

Architecture (Draft)

Components Needed

  1. Task Queue System

    • Distribute delegate_task calls to available workers
    • Priority queuing (high/medium/low)
    • Retry logic for failed tasks
  2. Worker Manager

    • Register/deregister workers dynamically
    • Health checks (heartbeat)
    • Capability reporting (CPU/RAM/GPU)
  3. Resource Monitor

    • Per-node metrics (memory, CPU, active agents)
    • Cluster-wide dashboard
    • Alerts for OOM/overload conditions
  4. Budget Orchestrator Extension

    • Track usage across all nodes
    • Distribute free-tier usage across workers
    • Failover when node exhausts resources
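As a rough illustration of the Task Queue component (priority lanes plus retry), here is an in-memory sketch. It is not tied to BullMQ or any other library listed under Research Needed; the TaskQueue class, QueuedTask shape, and maxAttempts default are all hypothetical names chosen for this example, and a real implementation would need persistence and network transport.

```typescript
type Priority = "high" | "medium" | "low";

// Hypothetical record for one queued delegate_task call.
interface QueuedTask {
  id: string;
  priority: Priority;
  attempts: number;
}

const PRIORITY_ORDER: Priority[] = ["high", "medium", "low"];

class TaskQueue {
  private lanes: Record<Priority, QueuedTask[]> = { high: [], medium: [], low: [] };
  // Tasks that ran out of attempts end up here for inspection.
  readonly failed: QueuedTask[] = [];
  private maxAttempts: number;

  constructor(maxAttempts: number = 3) {
    this.maxAttempts = maxAttempts;
  }

  enqueue(id: string, priority: Priority): void {
    this.lanes[priority].push({ id, priority, attempts: 0 });
  }

  // Take the oldest task from the highest-priority non-empty lane.
  dequeue(): QueuedTask | undefined {
    for (const p of PRIORITY_ORDER) {
      const task = this.lanes[p].shift();
      if (task) {
        task.attempts += 1;
        return task;
      }
    }
    return undefined;
  }

  // Requeue a failed task, or park it in `failed` once attempts are exhausted.
  // Returns true if the task will be retried.
  reportFailure(task: QueuedTask): boolean {
    if (task.attempts >= this.maxAttempts) {
      this.failed.push(task);
      return false;
    }
    this.lanes[task.priority].push(task);
    return true;
  }
}
```

A master loop would call dequeue() whenever a worker has capacity and reportFailure() when a worker errors out or disappears mid-task. Strict priority lanes can starve low-priority work, which may be acceptable for a first prototype.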

Existing Infrastructure to Leverage

  • ✅ Background agent manager (src/features/background-agent/manager.ts)
  • ✅ Concurrency limits (src/features/background-agent/concurrency.ts)
  • ✅ WebUI with stats/monitoring endpoints
  • ✅ Budget orchestrator with provider tracking
  • ✅ Usage tracking system
  • ⚠️ Needs extension: all of the above are currently single-node

Research Needed

  • Lightweight orchestration options (Docker Swarm, K3s, Nomad, etc.)
  • Task queue systems (Celery, BullMQ, RQ, etc.)
  • Similar projects (LLM agent clusters, distributed AI inference)
  • Resource monitoring tools compatible with heterogeneous hardware
  • Network protocols for task distribution (gRPC, WebSockets, HTTP/2)

Proposed Sub-Issues

  1. Resource monitoring & OOM detection (prerequisite)
  2. Task queue abstraction layer
  3. Worker node implementation (Docker image)
  4. Master-worker communication protocol
  5. Cluster configuration & discovery
  6. WebUI cluster dashboard
  7. Budget orchestrator cluster support
  8. Fault tolerance & failover logic
  9. Documentation & setup guide

Success Criteria

  • Run 20+ concurrent agents across 3+ machines
  • Automatic worker registration/deregistration
  • Real-time cluster health monitoring
  • Graceful handling of worker failures
  • Setup time < 30 minutes for new worker node
  • All existing features work unchanged (backward compatible)
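The registration/deregistration and failure-handling criteria could be met with a heartbeat registry along these lines. This is only a sketch: WorkerRegistry, its method names, and the timeout value are hypothetical, and timestamps are passed in explicitly so the logic can be tested without a real clock.

```typescript
// Tracks the last heartbeat per worker and evicts workers that go quiet.
class WorkerRegistry {
  private lastSeen = new Map<string, number>();
  private timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // A first heartbeat registers the worker; later ones refresh it.
  heartbeat(workerId: string, nowMs: number): void {
    this.lastSeen.set(workerId, nowMs);
  }

  // Explicit, graceful deregistration (e.g. on worker shutdown).
  deregister(workerId: string): void {
    this.lastSeen.delete(workerId);
  }

  // Remove workers whose last heartbeat is older than the timeout.
  // Returns the evicted ids so their in-flight tasks can be requeued.
  evictStale(nowMs: number): string[] {
    const evicted: string[] = [];
    this.lastSeen.forEach((seen, id) => {
      if (nowMs - seen > this.timeoutMs) {
        evicted.push(id);
      }
    });
    for (const id of evicted) {
      this.lastSeen.delete(id);
    }
    return evicted;
  }

  liveWorkers(): string[] {
    return Array.from(this.lastSeen.keys());
  }
}
```

The master would call evictStale(Date.now()) on a timer and hand the returned ids to whatever retry logic the task queue provides, so a laptop going to sleep degrades capacity instead of failing the run.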

Non-Goals (for initial implementation)

  • ❌ Perfect load balancing (simple round-robin OK)
  • ❌ Auto-scaling (manual worker addition OK)
  • ❌ GPU distribution (future enhancement)
  • ❌ Cross-internet workers (LAN only initially)
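For reference, the "simple round-robin" fallback mentioned above can be very small. The sketch below assumes the caller passes in the current list of live worker ids on every call; the RoundRobin class name is hypothetical.

```typescript
// Cycles through whatever worker list it is given on each call.
class RoundRobin {
  private next = 0;

  // Returns the next worker id, or undefined if no workers are available.
  pick(workers: string[]): string | undefined {
    if (workers.length === 0) {
      return undefined;
    }
    // The modulo keeps the index valid even if the list shrank since the last call.
    const index = this.next % workers.length;
    this.next = (index + 1) % workers.length;
    return workers[index];
  }
}
```

Because the list is supplied per call, workers that drop offline are skipped on the next pick without any extra bookkeeping.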

Related Issues


Timeline: Research → Prototype → Incremental implementation
Priority: Medium (after current budget orchestrator work completes)
