feat: distributed agent cluster for massively parallel task execution

## Context

oh-im-broke philosophy: Use what you have. Turn spare laptops into an agent swarm.

### Current Limitations
- Default concurrency: 5 agents per model
- Single-machine bottleneck
- Potential OOM when running many parallel agents
- Underutilized spare hardware (old laptops, extra desktops)

### Vision
Transform heterogeneous machines (desktop + spare laptops with different specs) into a distributed agent cluster:
- **Master node**: Main desktop running OpenCode
- **Worker nodes**: Spare laptops/machines running agent workers
- **Simple setup**: Docker-based, no complex Kubernetes
- **Resource-aware**: Tasks distributed based on machine capabilities
- **Fault-tolerant**: Worker failures don't crash entire system
- **Monitored**: Real-time visibility into cluster health/utilization
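To make the worker-node and fault-tolerance bullets concrete, here is a minimal sketch of how the master might track workers. Everything in it is an assumption for illustration: `WorkerRegistry`, `WorkerCapabilities`, and the timeout-based liveness rule are hypothetical names and choices, not part of the existing codebase. The clock is passed in explicitly so the logic is easy to test.

```typescript
// Hypothetical types for illustration; none of these exist in the codebase today.
interface WorkerCapabilities {
  cpuCores: number;
  ramMb: number;
}

interface WorkerRecord {
  id: string;
  capabilities: WorkerCapabilities;
  lastHeartbeatMs: number;
}

class WorkerRegistry {
  private workers = new Map<string, WorkerRecord>();
  private timeoutMs: number;

  // timeoutMs: how long a worker may stay silent before it counts as offline.
  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  register(id: string, capabilities: WorkerCapabilities, nowMs: number): void {
    this.workers.set(id, { id, capabilities, lastHeartbeatMs: nowMs });
  }

  deregister(id: string): void {
    this.workers.delete(id);
  }

  // Returns false for unknown workers so the caller can ask them to re-register.
  heartbeat(id: string, nowMs: number): boolean {
    const worker = this.workers.get(id);
    if (!worker) return false;
    worker.lastHeartbeatMs = nowMs;
    return true;
  }

  // Workers whose last heartbeat falls within the timeout window.
  healthy(nowMs: number): WorkerRecord[] {
    return Array.from(this.workers.values()).filter(
      (w) => nowMs - w.lastHeartbeatMs <= this.timeoutMs,
    );
  }
}
```

A worker that stops sending heartbeats simply drops out of `healthy()`, so the master keeps running with the remaining nodes instead of failing.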

## Goals

1. **Massive Parallelism**: Run 20+ concurrent agents across the cluster
2. **Cost Efficiency**: Use existing hardware instead of cloud
3. **Simple Setup**: `docker-compose up` level simplicity
4. **Resource Monitoring**: Track memory/CPU/agent utilization per node
5. **Graceful Degradation**: Continue working if workers go offline

## Architecture (Draft)

### Components Needed

1. **Task Queue System**
   - Distribute `delegate_task` calls to available workers
   - Priority queuing (high/medium/low)
   - Retry logic for failed tasks

2. **Worker Manager**
   - Register/deregister workers dynamically
   - Health checks (heartbeat)
   - Capability reporting (CPU/RAM/GPU)

3. **Resource Monitor**
   - Per-node metrics (memory, CPU, active agents)
   - Cluster-wide dashboard
   - Alerts for OOM/overload conditions

4. **Budget Orchestrator Extension**
   - Track usage across all nodes
   - Distribute free-tier usage across workers
   - Fail over when a node exhausts its resources
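As a rough illustration of the Task Queue System component, the sketch below keeps one FIFO lane per priority level and re-queues failed tasks until an attempt limit is reached. It is an in-memory stand-in only: `TaskQueue`, `QueuedTask`, and `maxAttempts` are hypothetical names, and a real implementation would likely sit on top of whichever queue system the research phase selects.

```typescript
// Hypothetical in-memory queue; names and retry policy are assumptions.
type Priority = "high" | "medium" | "low";

interface QueuedTask {
  id: string;
  priority: Priority;
  attempts: number;
}

const PRIORITY_ORDER: Priority[] = ["high", "medium", "low"];

class TaskQueue {
  private lanes: Record<Priority, QueuedTask[]> = { high: [], medium: [], low: [] };
  private maxAttempts: number;

  // maxAttempts: total number of times a task may be handed to a worker.
  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  enqueue(id: string, priority: Priority): void {
    this.lanes[priority].push({ id, priority, attempts: 0 });
  }

  // Takes the oldest task from the highest non-empty priority lane.
  dequeue(): QueuedTask | undefined {
    for (const p of PRIORITY_ORDER) {
      const task = this.lanes[p].shift();
      if (task) {
        task.attempts += 1;
        return task;
      }
    }
    return undefined;
  }

  // Re-queues a failed task unless it has used up its attempts.
  // Returns true if the task will be retried.
  reportFailure(task: QueuedTask): boolean {
    if (task.attempts >= this.maxAttempts) return false;
    this.lanes[task.priority].push(task);
    return true;
  }
}
```

With this shape, a task lost to a crashed worker can be passed to `reportFailure` and picked up by another node, which ties the retry logic to the fault-tolerance goal.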

### Existing Infrastructure to Leverage

- ✅ Background agent manager (`src/features/background-agent/manager.ts`)
- ✅ Concurrency limits (`src/features/background-agent/concurrency.ts`)
- ✅ WebUI with stats/monitoring endpoints
- ✅ Budget orchestrator with provider tracking
- ✅ Usage tracking system
- ⚠️ **Needs extension**: All of the above are currently single-node

## Research Needed

- [ ] Lightweight orchestration options (Docker Swarm, K3s, Nomad, etc.)
- [ ] Task queue systems (Celery, BullMQ, RQ, etc.)
- [ ] Similar projects (LLM agent clusters, distributed AI inference)
- [ ] Resource monitoring tools compatible with heterogeneous hardware
- [ ] Network protocols for task distribution (gRPC, WebSockets, HTTP/2)

## Proposed Sub-Issues

1. Resource monitoring & OOM detection (prerequisite)
2. Task queue abstraction layer
3. Worker node implementation (Docker image)
4. Master-worker communication protocol
5. Cluster configuration & discovery
6. WebUI cluster dashboard
7. Budget orchestrator cluster support
8. Fault tolerance & failover logic
9. Documentation & setup guide

## Success Criteria

- [ ] Run 20+ concurrent agents across 3+ machines
- [ ] Automatic worker registration/deregistration
- [ ] Real-time cluster health monitoring
- [ ] Graceful handling of worker failures
- [ ] Setup time < 30 minutes for new worker node
- [ ] All existing features work unchanged (backward compatible)

## Non-Goals (for initial implementation)

- ❌ Perfect load balancing (simple round-robin OK)
- ❌ Auto-scaling (manual worker addition OK)
- ❌ GPU distribution (future enhancement)
- ❌ Cross-internet workers (LAN only initially)
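For reference, the "simple round-robin" that the first non-goal accepts could look like the sketch below. `RoundRobinDispatcher` is a hypothetical name, and the list of worker IDs is assumed to come from whatever component tracks which workers are currently online.

```typescript
// Hypothetical dispatcher; it ignores machine capabilities on purpose.
class RoundRobinDispatcher {
  private cursor = 0;

  // Picks the next worker in rotation; returns undefined when none are available.
  // The modulo keeps the cursor valid even if the worker list shrinks between calls.
  next(workerIds: string[]): string | undefined {
    if (workerIds.length === 0) return undefined;
    const chosen = workerIds[this.cursor % workerIds.length];
    this.cursor = (this.cursor + 1) % workerIds.length;
    return chosen;
  }
}
```

Because the caller supplies the current worker list on every call, a worker that goes offline is skipped as soon as it is removed from that list.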

## Related Issues

- #2 - Hybrid provider pricing (free tier distribution across cluster)
- (To be created: Resource monitoring)
- (To be created: OOM detection)

---

**Timeline**: Research → Prototype → Incremental implementation

**Priority**: Medium (after current budget orchestrator work completes)
