Labels: area/kubernetes, priority/high, size/XL, type/feature, type/infrastructure
Description
Summary
Deploy Roboflow as long-running worker pods (a Deployment) instead of ephemeral K8s Jobs. Workers claim jobs from the TiKV queue and run continuously.
Status: READY TO START 🔥
Next in critical path after #48 (Graceful Shutdown)
Dependencies
- ✅ feat: [Phase 2.4] Implement cached storage backend with local buffer #35 (Pipeline Integration) - COMPLETE
- ✅ [Phase 7.2] Add graceful shutdown handling [READY 🔥] #48 (Graceful Shutdown) - Ready to start
- Enables: ➡️ [Phase 6.2] Create container image and Helm chart [BLOCKED by #18] #20 (Container Image and Helm Chart)
Architecture Change
Previous Design (K8s Jobs):
- Controller creates K8s Job per manifest
- Job runs, completes, deleted
- New job for each file
New Design (Long-running Workers):
- Deployment with N replicas
- Each pod runs worker loop continuously
- Claims jobs from TiKV queue
- No controller needed for job creation
Deployment Model
Deployment: roboflow-worker
├── replicas: N (auto-scaled)
├── Pod Template
│   ├── Container: roboflow
│   │   ├── Command: roboflow worker
│   │   ├── Resources: 1 GPU, 4 CPU, 16GB RAM
│   │   └── Env: TIKV_PD_ENDPOINTS, OSS credentials
│   └── Volumes: (none - stateless)
└── HPA: Scale on pending jobs metric
Tasks
1. Update Binary Entry Point
- Add `worker` subcommand to CLI in `src/bin/roboflow.rs`
- `roboflow worker` starts the worker loop
- Configuration via env vars or config file
2. Create Deployment Manifest
- Create `deploy/k8s/deployment.yaml`
- Define Deployment (see the sketch below):
  - Name: roboflow-worker
  - Replicas: configurable
  - Pod template with worker container
  - Resource requests/limits:
    - GPU: 1
    - CPU: 4 (request), 8 (limit)
    - Memory: 16GB (request), 32GB (limit)
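A minimal sketch of what `deploy/k8s/deployment.yaml` could look like, also covering the `roboflow worker` entry point from task 1. The namespace, image tag, config/secret names, and metrics port are placeholders, and the GPU is assumed to be scheduled via the `nvidia.com/gpu` resource:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: roboflow-worker
  namespace: roboflow
spec:
  replicas: 4                      # starting point; the HPA (task 5) takes over scaling
  selector:
    matchLabels:
      app: roboflow-worker
  template:
    metadata:
      labels:
        app: roboflow-worker
    spec:
      terminationGracePeriodSeconds: 120   # room for #48 graceful shutdown to checkpoint
      containers:
        - name: roboflow
          image: roboflow:latest           # placeholder; the real image comes from #20
          command: ["roboflow", "worker"]  # the new subcommand from task 1
          envFrom:
            - configMapRef:
                name: roboflow-config      # task 3
            - secretRef:
                name: roboflow-oss-credentials  # task 4
          ports:
            - name: metrics
              containerPort: 8080          # assumed metrics/health port
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: 1
            limits:
              cpu: "8"
              memory: 32Gi
              nvidia.com/gpu: 1
```

Kubernetes requires extended resources like GPUs to have equal request and limit, hence `nvidia.com/gpu: 1` on both sides.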
3. Create ConfigMap
- Create `deploy/k8s/configmap.yaml`
- Contents (sketched below):
  - TiKV PD endpoints
  - Input/output bucket names
  - Pipeline configuration
  - Checkpoint intervals
  - Logging level
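A sketch of `deploy/k8s/configmap.yaml`; every key name and value here is illustrative, not final:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: roboflow-config
  namespace: roboflow
data:
  TIKV_PD_ENDPOINTS: "pd-0.pd:2379,pd-1.pd:2379,pd-2.pd:2379"
  INPUT_BUCKET: "roboflow-input"
  OUTPUT_BUCKET: "roboflow-output"
  CHECKPOINT_INTERVAL_SECS: "60"
  LOG_LEVEL: "info"
```

Workers pick these up as env vars through the `envFrom` stanza in the Deployment above.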
4. Create Secret References
- Create `deploy/k8s/secrets.yaml` (template, sketched below)
- Reference existing secrets for:
  - OSS credentials
  - TiKV credentials (if needed)
- Or use IRSA/workload identity
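A template sketch for `deploy/k8s/secrets.yaml`; the key names are assumptions and real values are never committed. Where IRSA/workload identity is available, it replaces this file entirely:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: roboflow-oss-credentials
  namespace: roboflow
type: Opaque
stringData:
  OSS_ACCESS_KEY_ID: "<fill in at deploy time>"
  OSS_ACCESS_KEY_SECRET: "<fill in at deploy time>"
```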
5. Create HorizontalPodAutoscaler
- Create `deploy/k8s/hpa.yaml` (sketched below)
- Scale based on:
  - Custom metric: pending_jobs in TiKV
  - Or: CPU utilization (simpler)
- Min replicas: 2
- Max replicas: configurable (default: 100)
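A sketch of the simpler CPU-based option; the `pending_jobs` variant is left as a comment because it needs a custom-metrics adapter (e.g. prometheus-adapter) exposing the TiKV queue depth to the metrics API:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: roboflow-worker
  namespace: roboflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: roboflow-worker
  minReplicas: 2
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # placeholder target
    # Custom-metric alternative (requires an adapter):
    # - type: External
    #   external:
    #     metric:
    #       name: pending_jobs
    #     target:
    #       type: AverageValue
    #       averageValue: "10"     # ~10 queued jobs per worker
```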
6. Create PodDisruptionBudget
- Create `deploy/k8s/pdb.yaml` (sketched below)
- Ensure at least N pods running during updates
- Allow rolling updates without job loss
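A sketch with a placeholder threshold. Worth noting: a PDB only limits voluntary disruptions (node drains, evictions); rolling-update surge itself is governed by the Deployment's `strategy` settings:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: roboflow-worker
  namespace: roboflow
spec:
  minAvailable: 1                  # placeholder for "at least N"
  selector:
    matchLabels:
      app: roboflow-worker
```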
7. Create Scanner Deployment
- Create `deploy/k8s/scanner.yaml` (sketched below)
- Single replica (leader elected)
- Or: include the scanner in worker pods
- Recommendation: separate deployment for clarity
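A sketch of the separate-deployment option. The `roboflow scanner` subcommand is hypothetical (mirroring `roboflow worker`), and `strategy: Recreate` serves as a cheap stand-in for real leader election so two scanners never overlap during a rollout:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: roboflow-scanner
  namespace: roboflow
spec:
  replicas: 1
  strategy:
    type: Recreate                 # old pod exits before the new one starts
  selector:
    matchLabels:
      app: roboflow-scanner
  template:
    metadata:
      labels:
        app: roboflow-scanner
    spec:
      containers:
        - name: roboflow
          image: roboflow:latest   # same placeholder image as the workers
          command: ["roboflow", "scanner"]  # hypothetical subcommand
          envFrom:
            - configMapRef:
                name: roboflow-config
```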
8. Add Health Probes
- Liveness probe (see the sketch after this list):
  - HTTP endpoint `/health/live`
  - Or: check heartbeat age
- Readiness probe:
  - HTTP endpoint `/health/ready`
  - Ready when connected to TiKV
- Startup probe:
  - Allow time for initial connection
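A sketch of the probe stanzas for the worker container spec; the port and thresholds are assumptions:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080                     # assumed health/metrics port
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready            # ready only once the TiKV connection is up
    port: 8080
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 30             # up to ~2.5 min for the initial connection
```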
9. Add Prometheus ServiceMonitor
- Create `deploy/k8s/servicemonitor.yaml` (sketched below)
- Scrape metrics from workers
- For use with kube-prometheus-stack
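A sketch, assuming a named `metrics` port on the Service from `service.yaml` and the usual kube-prometheus-stack `release` label convention:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: roboflow-worker
  namespace: roboflow
  labels:
    release: kube-prometheus-stack # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: roboflow-worker         # selects the metrics Service, not the pods directly
  endpoints:
    - port: metrics                # named port on the Service
      interval: 30s
```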
Kubernetes Resources
deploy/k8s/
├── namespace.yaml
├── configmap.yaml
├── secrets.yaml (template)
├── deployment.yaml (workers)
├── scanner.yaml (scanner)
├── service.yaml (metrics)
├── hpa.yaml
├── pdb.yaml
└── servicemonitor.yaml
Files to Create
- `deploy/k8s/namespace.yaml`
- `deploy/k8s/configmap.yaml`
- `deploy/k8s/secrets.yaml`
- `deploy/k8s/deployment.yaml`
- `deploy/k8s/scanner.yaml`
- `deploy/k8s/service.yaml`
- `deploy/k8s/hpa.yaml`
- `deploy/k8s/pdb.yaml`
- `deploy/k8s/servicemonitor.yaml`
Files to Modify
- `src/bin/roboflow.rs` (add `worker` subcommand)
Acceptance Criteria
- `roboflow worker` command works
- Deployment manifest creates pods
- Workers connect to TiKV and claim jobs
- ConfigMap configures workers
- Secrets mounted correctly
- HPA scales on pending jobs
- PDB prevents disruption
- Health probes work
- Rolling updates work
- Integration test in K8s cluster