Labels: area/kubernetes, priority/high, size/XL, type/feature, type/infrastructure
Description
Summary
Deploy Roboflow as long-running worker pods (a Deployment) instead of ephemeral K8s Jobs. Workers claim jobs from the TiKV queue and run continuously.
Status: READY TO START 🔥
Next in critical path after #48 (Graceful Shutdown)
Dependencies
- ✅ feat: [Phase 2.4] Implement cached storage backend with local buffer #35 (Pipeline Integration) - COMPLETE
- ✅ [Phase 7.2] Add graceful shutdown handling [READY 🔥] #48 (Graceful Shutdown) - Ready to start
- Enables: ➡️ [Phase 6.2] Create container image and Helm chart [BLOCKED by #18] #20 (Container Image and Helm Chart)
Architecture Change
Previous Design (K8s Jobs):
- Controller creates K8s Job per manifest
- Job runs, completes, deleted
- New job for each file
New Design (Long-running Workers):
- Deployment with N replicas
- Each pod runs worker loop continuously
- Claims jobs from TiKV queue
- No controller needed for job creation
Deployment Model
Deployment: roboflow-worker
├── replicas: N (auto-scaled)
├── Pod Template
│   ├── Container: roboflow
│   │   ├── Command: roboflow worker
│   │   ├── Resources: 1 GPU, 4 CPU, 16GB RAM
│   │   └── Env: TIKV_PD_ENDPOINTS, OSS credentials
│   └── Volumes: (none - stateless)
└── HPA: Scale on pending jobs metric
Tasks
1. Update Binary Entry Point
- Add `worker` subcommand to CLI in `src/bin/roboflow.rs`
- `roboflow worker` starts the worker loop
- Configuration via env vars or config file
2. Create Deployment Manifest
- Create `deploy/k8s/deployment.yaml`
- Define Deployment (see the sketch below):
  - Name: roboflow-worker
  - Replicas: configurable
  - Pod template with worker container
  - Resource requests/limits:
    - GPU: 1
    - CPU: 4 (request), 8 (limit)
    - Memory: 16GB (request), 32GB (limit)
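A minimal sketch of what `deploy/k8s/deployment.yaml` could look like, also covering the `roboflow worker` entry point from task 1. The namespace, image tag, config/secret names, and metrics port are placeholders, and the GPU is assumed to be scheduled via the `nvidia.com/gpu` resource:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: roboflow-worker
  namespace: roboflow
spec:
  replicas: 4                      # starting point; the HPA (task 5) takes over scaling
  selector:
    matchLabels:
      app: roboflow-worker
  template:
    metadata:
      labels:
        app: roboflow-worker
    spec:
      terminationGracePeriodSeconds: 120   # room for #48 graceful shutdown to checkpoint
      containers:
        - name: roboflow
          image: roboflow:latest           # placeholder; the real image comes from #20
          command: ["roboflow", "worker"]  # the new subcommand from task 1
          envFrom:
            - configMapRef:
                name: roboflow-config      # task 3
            - secretRef:
                name: roboflow-oss-credentials  # task 4
          ports:
            - name: metrics
              containerPort: 8080          # assumed metrics/health port
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: 1
            limits:
              cpu: "8"
              memory: 32Gi
              nvidia.com/gpu: 1
```

Kubernetes requires extended resources like GPUs to have equal request and limit, hence `nvidia.com/gpu: 1` on both sides.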
3. Create ConfigMap
- Create `deploy/k8s/configmap.yaml`
- Contents (sketched below):
  - TiKV PD endpoints
  - Input/output bucket names
  - Pipeline configuration
  - Checkpoint intervals
  - Logging level
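A sketch of `deploy/k8s/configmap.yaml`; every key name and value here is illustrative, not final:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: roboflow-config
  namespace: roboflow
data:
  TIKV_PD_ENDPOINTS: "pd-0.pd:2379,pd-1.pd:2379,pd-2.pd:2379"
  INPUT_BUCKET: "roboflow-input"
  OUTPUT_BUCKET: "roboflow-output"
  CHECKPOINT_INTERVAL_SECS: "60"
  LOG_LEVEL: "info"
```

Workers pick these up as env vars through the `envFrom` stanza in the Deployment above.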
4. Create Secret References
- Create `deploy/k8s/secrets.yaml` (template, sketched below)
- Reference existing secrets for:
  - OSS credentials
  - TiKV credentials (if needed)
- Or use IRSA/workload identity
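A template sketch for `deploy/k8s/secrets.yaml`; the key names are assumptions and real values are never committed. Where IRSA/workload identity is available, it replaces this file entirely:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: roboflow-oss-credentials
  namespace: roboflow
type: Opaque
stringData:
  OSS_ACCESS_KEY_ID: "<fill in at deploy time>"
  OSS_ACCESS_KEY_SECRET: "<fill in at deploy time>"
```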
5. Create HorizontalPodAutoscaler
- Create `deploy/k8s/hpa.yaml` (sketched below)
- Scale based on:
  - Custom metric: pending_jobs in TiKV
  - Or: CPU utilization (simpler)
- Min replicas: 2
- Max replicas: configurable (default: 100)
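A sketch of the simpler CPU-based option; the `pending_jobs` variant is left as a comment because it needs a custom-metrics adapter (e.g. prometheus-adapter) exposing the TiKV queue depth to the metrics API:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: roboflow-worker
  namespace: roboflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: roboflow-worker
  minReplicas: 2
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # placeholder target
    # Custom-metric alternative (requires an adapter):
    # - type: External
    #   external:
    #     metric:
    #       name: pending_jobs
    #     target:
    #       type: AverageValue
    #       averageValue: "10"     # ~10 queued jobs per worker
```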
6. Create PodDisruptionBudget
- Create `deploy/k8s/pdb.yaml` (sketched below)
- Ensure at least N pods running during updates
- Allow rolling updates without job loss
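A sketch with a placeholder threshold. Worth noting: a PDB only limits voluntary disruptions (node drains, evictions); rolling-update surge itself is governed by the Deployment's `strategy` settings:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: roboflow-worker
  namespace: roboflow
spec:
  minAvailable: 1                  # placeholder for "at least N"
  selector:
    matchLabels:
      app: roboflow-worker
```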
7. Create Scanner Deployment
- Create `deploy/k8s/scanner.yaml` (sketched below)
- Single replica (leader elected)
- Or: include the scanner in worker pods
- Recommendation: separate deployment for clarity
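A sketch of the separate-deployment option. The `roboflow scanner` subcommand is hypothetical (mirroring `roboflow worker`), and `strategy: Recreate` serves as a cheap stand-in for real leader election so two scanners never overlap during a rollout:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: roboflow-scanner
  namespace: roboflow
spec:
  replicas: 1
  strategy:
    type: Recreate                 # old pod exits before the new one starts
  selector:
    matchLabels:
      app: roboflow-scanner
  template:
    metadata:
      labels:
        app: roboflow-scanner
    spec:
      containers:
        - name: roboflow
          image: roboflow:latest   # same placeholder image as the workers
          command: ["roboflow", "scanner"]  # hypothetical subcommand
          envFrom:
            - configMapRef:
                name: roboflow-config
```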
8. Add Health Probes
- Liveness probe (see the sketch after this list):
  - HTTP endpoint `/health/live`
  - Or: check heartbeat age
- Readiness probe:
  - HTTP endpoint `/health/ready`
  - Ready when connected to TiKV
- Startup probe:
  - Allow time for initial connection
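A sketch of the probe stanzas for the worker container spec; the port and thresholds are assumptions:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080                     # assumed health/metrics port
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready            # ready only once the TiKV connection is up
    port: 8080
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 30             # up to ~2.5 min for the initial connection
```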
9. Add Prometheus ServiceMonitor
- Create `deploy/k8s/servicemonitor.yaml` (sketched below)
- Scrape metrics from workers
- For use with kube-prometheus-stack
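A sketch, assuming a named `metrics` port on the Service from `service.yaml` and the usual kube-prometheus-stack `release` label convention:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: roboflow-worker
  namespace: roboflow
  labels:
    release: kube-prometheus-stack # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: roboflow-worker         # selects the metrics Service, not the pods directly
  endpoints:
    - port: metrics                # named port on the Service
      interval: 30s
```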
Kubernetes Resources
deploy/k8s/
├── namespace.yaml
├── configmap.yaml
├── secrets.yaml (template)
├── deployment.yaml (workers)
├── scanner.yaml (scanner)
├── service.yaml (metrics)
├── hpa.yaml
├── pdb.yaml
└── servicemonitor.yaml
Files to Create
- `deploy/k8s/namespace.yaml`
- `deploy/k8s/configmap.yaml`
- `deploy/k8s/secrets.yaml`
- `deploy/k8s/deployment.yaml`
- `deploy/k8s/scanner.yaml`
- `deploy/k8s/service.yaml`
- `deploy/k8s/hpa.yaml`
- `deploy/k8s/pdb.yaml`
- `deploy/k8s/servicemonitor.yaml`
Files to Modify
- `src/bin/roboflow.rs` (add `worker` subcommand)
Acceptance Criteria
- `roboflow worker` command works
- Deployment manifest creates pods
- Workers connect to TiKV and claim jobs
- ConfigMap configures workers
- Secrets mounted correctly
- HPA scales on pending jobs
- PDB prevents disruption
- Health probes work
- Rolling updates work
- Integration test in K8s cluster