[Phase 9.1] Implement long-running Worker Deployment [READY TO START] #18

@zhexuany

Summary

Deploy Roboflow as long-running worker pods (a Kubernetes Deployment) instead of ephemeral K8s Jobs. Workers claim jobs from the TiKV queue and run continuously.

Status: READY TO START 🔥

Next in critical path after #48 (Graceful Shutdown)

Dependencies

Architecture Change

Previous Design (K8s Jobs):

  • Controller creates one K8s Job per manifest
  • Job runs, completes, and is deleted
  • A new Job is created for each file

New Design (Long-running Workers):

  • Deployment with N replicas
  • Each pod runs worker loop continuously
  • Claims jobs from TiKV queue
  • No controller needed for job creation
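
The claim-and-process loop that each pod would run can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: `JobQueue`, `InMemoryQueue`, and `run_worker` are hypothetical names, and an in-memory queue stands in for the TiKV-backed queue so the example runs without a cluster.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical abstraction over the job queue. In the real system the
/// claim would presumably be a transactional operation against TiKV.
pub trait JobQueue {
    /// Take ownership of the next pending job, if any.
    fn claim(&mut self) -> Option<String>;
    /// Mark a claimed job as finished.
    fn complete(&mut self, job_id: &str);
}

/// In-memory stand-in for the TiKV queue (assumption, for illustration only).
#[derive(Default)]
pub struct InMemoryQueue {
    pub pending: VecDeque<String>,
    pub completed: Vec<String>,
}

impl JobQueue for InMemoryQueue {
    fn claim(&mut self) -> Option<String> {
        self.pending.pop_front()
    }

    fn complete(&mut self, job_id: &str) {
        self.completed.push(job_id.to_string());
    }
}

/// Claim and process jobs until shutdown is requested or the queue is empty.
/// Returns the number of jobs processed.
pub fn run_worker<Q: JobQueue>(
    queue: &mut Q,
    shutdown: &AtomicBool,
    mut process: impl FnMut(&str),
) -> usize {
    let mut processed = 0;
    // The shutdown flag is checked between jobs, never mid-job.
    while !shutdown.load(Ordering::SeqCst) {
        match queue.claim() {
            Some(job_id) => {
                process(&job_id);
                queue.complete(&job_id);
                processed += 1;
            }
            // A long-running worker would sleep and poll again here;
            // the sketch returns so that it terminates.
            None => break,
        }
    }
    processed
}
```

Checking the shutdown flag only between jobs means a pod that receives a termination signal finishes its current job before exiting, which is the behavior the graceful-shutdown work in #48 is meant to provide.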

Deployment Model

Deployment: roboflow-worker
├── replicas: N (auto-scaled)
├── Pod Template
│   ├── Container: roboflow
│   │   ├── Command: roboflow worker
│   │   ├── Resources: 1 GPU, 4 CPU, 16GB RAM
│   │   └── Env: TIKV_PD_ENDPOINTS, OSS credentials
│   └── Volumes: (none - stateless)
└── HPA: Scale on pending jobs metric

Tasks

1. Update Binary Entry Point

  1. Add worker subcommand to CLI in src/bin/roboflow.rs
  2. roboflow worker starts worker loop
  3. Configuration via env vars or config file
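
One possible shape for the subcommand dispatch and env-based configuration, using only the standard library. Treat this as a sketch: the `Command` enum, `WorkerConfig`, the comma-separated format of `TIKV_PD_ENDPOINTS`, and the `ROBOFLOW_LOG_LEVEL` variable are assumptions for illustration. The actual CLI in src/bin/roboflow.rs may use an argument-parsing crate instead.

```rust
/// Subcommands understood by the binary. Only `worker` comes from this
/// issue; the enum shape itself is an assumption.
#[derive(Debug, PartialEq)]
pub enum Command {
    Worker,
}

/// Parse the arguments that follow the program name.
pub fn parse_command(args: &[String]) -> Result<Command, String> {
    match args.first().map(String::as_str) {
        Some("worker") => Ok(Command::Worker),
        Some(other) => Err(format!("unknown subcommand: {}", other)),
        None => Err("missing subcommand (expected `worker`)".to_string()),
    }
}

/// Hypothetical worker configuration read from the environment.
#[derive(Debug, PartialEq)]
pub struct WorkerConfig {
    pub pd_endpoints: Vec<String>,
    pub log_level: String,
}

impl WorkerConfig {
    /// Build the config from a lookup function so tests can inject values.
    /// In the binary this could be called as
    /// `WorkerConfig::from_lookup(|k| std::env::var(k).ok())`.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        let raw = lookup("TIKV_PD_ENDPOINTS")
            .ok_or_else(|| "TIKV_PD_ENDPOINTS is not set".to_string())?;
        // Assumption: endpoints are given as a comma-separated list.
        let pd_endpoints: Vec<String> = raw
            .split(',')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .map(String::from)
            .collect();
        if pd_endpoints.is_empty() {
            return Err("TIKV_PD_ENDPOINTS is empty".to_string());
        }
        // ROBOFLOW_LOG_LEVEL is a hypothetical variable name.
        let log_level = lookup("ROBOFLOW_LOG_LEVEL").unwrap_or_else(|| "info".to_string());
        Ok(WorkerConfig { pd_endpoints, log_level })
    }
}
```

Taking a lookup closure rather than reading `std::env` directly keeps the parsing logic testable without mutating process-wide environment state.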

2. Create Deployment Manifest

  1. Create deploy/k8s/deployment.yaml
  2. Define Deployment:
    • Name: roboflow-worker
    • Replicas: configurable
    • Pod template with worker container
  3. Resource requests/limits:
    • GPU: 1
    • CPU: 4 (request), 8 (limit)
    • Memory: 16GB (request), 32GB (limit)

3. Create ConfigMap

  1. Create deploy/k8s/configmap.yaml
  2. Contents:
    • TiKV PD endpoints
    • Input/output bucket names
    • Pipeline configuration
    • Checkpoint intervals
    • Logging level

4. Create Secret References

  1. Create deploy/k8s/secrets.yaml (template)
  2. Reference existing secrets for:
    • OSS credentials
    • TiKV credentials (if needed)
  3. Or use IRSA/workload identity

5. Create HorizontalPodAutoscaler

  1. Create deploy/k8s/hpa.yaml
  2. Scale based on:
    • Custom metric: pending_jobs in TiKV
    • Or: CPU utilization (simpler)
  3. Min replicas: 2
  4. Max replicas: configurable (default: 100)
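
To reason about how the pending-jobs metric would translate into replica counts, the arithmetic an HPA applies to an average-value target can be approximated as below. This is a simplified sketch: `desired_replicas` and `target_per_pod` are hypothetical names, and the real HPA also applies a tolerance band and stabilization windows that are not modeled here.

```rust
/// Approximate replica count for a queue-depth metric with an
/// average-value target: ceil(pending_jobs / target_per_pod), clamped
/// to the configured minimum and maximum.
pub fn desired_replicas(pending_jobs: u64, target_per_pod: u64, min: u64, max: u64) -> u64 {
    assert!(target_per_pod > 0, "target_per_pod must be positive");
    assert!(min <= max, "min must not exceed max");
    // Ceiling division without floating point or overflow.
    let raw = pending_jobs / target_per_pod + u64::from(pending_jobs % target_per_pod != 0);
    raw.clamp(min, max)
}
```

With the bounds proposed above (min 2, max 100) and an assumed target of 5 pending jobs per pod, 23 pending jobs would call for 5 replicas, while an empty queue would hold at the minimum of 2.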

6. Create PodDisruptionBudget

  1. Create deploy/k8s/pdb.yaml
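  2. Ensure at least N pods remain available during voluntary disruptions (e.g., node drains)
  3. Allow rolling updates without job loss (Deployment rollouts are bounded by the update strategy's maxUnavailable/maxSurge, not by the PDB)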

7. Create Scanner Deployment

  1. Create deploy/k8s/scanner.yaml
  2. Single replica (leader elected)
  3. Or: Include scanner in worker pods
  4. Recommendation: Separate deployment for clarity

8. Add Health Probes

  1. Liveness probe:
    • HTTP endpoint /health/live
    • Or: Check heartbeat age
  2. Readiness probe:
    • HTTP endpoint /health/ready
    • Ready when connected to TiKV
  3. Startup probe:
    • Allow time for initial connection
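
The state behind the two probe endpoints could look roughly like this. It is a sketch only: the `Health` type and its method names are hypothetical, the HTTP server that would serve /health/live and /health/ready is omitted, and the 200/503 status codes are an assumed convention.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Shared health state that probe handlers would read (hypothetical type).
pub struct Health {
    tikv_connected: AtomicBool,
    last_heartbeat: Mutex<Instant>,
    max_heartbeat_age: Duration,
}

impl Health {
    pub fn new(max_heartbeat_age: Duration) -> Self {
        Health {
            tikv_connected: AtomicBool::new(false),
            last_heartbeat: Mutex::new(Instant::now()),
            max_heartbeat_age,
        }
    }

    /// Called by the worker when its TiKV connection state changes.
    pub fn set_tikv_connected(&self, connected: bool) {
        self.tikv_connected.store(connected, Ordering::SeqCst);
    }

    /// Called by the worker loop on each iteration.
    pub fn beat(&self) {
        *self.last_heartbeat.lock().unwrap() = Instant::now();
    }

    /// Liveness: 200 while the last heartbeat is recent enough, else 503.
    pub fn live_status(&self, now: Instant) -> u16 {
        let last = *self.last_heartbeat.lock().unwrap();
        if now.saturating_duration_since(last) <= self.max_heartbeat_age {
            200
        } else {
            503
        }
    }

    /// Readiness: 200 only once the worker is connected to TiKV.
    pub fn ready_status(&self) -> u16 {
        if self.tikv_connected.load(Ordering::SeqCst) {
            200
        } else {
            503
        }
    }
}
```

Keeping liveness tied to heartbeat age and readiness tied to the TiKV connection means a worker that loses TiKV is marked not ready without being restarted, while a worker whose loop has stalled is restarted.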

9. Add Prometheus ServiceMonitor

  1. Create deploy/k8s/servicemonitor.yaml
  2. Scrape metrics from workers
  3. For use with kube-prometheus-stack
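
For the ServiceMonitor to have something to scrape, each worker needs to serve metrics in the Prometheus text exposition format. The following sketch renders gauges in that format by hand. The `Gauge` type, `render_metrics`, and metric names such as `roboflow_pending_jobs` are hypothetical, and a real worker would more likely use a Prometheus client crate than format strings.

```rust
/// One gauge sample. Metric names used with it are illustrative.
pub struct Gauge<'a> {
    pub name: &'a str,
    pub help: &'a str,
    pub value: f64,
}

/// Render gauges as Prometheus text exposition lines:
/// a HELP line, a TYPE line, then the sample itself.
pub fn render_metrics(gauges: &[Gauge]) -> String {
    let mut out = String::new();
    for g in gauges {
        out.push_str(&format!("# HELP {} {}\n", g.name, g.help));
        out.push_str(&format!("# TYPE {} gauge\n", g.name));
        out.push_str(&format!("{} {}\n", g.name, g.value));
    }
    out
}
```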

Kubernetes Resources

deploy/k8s/
├── namespace.yaml
├── configmap.yaml
├── secrets.yaml (template)
├── deployment.yaml (workers)
├── scanner.yaml (scanner)
├── service.yaml (metrics)
├── hpa.yaml
├── pdb.yaml
└── servicemonitor.yaml

Files to Create

  • deploy/k8s/namespace.yaml
  • deploy/k8s/configmap.yaml
  • deploy/k8s/secrets.yaml
  • deploy/k8s/deployment.yaml
  • deploy/k8s/scanner.yaml
  • deploy/k8s/service.yaml
  • deploy/k8s/hpa.yaml
  • deploy/k8s/pdb.yaml
  • deploy/k8s/servicemonitor.yaml

Files to Modify

  • src/bin/roboflow.rs (add worker subcommand)

Acceptance Criteria

  • roboflow worker command works
  • Deployment manifest creates pods
  • Workers connect to TiKV and claim jobs
  • ConfigMap configures workers
  • Secrets mounted correctly
  • HPA scales on pending jobs
  • PDB limits voluntary disruptions (e.g., node drains)
  • Health probes work
  • Rolling updates work
  • Integration test in K8s cluster
