feat: add long-running worker deployment with K8s manifests #82
Conversation
Add new `roboflow` binary with worker, scanner, and health subcommands for distributed processing. Workers claim jobs from the TiKV queue continuously instead of running as ephemeral K8s Jobs.

Changes:
- Add `src/bin/roboflow.rs`: CLI binary with worker/scanner/health commands
- Add `deploy/k8s/`: Kubernetes manifests for worker deployment
  - deployment.yaml: Worker pods with scanner sidecar
  - configmap.yaml: Full configuration via environment variables
  - namespace.yaml, secrets.yaml: Cluster resources
  - hpa.yaml: HorizontalPodAutoscaler (2-100 replicas)
  - pdb.yaml: PodDisruptionBudget for graceful updates
  - servicemonitor.yaml: Prometheus metrics scraping
  - scanner-standalone.yaml: Optional standalone scanner
  - README.md: Comprehensive deployment documentation

The worker subcommand integrates existing worker.rs functionality with HTTP health probes (/health/live, /health/ready) and graceful shutdown handling.
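A minimal sketch of how a CLI with these three subcommands could be wired up with `clap`; the struct names, field names, and default port below are illustrative, not taken from the PR:

```rust
use clap::{Parser, Subcommand};

/// Illustrative top-level CLI mirroring the PR's worker/scanner/health subcommands.
#[derive(Parser)]
#[command(name = "roboflow")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Claim and process jobs from the TiKV queue until shut down.
    Worker,
    /// Discover input files and enqueue jobs (with leader election).
    Scanner,
    /// Run only the HTTP health server, useful for probing in isolation.
    Health {
        /// Port for the health endpoints (the default here is an assumption).
        #[arg(long, default_value_t = 8080)]
        port: u16,
    },
}

fn main() {
    match Cli::parse().command {
        Command::Worker => println!("would run the worker job loop"),
        Command::Scanner => println!("would run the scanner loop"),
        Command::Health { port } => println!("would serve /health/live and /health/ready on port {port}"),
    }
}
```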
Fixes all critical and high-severity issues from code review:
- Add /metrics endpoint to health server (returns placeholder metrics)
- Log warning when SCANNER_FILE_PATTERN is invalid instead of silently failing
- Propagate health server bind failures to caller with startup verification
- Add HTTP robustness: timeouts, increased buffer, better error handling
- Change socket accept errors from debug to warn level for production visibility
- Fix pod ID generation consistency between worker and scanner
- Add hostname crate for consistent pod ID generation
- Add error context when creating storage backend (shows URL on failure)
- Log CRITICAL errors when job release fails during shutdown
- Return error on batch_put failure instead of silently skipping files
- Fix tokio spawn panic handling by tracking server task

Health server improvements:
- Startup confirmation channel with timeout
- Request timeout to prevent hanging connections
- Better request path parsing
- Prometheus /metrics endpoint with basic metrics
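A rough sketch of the "startup confirmation channel with timeout" idea, assuming a tokio-based server; the function name, timeout value, and address are illustrative rather than taken from the PR:

```rust
use std::time::Duration;
use tokio::net::TcpListener;
use tokio::sync::oneshot;
use tokio::time::timeout;

/// Spawn a stub health server and report bind success/failure back to the caller,
/// so a bad bind address fails the worker at startup instead of silently.
async fn spawn_health_server(addr: &str) -> std::io::Result<()> {
    let (ready_tx, ready_rx) = oneshot::channel();
    let addr = addr.to_string();

    tokio::spawn(async move {
        match TcpListener::bind(&addr).await {
            Ok(listener) => {
                // Confirm successful startup before entering the accept loop.
                let _ = ready_tx.send(Ok(()));
                loop {
                    match listener.accept().await {
                        Ok((_socket, _peer)) => {
                            // A real server would parse the request path here and
                            // answer /health/live, /health/ready, and /metrics.
                        }
                        // Accept errors surfaced at warn level for production visibility.
                        Err(e) => eprintln!("warn: health server accept failed: {e}"),
                    }
                }
            }
            // Propagate the bind failure to the caller.
            Err(e) => {
                let _ = ready_tx.send(Err(e));
            }
        }
    });

    // Wait for the confirmation, or give up after a startup timeout.
    match timeout(Duration::from_secs(5), ready_rx).await {
        Ok(Ok(result)) => result,
        Ok(Err(_)) | Err(_) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            "health server did not confirm startup",
        )),
    }
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // In a real worker this would run alongside the job-processing loop.
    spawn_health_server("0.0.0.0:8080").await
}
```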
Greptile Overview

Greptile Summary

Implemented long-running worker deployment with K8s manifests, transitioning from ephemeral jobs to continuous worker pods that claim jobs from the TiKV queue.

Key Changes:
Issues Found:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant Storage as S3/OSS Storage
participant Scanner as Scanner Pod
participant TiKV as TiKV Cluster
participant Worker as Worker Pod
Note over Scanner: Leader Election
Scanner->>TiKV: Acquire leadership lock
TiKV-->>Scanner: Lock acquired
loop Every scan_interval (60s)
Scanner->>Storage: List files in input_prefix
Storage-->>Scanner: File list
Scanner->>TiKV: Batch check existing jobs
TiKV-->>Scanner: Job status map
Scanner->>TiKV: Batch create jobs for new files
TiKV-->>Scanner: Jobs created
end
Note over Worker: Job Processing Loop
loop Every poll_interval (5s)
Worker->>TiKV: Query pending jobs
TiKV-->>Worker: Job record
Worker->>TiKV: CAS claim job (Pending→Processing)
TiKV-->>Worker: Claim successful
par Job Processing
Worker->>Storage: Read input file
Storage-->>Worker: MCAP/bag data
Worker->>Worker: Convert to LeRobot format
Worker->>Storage: Write output dataset
Storage-->>Worker: Write complete
and Heartbeat Loop
loop Every 30s
Worker->>TiKV: Send heartbeat
TiKV-->>Worker: Heartbeat recorded
end
and Checkpointing
loop Every 100 frames or 10s
Worker->>TiKV: Save checkpoint state
TiKV-->>Worker: Checkpoint saved
end
end
alt Success
Worker->>TiKV: Update job status (Complete)
TiKV-->>Worker: Status updated
else Failure
Worker->>TiKV: Update job status (Failed)
TiKV-->>Worker: Status updated
end
end
Note over Scanner,Worker: Health Probes
Worker->>Worker: /health/live :8080
Scanner->>Scanner: /health/live :8081
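The claim step in the diagram above hinges on an atomic status transition (Pending → Processing), so two workers can never process the same job. Below is a hedged sketch of that loop against a hypothetical `JobStore` trait; it is not the TiKV client API actually used in the PR:

```rust
use std::time::Duration;

#[derive(Clone, Copy, PartialEq, Eq)]
enum JobStatus {
    Pending,
    Processing,
    Complete,
    Failed,
}

/// Minimal job record; real records would carry input path, attempts, etc.
struct Job {
    id: String,
}

/// Hypothetical abstraction over the TiKV-backed job queue (not the PR's API).
trait JobStore {
    /// Return some job currently in Pending state, if any.
    fn next_pending(&self) -> Option<Job>;
    /// Atomically move a job from `expected` to `new` status; returns false
    /// if another worker won the race.
    fn compare_and_swap_status(&self, id: &str, expected: JobStatus, new: JobStatus) -> bool;
}

fn worker_loop(store: &dyn JobStore, poll_interval: Duration) {
    loop {
        match store.next_pending() {
            // Only the worker whose CAS succeeds owns the job.
            Some(job)
                if store.compare_and_swap_status(&job.id, JobStatus::Pending, JobStatus::Processing) =>
            {
                println!("claimed job {}", job.id);
                // Process the file, heartbeat, checkpoint, then CAS the status
                // to Complete or Failed, as the diagram above shows.
            }
            // No pending work, or another worker claimed it first: poll again later.
            _ => std::thread::sleep(poll_interval),
        }
    }
}

fn main() {
    // Wiring up a real TiKV-backed store is out of scope for this sketch.
}
```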
5 files reviewed, 1 comment
Use port 8081 for scanner health server instead of 8080
@greptile-apps please re-review
6 files reviewed, 1 comment
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Description
Implements long-running worker deployment for distributed data processing, moving from ephemeral K8s Jobs to continuous worker pods that claim jobs from a TiKV queue.
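To illustrate the graceful-shutdown side of a long-running worker pod, here is a small sketch assuming tokio's Unix signal handling; `ClaimedJobs`, `release_all`, and the log wording are hypothetical stand-ins, not the PR's actual code:

```rust
use tokio::signal::unix::{signal, SignalKind};

/// Hypothetical handle to the jobs this worker currently holds.
struct ClaimedJobs;

impl ClaimedJobs {
    async fn release_all(&self) -> Result<(), String> {
        // A real implementation would CAS each claimed job back to Pending in
        // TiKV so another worker can pick it up.
        Ok(())
    }
}

/// Wait for SIGTERM (sent by Kubernetes before pod deletion) and release
/// claimed jobs before exiting.
async fn shutdown_on_sigterm(claimed: ClaimedJobs) {
    let mut sigterm = signal(SignalKind::terminate()).expect("install SIGTERM handler");
    sigterm.recv().await;
    if let Err(e) = claimed.release_all().await {
        // Mirrors the "log CRITICAL when job release fails during shutdown" fix.
        eprintln!("CRITICAL: failed to release claimed jobs during shutdown: {e}");
    }
}

#[tokio::main]
async fn main() {
    shutdown_on_sigterm(ClaimedJobs).await;
}
```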
Type of Change
Related Issues
Fixes #18
Changes Made
New `roboflow` binary (src/bin/roboflow.rs)
- `worker` subcommand - claims and processes jobs from TiKV continuously
- `scanner` subcommand - discovers files and creates jobs with leader election
- `health` subcommand - standalone health server for testing
- Endpoints: /health/live, /health/ready, /health, /metrics

Kubernetes manifests (deploy/k8s/)
- deployment.yaml - Worker pods with scanner sidecar
- configmap.yaml - Full configuration via environment variables
- hpa.yaml - HorizontalPodAutoscaler (2-100 replicas)
- pdb.yaml - PodDisruptionBudget for graceful updates
- servicemonitor.yaml - Prometheus metrics scraping
- scanner-standalone.yaml - Optional standalone scanner deployment
- namespace.yaml, secrets.yaml - Cluster resources
- README.md - Comprehensive deployment documentation

Code quality improvements
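One of the code-quality fixes listed in the review follow-up is consistent pod ID generation between worker and scanner via the `hostname` crate. A minimal sketch of that idea follows; the PID-based fallback format is an assumption, not taken from the PR:

```rust
/// Derive a stable pod ID from the hostname (Kubernetes sets it to the pod name),
/// falling back to a PID-based ID when run outside a cluster.
fn pod_id() -> String {
    hostname::get()
        .ok()
        .and_then(|h| h.into_string().ok())
        .unwrap_or_else(|| format!("pod-{}", std::process::id()))
}

fn main() {
    println!("pod id: {}", pod_id());
}
```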
Testing
Built and tested with `--features distributed`.

Test Commands Used:
cargo build --bin roboflow --features distributed
cargo test --features distributed
cargo clippy --all-targets --all-features -- -D warnings

Checklist