Conversation


@zhexuany zhexuany commented Feb 1, 2026

Description

Implements long-running worker deployment for distributed data processing, moving from ephemeral K8s Jobs to continuous worker pods that claim jobs from a TiKV queue.

Type of Change

  • New feature (non-breaking change which adds functionality)

Related Issues

Fixes #18

Changes Made

  • New roboflow binary (src/bin/roboflow.rs)

    • worker subcommand - claims and processes jobs from TiKV continuously
    • scanner subcommand - discovers files and creates jobs with leader election
    • health subcommand - standalone health server for testing
    • HTTP health endpoints: /health/live, /health/ready, /health, /metrics
    • Graceful shutdown with signal handling
  • Kubernetes manifests (deploy/k8s/)

    • deployment.yaml - Worker pods with scanner sidecar
    • configmap.yaml - Full configuration via environment variables
    • hpa.yaml - HorizontalPodAutoscaler (2-100 replicas)
    • pdb.yaml - PodDisruptionBudget for graceful updates
    • servicemonitor.yaml - Prometheus metrics scraping
    • scanner-standalone.yaml - Optional standalone scanner deployment
    • namespace.yaml, secrets.yaml - Cluster resources
    • README.md - Comprehensive deployment documentation
  • Code quality improvements

    • Health server bind failures now propagate to caller
    • Job release errors logged at CRITICAL level
    • Scanner batch_put failures return error instead of silently skipping
    • Invalid file patterns log warnings
    • Storage URL errors include context
    • Consistent pod ID generation across worker and scanner
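The consistent pod ID generation mentioned above might be sketched as follows. This is a hypothetical illustration, not the actual code in src/bin/roboflow.rs (which uses the hostname crate): both worker and scanner would call the same helper, preferring the K8s-injected POD_NAME, then the HOSTNAME variable, then a process-scoped fallback.

```rust
use std::env;
use std::process;

// Hypothetical shared pod-ID helper: worker and scanner derive the same
// identifier by checking the same sources in the same order.
fn pod_id() -> String {
    env::var("POD_NAME")
        .or_else(|_| env::var("HOSTNAME"))
        .unwrap_or_else(|_| format!("roboflow-{}", process::id()))
}
```

Keeping this in one function is what makes the IDs consistent: any divergence between worker and scanner would reappear if each binary rolled its own fallback chain.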

Testing

  • All existing tests pass
  • Pre-commit hooks (fmt, clippy) pass
  • Binary compiles with --features distributed

Test Commands Used:

cargo build --bin roboflow --features distributed
cargo test --features distributed
cargo clippy --all-targets --all-features -- -D warnings

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code where necessary
  • I have updated the documentation accordingly
  • My changes generate no new warnings
  • New and existing tests pass locally

Add new `roboflow` binary with worker, scanner, and health subcommands
for distributed processing. Workers claim jobs continuously from the
TiKV queue instead of running as ephemeral K8s Jobs.
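The claim step described above (a worker taking a Pending job and marking it Processing) could be sketched like this. The Mutex-wrapped HashMap is only a stand-in to keep the sketch self-contained; the real implementation performs this transition atomically with a TiKV compare-and-swap, and the names here are illustrative.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Copy, PartialEq)]
enum JobState { Pending, Processing }

// A job moves Pending -> Processing only if it is still Pending when the
// worker looks; any other state means another worker got there first.
fn try_claim(jobs: &Mutex<HashMap<String, JobState>>, id: &str) -> bool {
    let mut jobs = jobs.lock().unwrap();
    match jobs.get(id) {
        Some(JobState::Pending) => {
            jobs.insert(id.to_string(), JobState::Processing);
            true
        }
        _ => false, // already claimed, finished, or unknown job
    }
}
```

The check-then-set must be one atomic operation (here the Mutex guard, in production the TiKV CAS); otherwise two workers polling the same queue could both claim one job.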

Changes:
- Add `src/bin/roboflow.rs`: CLI binary with worker/scanner/health commands
- Add `deploy/k8s/`: Kubernetes manifests for worker deployment
  - deployment.yaml: Worker pods with scanner sidecar
  - configmap.yaml: Full configuration via environment variables
  - namespace.yaml, secrets.yaml: Cluster resources
  - hpa.yaml: HorizontalPodAutoscaler (2-100 replicas)
  - pdb.yaml: PodDisruptionBudget for graceful updates
  - servicemonitor.yaml: Prometheus metrics scraping
  - scanner-standalone.yaml: Optional standalone scanner
  - README.md: Comprehensive deployment documentation

The worker subcommand integrates existing worker.rs functionality
with HTTP health probes (/health/live, /health/ready) and
graceful shutdown handling.
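The endpoint behavior described above might route roughly as follows; this is a minimal sketch and the actual handler in src/bin/roboflow.rs may differ. Liveness always succeeds, readiness is gated on worker state, and /metrics serves the placeholder body until real metrics are wired in.

```rust
// Illustrative health-server routing: maps a request path plus the
// current readiness flag to an HTTP status code and body.
fn route(path: &str, ready: bool) -> (u16, &'static str) {
    match path {
        "/health/live" => (200, "ok"),
        "/health/ready" if ready => (200, "ready"),
        "/health/ready" => (503, "not ready"),
        "/health" => (200, "ok"),
        "/metrics" => (200, "# placeholder metrics\n"),
        _ => (404, "not found"),
    }
}
```

Separating liveness from readiness matters for K8s: a failed readiness probe removes the pod from service endpoints, while a failed liveness probe restarts it.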

Fixes all critical and high-severity issues from code review:

- Add /metrics endpoint to health server (returns placeholder metrics)
- Log warning when SCANNER_FILE_PATTERN is invalid instead of silently failing
- Propagate health server bind failures to caller with startup verification
- Add HTTP robustness: timeouts, increased buffer, better error handling
- Change socket accept errors from debug to warn level for production visibility
- Fix pod ID generation consistency between worker and scanner
- Add hostname crate for consistent pod ID generation
- Add error context when creating storage backend (shows URL on failure)
- Log CRITICAL errors when job release fails during shutdown
- Return error on batch_put failure instead of silently skipping files
- Fix tokio spawn panic handling by tracking server task

Health server improvements:
- Startup confirmation channel with timeout
- Request timeout to prevent hanging connections
- Better request path parsing
- Prometheus /metrics endpoint with basic metrics
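The "startup confirmation channel with timeout" idea above could be sketched like this: the server thread reports whether its bind succeeded, and the caller waits a bounded time instead of assuming the spawn worked. The real code is async (tokio) and binds an actual TCP listener; this sketch uses std threads and channels to stay self-contained, with an empty address standing in for a bind failure.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// The spawned thread sends its bind result over a channel; the caller
// blocks on recv_timeout so a hung or panicked server thread surfaces
// as a startup error rather than a silent failure.
fn start_server(addr: &str) -> Result<(), String> {
    let (tx, rx) = mpsc::channel();
    let addr = addr.to_string();
    thread::spawn(move || {
        // Simulated bind: an empty address stands in for a bind failure.
        let bind_result = if addr.is_empty() {
            Err("bind failed".to_string())
        } else {
            Ok(())
        };
        let _ = tx.send(bind_result);
        // ...on success, the real server would accept connections here...
    });
    match rx.recv_timeout(Duration::from_secs(5)) {
        Ok(result) => result,
        Err(_) => Err("server did not confirm startup in time".to_string()),
    }
}
```

This is what "propagate health server bind failures to caller" buys: a pod that cannot bind its health port fails fast at startup instead of running unprobed.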

greptile-apps bot commented Feb 1, 2026

Greptile Overview

Greptile Summary

Implemented long-running worker deployment with K8s manifests, transitioning from ephemeral K8s Jobs to continuous worker pods that claim jobs from the TiKV queue.

Key Changes:

  • New roboflow binary with worker, scanner, and health subcommands for distributed processing
  • K8s deployment with worker pods and scanner sidecar containers using leader election
  • Health endpoints (/health/live, /health/ready, /metrics) with startup validation and graceful shutdown
  • Error handling improvements: Job release failures now logged at CRITICAL level; scanner batch_put failures now return errors instead of silently skipping files
  • Port conflict resolution: Scanner sidecar now runs on port 8081 to avoid conflict with worker on 8080
  • Production-ready manifests: HPA (2-100 replicas), PodDisruptionBudget, ServiceMonitor, anti-affinity rules, and comprehensive configuration via ConfigMap
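The batch_put error-propagation fix noted above might look roughly like this sketch. Here `batch_put` is a stand-in closure for the TiKV client call and the chunk size of 100 is illustrative; the point is that a failed write aborts the scan with an error rather than silently dropping the batch.

```rust
// Propagate the first batch_put failure instead of skipping the batch:
// the `?` aborts the loop and returns the error with added context.
fn enqueue_files(
    batch_put: impl Fn(&[String]) -> Result<(), String>,
    files: &[String],
) -> Result<usize, String> {
    for chunk in files.chunks(100) {
        batch_put(chunk).map_err(|e| format!("batch_put failed: {e}"))?;
    }
    Ok(files.len())
}
```

Returning the error lets the scanner retry the whole scan on the next interval, so no discovered file is permanently lost to a transient write failure.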

Issues Found:

  • Minor documentation inconsistency: README architecture diagram shows scanner on port 8080 instead of 8081 (already fixed in deployment.yaml)

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • Score reflects a well-structured implementation with proper error handling; the port conflict from the previous review has been resolved, the PR description documents testing, and only one minor documentation fix is needed
  • deploy/k8s/README.md requires port number correction in architecture diagram

Important Files Changed

  • src/bin/roboflow.rs: New CLI binary with worker/scanner/health subcommands, health server, and graceful shutdown handling
  • deploy/k8s/deployment.yaml: Worker deployment with scanner sidecar on separate ports (8080/8081), health probes configured
  • deploy/k8s/configmap.yaml: Configuration with TiKV endpoints, worker/scanner settings, and environment variables
  • deploy/k8s/README.md: Deployment documentation whose architecture diagram shows the incorrect scanner port (8080 instead of 8081)
  • crates/roboflow-distributed/src/worker.rs: Improved error handling for job release failures during shutdown, with CRITICAL logging
  • crates/roboflow-distributed/src/scanner.rs: batch_put failures now return errors instead of silently continuing, preventing skipped files

Sequence Diagram

sequenceDiagram
    participant Storage as S3/OSS Storage
    participant Scanner as Scanner Pod
    participant TiKV as TiKV Cluster
    participant Worker as Worker Pod
    
    Note over Scanner: Leader Election
    Scanner->>TiKV: Acquire leadership lock
    TiKV-->>Scanner: Lock acquired
    
    loop Every scan_interval (60s)
        Scanner->>Storage: List files in input_prefix
        Storage-->>Scanner: File list
        Scanner->>TiKV: Batch check existing jobs
        TiKV-->>Scanner: Job status map
        Scanner->>TiKV: Batch create jobs for new files
        TiKV-->>Scanner: Jobs created
    end
    
    Note over Worker: Job Processing Loop
    loop Every poll_interval (5s)
        Worker->>TiKV: Query pending jobs
        TiKV-->>Worker: Job record
        Worker->>TiKV: CAS claim job (Pending→Processing)
        TiKV-->>Worker: Claim successful
        
        par Job Processing
            Worker->>Storage: Read input file
            Storage-->>Worker: MCAP/bag data
            Worker->>Worker: Convert to LeRobot format
            Worker->>Storage: Write output dataset
            Storage-->>Worker: Write complete
        and Heartbeat Loop
            loop Every 30s
                Worker->>TiKV: Send heartbeat
                TiKV-->>Worker: Heartbeat recorded
            end
        and Checkpointing
            loop Every 100 frames or 10s
                Worker->>TiKV: Save checkpoint state
                TiKV-->>Worker: Checkpoint saved
            end
        end
        
        alt Success
            Worker->>TiKV: Update job status (Complete)
            TiKV-->>Worker: Status updated
        else Failure
            Worker->>TiKV: Update job status (Failed)
            TiKV-->>Worker: Status updated
        end
    end
    
    Note over Scanner,Worker: Health Probes
    Worker->>Worker: /health/live :8080
    Scanner->>Scanner: /health/live :8081


@greptile-apps greptile-apps bot left a comment


5 files reviewed, 1 comment



zhexuany commented Feb 2, 2026

@greptile-apps please re-review


@greptile-apps greptile-apps bot left a comment


6 files reviewed, 1 comment


Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@zhexuany zhexuany merged commit fb72fba into main Feb 2, 2026
3 checks passed
@zhexuany zhexuany deleted the feat/worker-deployment branch February 2, 2026 01:50

Development

Successfully merging this pull request may close these issues.

[Phase 9.1] Implement long-running Worker Deployment [READY TO START]