Conversation


@zhexuany zhexuany commented Feb 1, 2026

Description

Implements long-running worker deployment for distributed data processing, moving from ephemeral K8s Jobs to continuous worker pods that claim jobs from a TiKV queue.

Type of Change

  • New feature (non-breaking change which adds functionality)

Related Issues

Fixes #18

Changes Made

  • New roboflow binary (src/bin/roboflow.rs)

    • worker subcommand - claims and processes jobs from TiKV continuously
    • scanner subcommand - discovers files and creates jobs with leader election
    • health subcommand - standalone health server for testing
    • HTTP health endpoints: /health/live, /health/ready, /health, /metrics
    • Graceful shutdown with signal handling
  • Kubernetes manifests (deploy/k8s/)

    • deployment.yaml - Worker pods with scanner sidecar
    • configmap.yaml - Full configuration via environment variables
    • hpa.yaml - HorizontalPodAutoscaler (2-100 replicas)
    • pdb.yaml - PodDisruptionBudget for graceful updates
    • servicemonitor.yaml - Prometheus metrics scraping
    • scanner-standalone.yaml - Optional standalone scanner deployment
    • namespace.yaml, secrets.yaml - Cluster resources
    • README.md - Comprehensive deployment documentation
  • Code quality improvements

    • Health server bind failures now propagate to caller
    • Job release errors logged at CRITICAL level
    • Scanner batch_put failures return error instead of silently skipping
    • Invalid file patterns log warnings
    • Storage URL errors include context
    • Consistent pod ID generation across worker and scanner
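The consistent pod ID generation mentioned above might be sketched as follows. This is a hypothetical illustration, not the actual code in src/bin/roboflow.rs (which uses the hostname crate): both worker and scanner would call the same helper, preferring the K8s-injected POD_NAME, then the HOSTNAME variable, then a process-scoped fallback.

```rust
use std::env;
use std::process;

// Hypothetical shared pod-ID helper: worker and scanner derive the same
// identifier by checking the same sources in the same order.
fn pod_id() -> String {
    env::var("POD_NAME")
        .or_else(|_| env::var("HOSTNAME"))
        .unwrap_or_else(|_| format!("roboflow-{}", process::id()))
}
```

Keeping this in one function is what makes the IDs consistent: any divergence between worker and scanner would reappear if each binary rolled its own fallback chain.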

Testing

  • All existing tests pass
  • Pre-commit hooks (fmt, clippy) pass
  • Binary compiles with --features distributed

Test Commands Used:

cargo build --bin roboflow --features distributed
cargo test --features distributed
cargo clippy --all-targets --all-features -- -D warnings

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code where necessary
  • I have updated the documentation accordingly
  • My changes generate no new warnings
  • New and existing tests pass locally

Add new `roboflow` binary with worker, scanner, and health subcommands
for distributed processing. Workers claim jobs continuously from the
TiKV queue instead of running as ephemeral K8s Jobs.
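The claim step described above (a worker taking a Pending job and marking it Processing) could be sketched like this. The Mutex-wrapped HashMap is only a stand-in to keep the sketch self-contained; the real implementation performs this transition atomically with a TiKV compare-and-swap, and the names here are illustrative.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Copy, PartialEq)]
enum JobState { Pending, Processing }

// A job moves Pending -> Processing only if it is still Pending when the
// worker looks; any other state means another worker got there first.
fn try_claim(jobs: &Mutex<HashMap<String, JobState>>, id: &str) -> bool {
    let mut jobs = jobs.lock().unwrap();
    match jobs.get(id) {
        Some(JobState::Pending) => {
            jobs.insert(id.to_string(), JobState::Processing);
            true
        }
        _ => false, // already claimed, finished, or unknown job
    }
}
```

The check-then-set must be one atomic operation (here the Mutex guard, in production the TiKV CAS); otherwise two workers polling the same queue could both claim one job.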

Changes:
- Add `src/bin/roboflow.rs`: CLI binary with worker/scanner/health commands
- Add `deploy/k8s/`: Kubernetes manifests for worker deployment
  - deployment.yaml: Worker pods with scanner sidecar
  - configmap.yaml: Full configuration via environment variables
  - namespace.yaml, secrets.yaml: Cluster resources
  - hpa.yaml: HorizontalPodAutoscaler (2-100 replicas)
  - pdb.yaml: PodDisruptionBudget for graceful updates
  - servicemonitor.yaml: Prometheus metrics scraping
  - scanner-standalone.yaml: Optional standalone scanner
  - README.md: Comprehensive deployment documentation

The worker subcommand integrates existing worker.rs functionality
with HTTP health probes (/health/live, /health/ready) and
graceful shutdown handling.
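The endpoint behavior described above might route roughly as follows; this is a minimal sketch and the actual handler in src/bin/roboflow.rs may differ. Liveness always succeeds, readiness is gated on worker state, and /metrics serves the placeholder body until real metrics are wired in.

```rust
// Illustrative health-server routing: maps a request path plus the
// current readiness flag to an HTTP status code and body.
fn route(path: &str, ready: bool) -> (u16, &'static str) {
    match path {
        "/health/live" => (200, "ok"),
        "/health/ready" if ready => (200, "ready"),
        "/health/ready" => (503, "not ready"),
        "/health" => (200, "ok"),
        "/metrics" => (200, "# placeholder metrics\n"),
        _ => (404, "not found"),
    }
}
```

Separating liveness from readiness matters for K8s: a failed readiness probe removes the pod from service endpoints, while a failed liveness probe restarts it.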

Fixes all critical and high-severity issues from code review:

- Add /metrics endpoint to health server (returns placeholder metrics)
- Log warning when SCANNER_FILE_PATTERN is invalid instead of silently failing
- Propagate health server bind failures to caller with startup verification
- Add HTTP robustness: timeouts, increased buffer, better error handling
- Change socket accept errors from debug to warn level for production visibility
- Fix pod ID generation consistency between worker and scanner
- Add hostname crate for consistent pod ID generation
- Add error context when creating storage backend (shows URL on failure)
- Log CRITICAL errors when job release fails during shutdown
- Return error on batch_put failure instead of silently skipping files
- Fix tokio spawn panic handling by tracking server task

Health server improvements:
- Startup confirmation channel with timeout
- Request timeout to prevent hanging connections
- Better request path parsing
- Prometheus /metrics endpoint with basic metrics
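The "startup confirmation channel with timeout" idea above could be sketched like this: the server thread reports whether its bind succeeded, and the caller waits a bounded time instead of assuming the spawn worked. The real code is async (tokio) and binds an actual TCP listener; this sketch uses std threads and channels to stay self-contained, with an empty address standing in for a bind failure.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// The spawned thread sends its bind result over a channel; the caller
// blocks on recv_timeout so a hung or panicked server thread surfaces
// as a startup error rather than a silent failure.
fn start_server(addr: &str) -> Result<(), String> {
    let (tx, rx) = mpsc::channel();
    let addr = addr.to_string();
    thread::spawn(move || {
        // Simulated bind: an empty address stands in for a bind failure.
        let bind_result = if addr.is_empty() {
            Err("bind failed".to_string())
        } else {
            Ok(())
        };
        let _ = tx.send(bind_result);
        // ...on success, the real server would accept connections here...
    });
    match rx.recv_timeout(Duration::from_secs(5)) {
        Ok(result) => result,
        Err(_) => Err("server did not confirm startup in time".to_string()),
    }
}
```

This is what "propagate health server bind failures to caller" buys: a pod that cannot bind its health port fails fast at startup instead of running unprobed.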

greptile-apps bot commented Feb 1, 2026

Greptile Overview

Greptile Summary

Implemented long-running worker deployment with K8s manifests, transitioning from ephemeral K8s Jobs to continuous worker pods that claim jobs from the TiKV queue.

Key Changes:

  • New roboflow binary with worker, scanner, and health subcommands for distributed processing
  • K8s deployment with worker pods and scanner sidecar containers using leader election
  • Health endpoints (/health/live, /health/ready, /metrics) with startup validation and graceful shutdown
  • Error handling improvements: Job release failures now logged at CRITICAL level; scanner batch_put failures now return errors instead of silently skipping files
  • Port conflict resolution: Scanner sidecar now runs on port 8081 to avoid conflict with worker on 8080
  • Production-ready manifests: HPA (2-100 replicas), PodDisruptionBudget, ServiceMonitor, anti-affinity rules, and comprehensive configuration via ConfigMap
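The batch_put error-propagation fix noted above might look roughly like this sketch. Here `batch_put` is a stand-in closure for the TiKV client call and the chunk size of 100 is illustrative; the point is that a failed write aborts the scan with an error rather than silently dropping the batch.

```rust
// Propagate the first batch_put failure instead of skipping the batch:
// the `?` aborts the loop and returns the error with added context.
fn enqueue_files(
    batch_put: impl Fn(&[String]) -> Result<(), String>,
    files: &[String],
) -> Result<usize, String> {
    for chunk in files.chunks(100) {
        batch_put(chunk).map_err(|e| format!("batch_put failed: {e}"))?;
    }
    Ok(files.len())
}
```

Returning the error lets the scanner retry the whole scan on the next interval, so no discovered file is permanently lost to a transient write failure.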

Issues Found:

  • Minor documentation inconsistency: README architecture diagram shows scanner on port 8080 instead of 8081 (already fixed in deployment.yaml)

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • Score reflects a well-structured implementation with proper error handling; the port conflict from the previous review has been resolved, the PR description documents testing, and only one minor documentation fix is needed
  • deploy/k8s/README.md requires port number correction in architecture diagram

Important Files Changed

  • src/bin/roboflow.rs: New CLI binary with worker/scanner/health subcommands, health server, and graceful shutdown handling
  • deploy/k8s/deployment.yaml: Worker deployment with scanner sidecar on separate ports (8080/8081), health probes configured
  • deploy/k8s/configmap.yaml: Configuration with TiKV endpoints, worker/scanner settings, and environment variables
  • deploy/k8s/README.md: Deployment documentation whose architecture diagram shows the incorrect scanner port (8080 instead of 8081)
  • crates/roboflow-distributed/src/worker.rs: Improved error handling for job release failures during shutdown, with CRITICAL logging
  • crates/roboflow-distributed/src/scanner.rs: batch_put failures now return errors instead of silently continuing, preventing skipped files

Sequence Diagram

sequenceDiagram
    participant Storage as S3/OSS Storage
    participant Scanner as Scanner Pod
    participant TiKV as TiKV Cluster
    participant Worker as Worker Pod
    
    Note over Scanner: Leader Election
    Scanner->>TiKV: Acquire leadership lock
    TiKV-->>Scanner: Lock acquired
    
    loop Every scan_interval (60s)
        Scanner->>Storage: List files in input_prefix
        Storage-->>Scanner: File list
        Scanner->>TiKV: Batch check existing jobs
        TiKV-->>Scanner: Job status map
        Scanner->>TiKV: Batch create jobs for new files
        TiKV-->>Scanner: Jobs created
    end
    
    Note over Worker: Job Processing Loop
    loop Every poll_interval (5s)
        Worker->>TiKV: Query pending jobs
        TiKV-->>Worker: Job record
        Worker->>TiKV: CAS claim job (Pending→Processing)
        TiKV-->>Worker: Claim successful
        
        par Job Processing
            Worker->>Storage: Read input file
            Storage-->>Worker: MCAP/bag data
            Worker->>Worker: Convert to LeRobot format
            Worker->>Storage: Write output dataset
            Storage-->>Worker: Write complete
        and Heartbeat Loop
            loop Every 30s
                Worker->>TiKV: Send heartbeat
                TiKV-->>Worker: Heartbeat recorded
            end
        and Checkpointing
            loop Every 100 frames or 10s
                Worker->>TiKV: Save checkpoint state
                TiKV-->>Worker: Checkpoint saved
            end
        end
        
        alt Success
            Worker->>TiKV: Update job status (Complete)
            TiKV-->>Worker: Status updated
        else Failure
            Worker->>TiKV: Update job status (Failed)
            TiKV-->>Worker: Status updated
        end
    end
    
    Note over Scanner,Worker: Health Probes
    Worker->>Worker: /health/live :8080
    Scanner->>Scanner: /health/live :8081


@greptile-apps greptile-apps bot left a comment


5 files reviewed, 1 comment



zhexuany commented Feb 2, 2026

@greptile-apps please re-review


@greptile-apps greptile-apps bot left a comment


6 files reviewed, 1 comment


Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@zhexuany zhexuany merged commit fb72fba into main Feb 2, 2026
3 checks passed
@zhexuany zhexuany deleted the feat/worker-deployment branch February 2, 2026 01:50

Development

Successfully merging this pull request may close these issues.

[Phase 9.1] Implement long-running Worker Deployment [READY TO START]