[Phase 7.2] Add graceful shutdown handling [READY 🔥] #48

@zhexuany

Summary

Implement graceful shutdown for distributed workers to properly release jobs and clean up state on SIGTERM/SIGINT.

Priority: HIGH 🔥

READY TO START - Dependencies met. This enables K8s deployment (#18).

Dependencies

Enables

Design

Shutdown Sequence

  1. Receive SIGTERM or SIGINT
  2. Set shutdown flag
  3. Stop accepting new jobs
  4. Complete current checkpoint (not entire job)
  5. Release job back to Pending
  6. Clear heartbeat
  7. Exit cleanly

Timeout

  • Force exit after 30 seconds
  • Log warning if forced

Tasks

1. Define Shutdown Handler

  1. Create crates/roboflow-distributed/src/shutdown.rs
  2. Define ShutdownHandler:
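    use std::sync::{atomic::AtomicBool, Arc};
    use tokio::sync::broadcast;

    pub struct ShutdownHandler {
        shutdown_flag: Arc<AtomicBool>,     // polled by the worker loop
        shutdown_tx: broadcast::Sender<()>, // notifies subscribed tasks
    }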
  3. Signal registration for SIGTERM, SIGINT
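
A minimal sketch of the flag half of the handler, using only the standard library. The `broadcast::Sender` field is left out so the snippet compiles without tokio. The type name `ShutdownFlag` and the method `request` are assumptions made for illustration; only `is_requested` appears elsewhere in this issue.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Cloneable handle around the shared shutdown flag (hypothetical name).
#[derive(Clone, Default)]
pub struct ShutdownFlag {
    flag: Arc<AtomicBool>,
}

impl ShutdownFlag {
    pub fn new() -> Self {
        Self::default()
    }

    /// Marks shutdown as requested. Returns true only for the first caller,
    /// so "Shutdown requested" can be logged exactly once.
    pub fn request(&self) -> bool {
        !self.flag.swap(true, Ordering::SeqCst)
    }

    /// True once any clone of this handle has called `request`.
    pub fn is_requested(&self) -> bool {
        self.flag.load(Ordering::SeqCst)
    }
}
```

Because every clone shares the same `Arc<AtomicBool>`, the signal task, the worker loop, and the pipeline callback can each hold their own copy.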

2. Implement Signal Handler

  1. Use the tokio::signal module (SIGTERM via tokio::signal::unix)
  2. On signal:
    • Set shutdown_flag
    • Send to shutdown_tx channel
    • Log "Shutdown requested"

3. Integrate with Worker Loop

  1. Check shutdown_flag in main loop:
    loop {
        if shutdown.is_requested() {
            break;
        }
        // ... claim and process jobs
    }
  2. Don't claim new jobs after shutdown
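
The loop above could be sketched as follows, with an in-memory queue standing in for the real job store. `run_worker`, the `u32` job IDs, and the `process` callback are all hypothetical; the point is only that the flag is checked before each claim.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};

/// Claims jobs until the queue is empty or shutdown is requested.
/// Returns the IDs of the jobs that were processed.
pub fn run_worker(
    queue: &mut VecDeque<u32>,
    shutdown: &AtomicBool,
    mut process: impl FnMut(u32),
) -> Vec<u32> {
    let mut done = Vec::new();
    loop {
        // Check before claiming, so no new job is taken after shutdown.
        if shutdown.load(Ordering::SeqCst) {
            break;
        }
        let job = match queue.pop_front() {
            Some(job) => job,
            None => break, // nothing left to claim
        };
        process(job);
        done.push(job);
    }
    done
}
```

Jobs that were never claimed stay in the queue untouched, which is the behavior step 2 asks for.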

4. Integrate with Pipeline

  1. Add shutdown check to ProgressCallback
  2. On shutdown during processing:
    • Complete current batch (checkpoint boundary)
    • Save checkpoint
    • Return early with Interrupted error
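
One way to express "finish the current batch, save, then return Interrupted" is sketched below. The real ProgressCallback signature is not shown in this issue, so `run_pipeline`, `PipelineError`, and the in-memory checkpoint counter are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq)]
pub enum PipelineError {
    /// Processing stopped early; `checkpoint` is the number of batches saved.
    Interrupted { checkpoint: usize },
}

/// Processes `batches` starting at index `resume_from` (must be <= len).
/// Returns the final checkpoint, or `Interrupted` if shutdown was requested
/// while batches were still outstanding.
pub fn run_pipeline(
    batches: &[u32],
    resume_from: usize,
    shutdown: &AtomicBool,
    mut on_batch: impl FnMut(u32),
) -> Result<usize, PipelineError> {
    let mut checkpoint = resume_from;
    for &batch in &batches[resume_from..] {
        on_batch(batch); // the current batch always runs to completion
        checkpoint += 1; // "save checkpoint" (kept in memory in this sketch)
        // Only check the flag at a checkpoint boundary.
        if shutdown.load(Ordering::SeqCst) && checkpoint < batches.len() {
            return Err(PipelineError::Interrupted { checkpoint });
        }
    }
    Ok(checkpoint)
}
```

A later worker can pass the returned `checkpoint` back as `resume_from` to continue where the interrupted run stopped.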

5. Implement Job Release

  1. On shutdown while processing:
    • Save final checkpoint
    • Update job: status=Pending, owner=null
    • Preserve checkpoint for next worker
  2. Log job release
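
The release step might look like the sketch below. `JobRecord`, `JobStatus`, and `release_job` are hypothetical stand-ins for the real job schema. The ownership check is an added assumption, included so that a worker cannot release a job that another pod has since claimed.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum JobStatus {
    Pending,
    Processing,
}

#[derive(Debug, Clone, PartialEq)]
pub struct JobRecord {
    pub status: JobStatus,
    pub owner: Option<String>,
    pub checkpoint: Option<u64>,
}

/// Releases a job held by `pod_id` back to Pending, keeping the checkpoint.
/// Returns false (and changes nothing) if the job is not Processing or is
/// owned by a different pod.
pub fn release_job(job: &mut JobRecord, pod_id: &str, final_checkpoint: u64) -> bool {
    let owned = job.owner.as_deref() == Some(pod_id);
    if !owned || job.status != JobStatus::Processing {
        return false;
    }
    job.checkpoint = Some(final_checkpoint); // preserved for the next worker
    job.status = JobStatus::Pending;
    job.owner = None;
    true
}
```

In a real store this update would presumably need to be atomic (for example a compare-and-set on the owner field); the sketch only shows the field changes.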

6. Implement Heartbeat Cleanup

  1. On shutdown:
    • Stop heartbeat thread
    • Delete /heartbeat/{pod_id} key
  2. Prevent false zombie detection
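
A sketch of the cleanup ordering, with a `Mutex<HashMap>` standing in for the real key-value store. `Heartbeat`, `Store`, and the beat counter are hypothetical. The detail worth showing is that the thread is joined before the key is deleted, so a late beat cannot re-create it.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Stand-in for the real key-value store (assumption).
pub type Store = Arc<Mutex<HashMap<String, u64>>>;

pub struct Heartbeat {
    key: String,
    store: Store,
    stop: Arc<AtomicBool>,
    handle: JoinHandle<()>,
}

impl Heartbeat {
    /// Spawns a thread that rewrites /heartbeat/{pod_id} every `interval`.
    pub fn start(store: Store, pod_id: &str, interval: Duration) -> Self {
        let key = format!("/heartbeat/{}", pod_id);
        let stop = Arc::new(AtomicBool::new(false));
        let (k, s, st) = (key.clone(), store.clone(), stop.clone());
        let handle = thread::spawn(move || {
            let mut beat = 0u64;
            while !st.load(Ordering::SeqCst) {
                beat += 1;
                s.lock().unwrap().insert(k.clone(), beat);
                thread::sleep(interval);
            }
        });
        Self { key, store, stop, handle }
    }

    /// Stops the thread first, then deletes the key.
    pub fn shutdown(self) {
        self.stop.store(true, Ordering::SeqCst);
        self.handle.join().expect("heartbeat thread panicked");
        self.store.lock().unwrap().remove(&self.key);
    }
}
```

Note that `shutdown` blocks for up to one `interval` while the thread finishes its sleep, which counts against the overall shutdown timeout.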

7. Implement Timeout

  1. Start timeout timer on shutdown signal
  2. After 30 seconds: force exit
  3. Log warning with current state
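
The timeout can be sketched with a helper thread and a channel, using only the standard library. `shutdown_with_timeout` and `ShutdownOutcome` are hypothetical names, and the real worker would likely use tokio's timer instead. The sketch returns an outcome rather than exiting, so the caller decides how to force the exit.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum ShutdownOutcome {
    Clean,
    Forced,
}

/// Runs `cleanup` on a helper thread and waits up to `grace` for it to finish.
/// Also reports `Forced` if `cleanup` panics, because the sender is then
/// dropped without sending.
pub fn shutdown_with_timeout<F>(cleanup: F, grace: Duration) -> ShutdownOutcome
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        cleanup();
        let _ = done_tx.send(()); // receiver may already have given up
    });
    match done_rx.recv_timeout(grace) {
        Ok(()) => ShutdownOutcome::Clean,
        Err(_) => {
            eprintln!("warning: graceful shutdown exceeded {:?}, forcing exit", grace);
            ShutdownOutcome::Forced
        }
    }
}
```

A caller would pass `Duration::from_secs(30)` and, on `Forced`, call `std::process::exit` with a non-zero code after logging the current state.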

8. K8s Integration

  1. K8s sends SIGTERM first
  2. Then waits up to terminationGracePeriodSeconds (default: 30s)
  3. Then sends SIGKILL
  4. Set terminationGracePeriodSeconds >= 35s so the 30-second forced exit fires before SIGKILL

Files to Create

  • crates/roboflow-distributed/src/shutdown.rs

Files to Modify

  • crates/roboflow-distributed/src/worker.rs
  • crates/roboflow-distributed/src/heartbeat.rs
  • crates/roboflow-dataset/src/streaming/converter.rs (add shutdown check callback)

Acceptance Criteria

  • Signal handler registered
  • Shutdown flag propagates
  • Worker stops accepting jobs
  • Pipeline exits at checkpoint boundary
  • Job released to Pending
  • Heartbeat cleaned up
  • Timeout forces exit
  • Integration test: send SIGTERM during processing
