[Phase 7.2] Add graceful shutdown handling [READY 🔥]

## Summary

Implement graceful shutdown for distributed workers to properly release jobs and clean up state on SIGTERM/SIGINT.

## Priority: HIGH 🔥

**READY TO START** - Dependencies met. This enables K8s deployment (#18).

## Dependencies

- ✅ Worker Loop - COMPLETE
- ✅ Heartbeat - COMPLETE
- ✅ LerobotWriter Integration (#72) - COMPLETE
- ✅ Checkpoint Save (#73) - COMPLETE

## Enables

- ➡️ #18 (K8s Worker Deployment)
- ➡️ #20 (Container Image and Helm Chart)

## Design

### Shutdown Sequence

1. Receive SIGTERM or SIGINT
2. Set shutdown flag
3. Stop accepting new jobs
4. Complete current checkpoint (not entire job)
5. Release job back to Pending
6. Clear heartbeat
7. Exit cleanly

### Timeout

- Force exit 30 seconds after the shutdown signal if cleanup has not finished
- Log a warning when the exit is forced

## Tasks

### 1. Define Shutdown Handler

1. Create `crates/roboflow-distributed/src/shutdown.rs`
2. Define `ShutdownHandler`:
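   ```rust
   use std::sync::atomic::AtomicBool;
   use std::sync::Arc;
   use tokio::sync::broadcast;

   pub struct ShutdownHandler {
       // Set once on the first signal; polled by the worker loop.
       shutdown_flag: Arc<AtomicBool>,
       // Wakes async tasks that are waiting on shutdown.
       shutdown_tx: broadcast::Sender<()>,
   }
   ```
3. Register handlers for SIGTERM and SIGINT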

### 2. Implement Signal Handler

1. Use the `tokio::signal` module (part of the `tokio` crate; requires the `signal` feature)
2. On signal:
   - Set shutdown_flag
   - Send to shutdown_tx channel
   - Log "Shutdown requested"
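
The "on signal" actions above can be sketched without the async runtime. This is a minimal, std-only illustration, not the planned implementation: `ShutdownNotifier`, `subscribe`, and `request` are hypothetical names, and `std::sync::mpsc` senders stand in for the `broadcast::Sender` in `ShutdownHandler`. The signal listener itself (via `tokio::signal`) is assumed to call `request()`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::{Arc, Mutex};

/// Std-only stand-in for the handler (hypothetical type).
pub struct ShutdownNotifier {
    flag: Arc<AtomicBool>,
    subscribers: Mutex<Vec<Sender<()>>>,
}

impl ShutdownNotifier {
    pub fn new() -> Self {
        Self {
            flag: Arc::new(AtomicBool::new(false)),
            subscribers: Mutex::new(Vec::new()),
        }
    }

    /// Each interested task gets its own receiver.
    pub fn subscribe(&self) -> Receiver<()> {
        let (tx, rx) = mpsc::channel();
        self.subscribers.lock().unwrap().push(tx);
        rx
    }

    /// Performs the "on signal" actions. Returns true only for the first
    /// request, so a repeated SIGTERM/SIGINT does not notify or log twice.
    pub fn request(&self) -> bool {
        if self.flag.swap(true, Ordering::SeqCst) {
            return false;
        }
        for tx in self.subscribers.lock().unwrap().iter() {
            // A subscriber that has already exited is not an error.
            let _ = tx.send(());
        }
        eprintln!("Shutdown requested");
        true
    }

    pub fn is_requested(&self) -> bool {
        self.flag.load(Ordering::SeqCst)
    }
}
```

Making `request()` idempotent is a design assumption here; it keeps a second Ctrl-C from re-running the notification path while cleanup is already in progress.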

### 3. Integrate with Worker Loop

1. Check shutdown_flag in main loop:
   ```rust
   loop {
       if shutdown.is_requested() {
           break;
       }
       // ... claim and process jobs
   }
   ```
2. Don't claim new jobs after shutdown
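
A slightly fuller sketch of the loop shows why the check belongs before the claim rather than after it. Everything here is illustrative: `run_worker` is a hypothetical function, a `VecDeque<u32>` stands in for the real job queue, and a bare `AtomicBool` stands in for `shutdown.is_requested()`.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};

/// Claims and processes jobs until shutdown is requested or the queue is
/// empty. Returns the ids of the jobs that were claimed.
pub fn run_worker(
    shutdown: &AtomicBool,
    queue: &mut VecDeque<u32>,
    mut process: impl FnMut(u32),
) -> Vec<u32> {
    let mut claimed = Vec::new();
    loop {
        // Check before every claim so no job is claimed after the signal.
        if shutdown.load(Ordering::SeqCst) {
            break;
        }
        let job = match queue.pop_front() {
            Some(job) => job,
            None => break, // nothing left to claim
        };
        claimed.push(job);
        process(job);
    }
    claimed
}
```

If the flag is set while a job is being processed, that job still finishes its `process` call in this sketch, and the remaining jobs stay in the queue for other workers.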

### 4. Integrate with Pipeline

1. Add shutdown check to ProgressCallback
2. On shutdown during processing:
   - Complete current batch (checkpoint boundary)
   - Save checkpoint
   - Return early with Interrupted error
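
One way to picture the checkpoint-boundary exit is the sketch below. It is an assumption-laden illustration, not the converter's actual API: `run_pipeline`, `PipelineError`, and the two closures are hypothetical, with `on_batch` standing in for the work reported through ProgressCallback.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq)]
pub enum PipelineError {
    /// Shutdown was requested; carries the last saved checkpoint.
    Interrupted { checkpoint: usize },
}

/// Processes `total_batches` batches and saves a checkpoint after each.
/// Returns the number of batches processed on normal completion.
pub fn run_pipeline(
    shutdown: &AtomicBool,
    total_batches: usize,
    mut on_batch: impl FnMut(usize),
    mut save_checkpoint: impl FnMut(usize),
) -> Result<usize, PipelineError> {
    for batch in 0..total_batches {
        // The current batch always runs to completion.
        on_batch(batch);
        let done = batch + 1;
        // Checkpoint boundary: state is durable before we consider exiting.
        save_checkpoint(done);
        // Only interrupt if work remains; a job whose last batch just
        // finished is reported as complete instead.
        if done < total_batches && shutdown.load(Ordering::SeqCst) {
            return Err(PipelineError::Interrupted { checkpoint: done });
        }
    }
    Ok(total_batches)
}
```

The shutdown flag is read only after the checkpoint is saved, so the `Interrupted` error always refers to a checkpoint the next worker can resume from.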

### 5. Implement Job Release

1. On shutdown while processing:
   - Save final checkpoint
   - Update job: status=Pending, owner=null
   - Preserve checkpoint for next worker
2. Log job release
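
The release step could look roughly like this. The `Job` and `JobStatus` types and the `release_job` function are hypothetical stand-ins for the real job record, and the ownership check is an added assumption (a worker should only release a job it still owns).

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum JobStatus {
    Pending,
    Processing,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Job {
    pub id: u64,
    pub status: JobStatus,
    pub owner: Option<String>,
    pub checkpoint: Option<u64>,
}

/// Releases `job` back to Pending if `pod_id` still owns it.
/// Returns false (and changes nothing) when another worker owns the job.
pub fn release_job(job: &mut Job, pod_id: &str, final_checkpoint: u64) -> bool {
    if job.owner.as_deref() != Some(pod_id) {
        return false;
    }
    // Keep the checkpoint so the next worker resumes instead of restarting.
    job.checkpoint = Some(final_checkpoint);
    job.status = JobStatus::Pending;
    job.owner = None;
    eprintln!("Released job {} at checkpoint {}", job.id, final_checkpoint);
    true
}
```

In the real system the three field updates would presumably need to be a single atomic write to the job store; this in-memory version does not model that.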

### 6. Implement Heartbeat Cleanup

1. On shutdown:
   - Stop heartbeat thread
   - Delete `/heartbeat/{pod_id}` key
2. Prevent false zombie detection
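
The ordering of the two cleanup steps matters: if the key is deleted while the heartbeat thread is still running, the thread can write it back. A minimal sketch, assuming a thread-based heartbeat and using an in-memory `HashMap` as a hypothetical stand-in for the real key-value store (`Heartbeat`, `Store`, and the 5 ms interval are all illustrative):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// In-memory stand-in for the key-value store (hypothetical).
pub type Store = Arc<Mutex<HashMap<String, u64>>>;

pub struct Heartbeat {
    stop: Arc<AtomicBool>,
    handle: JoinHandle<()>,
    key: String,
    store: Store,
}

impl Heartbeat {
    pub fn start(store: Store, pod_id: &str) -> Self {
        let key = format!("/heartbeat/{}", pod_id);
        // Write the first beat synchronously so the key exists on return.
        store.lock().unwrap().insert(key.clone(), 0);
        let stop = Arc::new(AtomicBool::new(false));
        let (stop_t, store_t, key_t) = (stop.clone(), store.clone(), key.clone());
        let handle = thread::spawn(move || {
            let mut beat = 0u64;
            while !stop_t.load(Ordering::SeqCst) {
                beat += 1;
                store_t.lock().unwrap().insert(key_t.clone(), beat);
                thread::sleep(Duration::from_millis(5));
            }
        });
        Self { stop, handle, key, store }
    }

    /// Stops the thread first, then deletes the key, so the thread
    /// cannot re-create the key after it has been removed.
    pub fn shutdown(self) {
        self.stop.store(true, Ordering::SeqCst);
        let _ = self.handle.join();
        self.store.lock().unwrap().remove(&self.key);
    }
}
```

Joining before deleting is what prevents a stale key from lingering and being mistaken for a live or zombie worker.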

### 7. Implement Timeout

1. Start timeout timer on shutdown signal
2. After 30 seconds: force exit
3. Log warning with current state
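
The timeout can be sketched as a bounded wait on the cleanup work. This is a std-only illustration under stated assumptions: `shutdown_with_timeout` and `ShutdownOutcome` are hypothetical names, and the real worker would likely use its async runtime's timer instead of a thread and channel.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum ShutdownOutcome {
    Clean,
    Forced,
}

/// Runs `cleanup` on a separate thread and waits at most `timeout` for it.
pub fn shutdown_with_timeout<F>(cleanup: F, timeout: Duration) -> ShutdownOutcome
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        cleanup();
        // The receiver may already have given up; ignore the send error.
        let _ = done_tx.send(());
    });
    match done_rx.recv_timeout(timeout) {
        Ok(()) => ShutdownOutcome::Clean,
        // Either the timer expired or the cleanup thread died early.
        Err(_) => {
            eprintln!("Shutdown did not finish within {:?}; forcing exit", timeout);
            ShutdownOutcome::Forced
        }
    }
}
```

A caller would pass `Duration::from_secs(30)` and, on `Forced`, log the current state and call `std::process::exit` with a non-zero code. The sketch returns an outcome instead of exiting so that it can be exercised in tests.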

### 8. K8s Integration

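1. Kubernetes sends SIGTERM to the container first
2. It then waits up to `terminationGracePeriodSeconds` (default: 30s)
3. If the process is still running after that, it sends SIGKILL
4. Set `terminationGracePeriodSeconds` >= 35s so the worker's 30-second forced exit fires before SIGKILL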

## Files to Create

- `crates/roboflow-distributed/src/shutdown.rs`

## Files to Modify

- `crates/roboflow-distributed/src/worker.rs`
- `crates/roboflow-distributed/src/heartbeat.rs`
- `crates/roboflow-dataset/src/streaming/converter.rs` (add shutdown check callback)

## Acceptance Criteria

- [ ] Signal handler registered
- [ ] Shutdown flag propagates
- [ ] Worker stops accepting jobs
- [ ] Pipeline exits at checkpoint boundary
- [ ] Job released to Pending
- [ ] Heartbeat cleaned up
- [ ] Timeout forces exit
- [ ] Integration test: send SIGTERM during processing
