Describe the bug
The `DecoupledCheckpointEngine` has multiple critical issues that can cause training jobs to hang indefinitely (a paraphrased sketch of the pattern follows this list):
- No timeout on process join: `cleanup()` calls `self.ckpt_process.join()` with no timeout - if the checkpoint process crashes or hangs, the training script hangs forever.
- No timeout on event wait: `commit()` calls `self.save_event.wait()` with no timeout - if the checkpoint process dies before setting the event, training hangs forever.
- Assertion used for runtime validation: Line 119 uses `assert info == self.commit_info`, which is disabled with `python -O`, allowing silent data corruption in production.
- `__del__` calls blocking cleanup: The destructor calls `cleanup()`, so any exception during training can cause the program to hang on exit.
- No process health checks: There is no verification that `ckpt_process` is still alive before waiting on events or queues.
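For reference, the problematic call pattern looks roughly like the sketch below. This is an illustrative paraphrase of the points above, not the actual contents of `decoupled_checkpoint_engine.py`; only the names `save_event`, `ckpt_process`, `commit_info`, `commit()`, and `cleanup()` come from the file, the rest is assumed for illustration:

```python
import multiprocessing as mp

class DecoupledCheckpointEngineSketch:
    """Illustrative paraphrase of the blocking behavior described above."""

    def __init__(self):
        self.save_event = mp.Event()
        self.commit_info = None
        self.ckpt_process = mp.Process(target=self._checkpoint_worker)
        self.ckpt_process.start()

    def _checkpoint_worker(self):
        ...  # writes checkpoint shards, then sets self.save_event

    def commit(self, info):
        assert info == self.commit_info  # skipped entirely under `python -O`
        self.save_event.wait()           # blocks forever if the worker died first
        return True

    def cleanup(self):
        self.ckpt_process.join()         # blocks forever if the worker hangs

    def __del__(self):
        self.cleanup()                   # destructor can hang the interpreter on exit
```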
To Reproduce
- Enable decoupled checkpointing in the DeepSpeed config
- Start a training job that saves checkpoints
- Kill or crash the checkpoint subprocess (e.g., OOM, signal, or disk full)
- Observe that the main training process hangs indefinitely on `save_event.wait()` or `ckpt_process.join()` (a minimal standalone demonstration of this hang is sketched after this list)

Alternatively:
- Run training with `python -O` (optimized mode)
- Trigger a checkpoint with mismatched `commit_info`
- The assert is skipped, potentially corrupting checkpoint state
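The hang mechanism itself can be reproduced outside DeepSpeed with a few lines of `multiprocessing` (illustrative script, not DeepSpeed code): a child process dies before setting the event, and an untimed `Event.wait()` in the parent never returns.

```python
import multiprocessing as mp
import os

def worker(event):
    # Simulate the checkpoint subprocess dying (OOM kill, crash, full disk, ...)
    # before it ever signals completion.
    os._exit(1)
    event.set()  # never reached

if __name__ == "__main__":
    save_event = mp.Event()
    proc = mp.Process(target=worker, args=(save_event,))
    proc.start()

    # Mirrors commit(): without a timeout or an is_alive() check,
    # the parent blocks here forever even though the child is already dead.
    save_event.wait()
    print("never printed")
```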
Expected behavior
- Checkpoint operations should have timeouts and fail gracefully with clear error messages
- Process health should be checked before blocking waits
- Runtime validation should use proper `if` statements, not assertions
- Training should be able to recover or exit cleanly if checkpointing fails
Affected Code
File: `deepspeed/runtime/checkpoint_engine/decoupled_checkpoint_engine.py`

| Line | Issue | Severity |
|---|---|---|
| 119 | `assert info == self.commit_info` - disabled with `-O` | Critical |
| 123 | `self.save_event.wait()` - no timeout | Critical |
| 155 | `self.ckpt_process.join()` - no timeout | Critical |
| 99-100 | `__del__` calls `cleanup()` - can hang on destruction | High |
System info
- Affects: All systems using `DecoupledCheckpointEngine`
- OS: Any
- GPU: Any
- Python: Any version
Proposed Fix
```python
# Replace assert with proper validation
if info != self.commit_info:
    raise ValueError(f"Checkpoint commit info mismatch: expected {self.commit_info}, got {info}")

# Add timeout to event wait with process health check
TIMEOUT_SECONDS = 300  # 5 minutes
while not self.save_event.wait(timeout=10):
    if not self.ckpt_process.is_alive():
        raise RuntimeError("Checkpoint process died unexpectedly")

# Add timeout to process join
self.ckpt_process.join(timeout=TIMEOUT_SECONDS)
if self.ckpt_process.is_alive():
    self.ckpt_process.terminate()
    raise RuntimeError("Checkpoint process failed to exit within timeout")
```