Summary
Add checkpoint-based execution so workflows can be resumed from the last checkpoint if a pod dies mid-execution.
Problem
Currently there is no automatic replay from checkpoint. If a pod dies mid-workflow, the entire step must be retried. Users must rely on external state stores and step-level retries.
Proposed change
- Checkpoint API for steps to persist intermediate state
- Controller-driven recovery that resumes from the last checkpoint
- Integration with existing storage offloading infrastructure
References
Summary
Add checkpoint-based execution so workflows can be resumed from the last checkpoint if a pod dies mid-execution.
Problem
Currently there is no automatic replay from checkpoint. If a pod dies mid-workflow, the entire step must be retried. Users must rely on external state stores and step-level retries.
Proposed change
References