Implement durable execution with automatic recovery from checkpoints #42

@lanycrost

Summary

Add checkpoint-based execution so workflows can be resumed from the last checkpoint if a pod dies mid-execution.

Problem

There is currently no automatic replay from a checkpoint. If a pod dies mid-workflow, the entire step must be retried from the beginning, and any intermediate progress is lost unless users persist it themselves in an external state store and rely on step-level retries.

Proposed change

  • Checkpoint API for steps to persist intermediate state
  • Controller-driven recovery that resumes from the last checkpoint
  • Integration with existing storage offloading infrastructure
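
A minimal sketch of how the first two bullets could fit together, written in Python purely for illustration. Every name here (`CheckpointStore`, `process_items`, `run_with_recovery`, `PodDied`) is hypothetical and not part of any existing Bobrapet API; the in-memory store stands in for the storage offloading backend, and the retry loop stands in for the controller rescheduling a step after its pod dies.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


class PodDied(Exception):
    """Hypothetical signal that simulates a pod being killed mid-step."""


@dataclass
class CheckpointStore:
    """In-memory stand-in for a durable checkpoint backend (assumption)."""

    _data: Dict[str, Any] = field(default_factory=dict)

    def save(self, step_id: str, state: Any) -> None:
        # A real backend would write durably before returning.
        self._data[step_id] = state

    def load(self, step_id: str) -> Optional[Any]:
        return self._data.get(step_id)


def process_items(
    items: List[int],
    store: CheckpointStore,
    step_id: str,
    processed: List[int],
    fail_at: Optional[int] = None,
) -> int:
    """Sum `items`, checkpointing after each one so work is never repeated."""
    state = store.load(step_id)
    if state is None:
        state = {"next_index": 0, "total": 0}

    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise PodDied(f"simulated pod failure before item {i}")
        processed.append(items[i])  # record the work actually performed
        state = {"next_index": i + 1, "total": state["total"] + items[i]}
        store.save(step_id, state)  # persist progress after each item

    return state["total"]


def run_with_recovery(step: Callable[[int], int], max_attempts: int = 3) -> int:
    """Stand-in for controller-driven recovery: rerun the step after a failure.

    The step itself resumes from its last checkpoint, so a rerun only
    performs the remaining work.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step(attempt)
        except PodDied:
            if attempt == max_attempts:
                raise
    raise RuntimeError("max_attempts must be at least 1")
```

In this sketch, a step that fails partway through its first attempt resumes on the second attempt at the index recorded in the last checkpoint, so no item is processed twice. Checkpointing after every item is the simplest policy to show; a real implementation would likely let the step decide how often to checkpoint, trading write overhead against the amount of work repeated after a failure.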

Labels

  • area/operator: Bobrapet controller or CRD-level change.
  • kind/feature: New functionality or enhancement request.
  • priority/high: Important issue to schedule soon.