[Feature] Ray Cluster: Preserving Job State After Cluster Restart #2479

xiaoming12306 · 2024-10-29T09:13:16Z

Search before asking

I searched the issues and found no similar feature request.

Description

Problem Description:
Job state is lost when a Ray cluster deployed via KubeRay on Kubernetes is restarted. Tasks cannot resume where they left off, so the entire workload has to be re-executed, which wastes time and increases computation costs.

Steps to Reproduce:

1. Deploy a Ray cluster using KubeRay on a Kubernetes environment.
2. Submit multiple jobs to the Ray cluster.
3. Restart the Ray cluster (either manually or by simulating a failure/recovery scenario).
4. Observe that the previous job states are lost after the restart.

Expected Behavior:
After a restart, the Ray cluster should retain or restore job states so that jobs can either resume from where they left off or be restarted from the last saved state.

Use case

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

kevin85421 · 2024-12-04T18:28:54Z

Currently, Ray users need to implement checkpointing themselves in the Ray application script. This example may be helpful: https://docs.ray.io/en/latest/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.html#kuberay-distributed-checkpointing-gcsefuse
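
To illustrate what application-level checkpointing could look like, here is a minimal sketch using only the Python standard library. It is not a Ray or KubeRay API: `load_checkpoint`, `save_checkpoint`, `run`, and the state layout are hypothetical names chosen for this example, and the summing loop stands in for real work the script would hand to Ray. It assumes the checkpoint path sits on storage that survives a cluster restart (for example a mounted bucket or persistent volume).

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the checkpoint.

    os.replace is atomic on the same filesystem, so a crash mid-write
    leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(path, items, stop_after=None):
    """Process items, resuming from the last checkpoint if one exists.

    stop_after is a hypothetical knob that simulates a cluster restart
    by stopping after that many items in this invocation.
    """
    state = load_checkpoint(path)
    processed = 0
    while state["next_item"] < len(items):
        if stop_after is not None and processed >= stop_after:
            break  # simulated interruption
        # Placeholder for real work (in a Ray script, a task or actor call).
        state["total"] += items[state["next_item"]]
        state["next_item"] += 1
        save_checkpoint(path, state)
        processed += 1
    return state
```

Calling `run` a second time with the same path picks up at `next_item` instead of starting over, which is the resume behavior the issue asks for, implemented by the script rather than by the cluster.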

xiaoming12306 added the enhancement (New feature or request) and triage labels Oct 29, 2024

kevin85421 added checkpoint and removed triage labels Dec 4, 2024
