
[Feature] Ray Cluster: Preserving Job State After Cluster Restart #2479

Open
1 of 2 tasks
xiaoming12306 opened this issue Oct 29, 2024 · 1 comment
Labels
checkpoint enhancement New feature or request

Comments

@xiaoming12306

Search before asking

  • I have searched the issues and found no similar feature request.

Description

Problem Description:
When I restart a Ray cluster deployed via KubeRay on Kubernetes, the job state is lost. We cannot resume tasks from where they left off and must re-execute the entire workload, which is inefficient and increases compute costs.

Steps to Reproduce:

1. Deploy a Ray cluster using KubeRay in a Kubernetes environment.
2. Submit multiple jobs to the Ray cluster.
3. Restart the Ray cluster (either manually or by simulating a failure/recovery scenario).
4. Observe that the previous job states are lost after the restart.

Expected Behavior:
After a restart, the Ray cluster should retain or restore the job states, so that jobs can either resume from where they left off or be conveniently restarted from the last saved state.

Use case

image

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
xiaoming12306 added the enhancement (New feature or request) and triage labels on Oct 29, 2024
@kevin85421
Member

Currently, Ray users need to implement checkpointing themselves in the Ray application script. This example may be helpful: https://docs.ray.io/en/latest/cluster/kubernetes/examples/distributed-checkpointing-with-gcsfuse.html#kuberay-distributed-checkpointing-gcsefuse
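
To illustrate what application-level checkpointing could look like, here is a minimal sketch of the resume logic in plain Python. It is not taken from Ray or KubeRay: `CHECKPOINT_PATH`, `load_checkpoint`, `save_checkpoint`, `process`, and `run` are hypothetical names, and the sketch assumes the checkpoint file sits on storage that survives a cluster restart (for example a mounted bucket or a persistent volume). In a real Ray script, `process` would stand in for whatever the Ray tasks compute.

```python
import json
import os
import tempfile

# Hypothetical location; in a KubeRay deployment this would need to point at
# storage that outlives the pods (assumption, not part of the original issue).
CHECKPOINT_PATH = os.environ.get("CHECKPOINT_PATH", "/tmp/job_checkpoint.json")


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "results": []}


def save_checkpoint(path, state):
    """Write to a temp file, then rename it over the old checkpoint.

    On a local POSIX filesystem the rename is atomic, so readers see either
    the old or the new file. Other storage backends may behave differently.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def process(item):
    """Placeholder for the real unit of work (e.g. the result of a Ray task)."""
    return item * item


def run(items, path=CHECKPOINT_PATH, stop_after=None):
    """Process items in order, resuming from the last saved position.

    stop_after simulates an interruption by returning early after that many
    items have been processed in this invocation.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    for i in range(state["next_item"], len(items)):
        if stop_after is not None and done_this_run >= stop_after:
            return state
        state["results"].append(process(items[i]))
        state["next_item"] = i + 1
        save_checkpoint(path, state)  # persist progress after every item
        done_this_run += 1
    return state
```

Calling `run(items)` again after an interruption picks up at `next_item` instead of starting over. Saving after every item keeps the sketch simple; a real workload would probably checkpoint less often to limit I/O.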
