Search before asking
I searched the issues and found no similar feature request.
Description
Problem Description:
Job state is lost after I restart a Ray cluster deployed via KubeRay on Kubernetes. This is highly disruptive: tasks cannot resume where they left off, so the entire workload has to be re-executed, which wastes time and increases computation costs.
Steps to Reproduce:
1. Deploy a Ray cluster using KubeRay on a Kubernetes environment.
2. Submit multiple jobs to the Ray cluster.
3. Restart the Ray cluster (either manually or by simulating a failure/recovery scenario).
4. Observe that the previous job states are not preserved and are lost after the restart.
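The steps above can be scripted roughly as follows. This is only a sketch, assuming the Ray Jobs API of the head node is reachable at `http://127.0.0.1:8265` (for example through a port-forward, which may need to be re-established after the restart). The address, the placeholder entrypoint, and the helper names `missing_after_restart` and `reproduce` are illustrative and not part of the original report.

```python
from typing import Iterable, Set


def missing_after_restart(before: Iterable[str], after: Iterable[str]) -> Set[str]:
    """Return submission IDs that were known before the restart but not after."""
    return set(before) - set(after)


def reproduce(address: str = "http://127.0.0.1:8265") -> None:
    """Submit a few jobs, wait for a manual cluster restart, then compare."""
    # Imported here so the helper above can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    # Assumption: `address` points at the head node's dashboard / Jobs API.
    client = JobSubmissionClient(address)

    # Step 2: submit a few placeholder jobs and remember their IDs.
    submitted = [
        client.submit_job(entrypoint="python -c 'print(1)'")
        for _ in range(3)
    ]

    # Step 3: restart the cluster out of band (e.g. delete the head pod).
    input("Restart the Ray cluster now, then press Enter...")

    # Step 4: reconnect and check which submissions are still listed.
    client = JobSubmissionClient(address)
    visible = [job.submission_id for job in client.list_jobs()]
    print("Lost jobs:", sorted(missing_after_restart(submitted, visible)))
```

Calling `reproduce()` against a live cluster should print the submission IDs that are no longer listed after the restart; an empty list would mean the job records survived.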
Expected Behavior:
After a restart, the Ray cluster should retain or restore job state so that jobs can either resume from where they left off or be conveniently restarted from the last saved state.
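To make the requested behavior concrete, here is a minimal sketch of what "retain or restore" could look like, assuming job states are written to a file on storage that outlives the cluster (such as a persistent volume). `JobStateStore`, the JSON layout, and the status strings are hypothetical and only illustrate the idea; this is not how Ray or KubeRay currently store job state.

```python
import json
import os
import tempfile
from typing import Dict, List

# Assumption: these status strings mark a job that needs no further work.
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "STOPPED")


class JobStateStore:
    """Hypothetical store that persists job states to a JSON file."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.states: Dict[str, str] = {}
        # Restore previously saved states if the file survived the restart.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.states = json.load(f)

    def update(self, job_id: str, state: str) -> None:
        """Record a state change and persist it immediately."""
        self.states[job_id] = state
        # Write to a temp file and rename it, so a crash mid-write
        # leaves the previous snapshot intact.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(self.states, f)
        os.replace(tmp, self.path)

    def unfinished(self) -> List[str]:
        """Job IDs that would need to be resumed or resubmitted."""
        return sorted(
            job_id
            for job_id, state in self.states.items()
            if state not in TERMINAL_STATES
        )
```

After a restart, constructing a new `JobStateStore` with the same path reloads the saved states, and `unfinished()` lists the jobs that would need to be resumed or resubmitted.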
Use case
Related issues
No response
Are you willing to submit a PR?
Yes I am willing to submit a PR!