[Core|Dataset] Ray job stuck with idle actors with no tasks #45822
Description
What happened + What you expected to happen
What happened
Our Ray job intermittently gets stuck. The job is submitted using the RayJob CRD. We use Ray Data to read the dataset and map_batches to distribute the data. On the dashboard, Ray Data Overview shows pending tasks, but Ray Core Overview shows everything as finished. We do not see any errors in the /tmp/ray/session_latest/logs directory. This is the script that is used as the entrypoint.
Original Slack thread: https://ray-distributed.slack.com/archives/C01DLHZHRBJ/p1716310603790369
At the very end of the job there are always one or two actors that are alive but idle. The dashboard's Ray Data Overview section still shows pending tasks, but they are never assigned to the idle actors. Killing the actor process does not help either.
Is there any way to recover from this? It happens when the job is about 95-99% complete, and the only option is to kill the job and rerun it. Is there a way in Ray Data to log or checkpoint the batches that have not yet been processed when a job is killed?
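To make the checkpointing question concrete, here is a minimal sketch of the behaviour we are asking about, written in plain Python with no Ray calls. It is not an existing Ray Data feature. The manifest file, the batch ids, and the helper names (load_done, mark_done, run) are all hypothetical and only for illustration: each finished batch id is appended to a manifest, and a rerun skips any id already recorded.

```python
import json
import os
import tempfile


def load_done(manifest_path):
    """Return the set of batch ids recorded as finished by earlier runs."""
    if not os.path.exists(manifest_path):
        return set()
    with open(manifest_path) as f:
        return {json.loads(line)["batch_id"] for line in f if line.strip()}


def mark_done(manifest_path, batch_id):
    """Append one finished batch id to the manifest (one JSON object per line)."""
    with open(manifest_path, "a") as f:
        f.write(json.dumps({"batch_id": batch_id}) + "\n")
        f.flush()
        os.fsync(f.fileno())  # make the record durable before moving on


def run(batches, manifest_path, process, stop_after=None):
    """Process batches not yet in the manifest; return the ids handled in this run.

    stop_after is a hypothetical knob that simulates the job being killed
    after that many batches.
    """
    done = load_done(manifest_path)
    processed = []
    for batch_id, rows in batches:
        if batch_id in done:
            continue  # finished in an earlier run, skip it
        if stop_after is not None and len(processed) >= stop_after:
            break  # simulated kill
        process(rows)
        mark_done(manifest_path, batch_id)
        processed.append(batch_id)
    return processed


if __name__ == "__main__":
    manifest = os.path.join(tempfile.mkdtemp(), "manifest.jsonl")
    batches = [(i, [i]) for i in range(5)]
    print(run(batches, manifest, lambda rows: None, stop_after=3))
    print(run(batches, manifest, lambda rows: None))
```

In this sketch the first call stops partway through and the second call only picks up the batches missing from the manifest, which is the recovery behaviour we would like to have when a stuck job has to be killed at 95-99%.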
What you expected to happen
We expected the job to run to completion without getting stuck.
Versions / Dependencies
We initially observed the issue with Ray 2.9.3; the same issue also occurs with 2.23.0.
Reproduction script
Issue Severity
High: It blocks me from completing my task. This issue is stopping us from adopting Ray as a batch inference solution for LLMs.