[Core|Dataset] Ray job stuck with idle actors with no tasks

### What happened + What you expected to happen

**What happened**

Our Ray job intermittently gets stuck. The job is submitted using the RayJob CRD. We use Ray Data to read the dataset and `map_batches` to distribute the work across actors. On the dashboard, the Ray Data Overview shows pending tasks, but the Ray Core Overview shows everything as finished. We do not see any errors in the `/tmp/ray/session_latest/logs` directory. This is the [script](https://github.com/vllm-project/vllm/blob/4abf6336ec65c270343eb895e7b18786e9274176/examples/offline_inference_distributed.py) that is used as the entrypoint.

Original Slack thread: https://ray-distributed.slack.com/archives/C01DLHZHRBJ/p1716310603790369

At the very end of the job, there are always one or two actors that are alive but idle. The Ray Data Overview section of the dashboard still shows pending tasks, but they are not assigned to the idle actors. Killing the actor process does not help either.

Is there any way to recover from this? It happens when the job is about 95-99% complete, and the only option is to kill the job and rerun it. Is there a way in Ray Data to log or checkpoint the batches that have not yet been processed when a job is killed?
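To illustrate the kind of checkpointing we are asking about, here is a minimal sketch in plain Python that uses no Ray APIs. It appends finished rows to a JSONL ledger and, on a rerun, filters out rows whose ids are already recorded. The helper names (`load_done_ids`, `record_batch`, `pending_rows`), the `id` and `output` fields, and the ledger file are all hypothetical; none of them exist in Ray Data or in the linked script. It also assumes each row has a stable id and that each ledger file has a single writer.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, List, Set


def load_done_ids(ledger_path: str) -> Set[str]:
    """Return the ids already recorded in the ledger (empty set if no ledger yet)."""
    if not os.path.exists(ledger_path):
        return set()
    done: Set[str] = set()
    with open(ledger_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A malformed line (for example, the job was killed mid-write)
                # is ignored, so the affected row is treated as not done.
                continue
    return done


def record_batch(ledger_path: str, rows: Iterable[Dict]) -> None:
    """Append one JSON line per finished row and force it to disk."""
    with open(ledger_path, "a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps({"id": row["id"], "output": row["output"]}) + "\n")
        f.flush()
        os.fsync(f.fileno())


def pending_rows(rows: List[Dict], ledger_path: str) -> List[Dict]:
    """Return the rows whose ids are not yet in the ledger, in input order."""
    done = load_done_ids(ledger_path)
    return [row for row in rows if row["id"] not in done]


if __name__ == "__main__":
    ledger = os.path.join(tempfile.mkdtemp(), "ledger.jsonl")
    prompts = [{"id": str(i), "text": f"prompt {i}"} for i in range(5)]

    # First run: pretend only the first two rows finished before the job was killed.
    record_batch(ledger, [{"id": "0", "output": "a"}, {"id": "1", "output": "b"}])

    # Rerun: only the rows that are not in the ledger would be fed back in.
    print([row["id"] for row in pending_rows(prompts, ledger)])
```

With something like this, a rerun would only need to process the last few percent of the data instead of starting over. It does not explain or fix the hang itself.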

**What you expected to happen**
We expected the job to run to completion without getting stuck.


### Versions / Dependencies

We initially observed the issue with Ray 2.9.3; the same issue also occurs with 2.23.0.

### Reproduction script

https://github.com/vllm-project/vllm/blob/4abf6336ec65c270343eb895e7b18786e9274176/examples/offline_inference_distributed.py

### Issue Severity

High: It blocks me from completing my task. This issue is stopping us from adopting Ray as a batch inferencing solution for LLMs.

[Core|Dataset] Ray job stuck with idle actors with no tasks #45822
