-
Notifications
You must be signed in to change notification settings - Fork 6.9k
[core] Preserve err type in case of task cancellation due to actor death #57095
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Preserve err type in case of task cancellation due to actor death #57095
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly adjusts the expected exceptions in the Python tests to align with the intended behavior of raising RayActorError on actor death. The core logic change in actor_task_submitter.cc is a good step towards distinguishing task cancellation reasons. However, I've identified a race condition that could still lead to the wrong exception type being raised. I've also suggested a minor refactoring to improve code maintainability.
|
Let's hold on from merging this PR. I just noticed that the @Sparks0219 can you link the postmerge failure where |
|
Thats super weird... is it due to differing instances types maybe? On my devbox it's failing on master without the pytest fix: And here is the automated issue that flagged some failing postmerge tests: #44058 (comment) |
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…n the RPC response itself Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
0474ea9 to
8806b54
Compare
Discussed offline. There was a difference in my test env (devbox is using different instance type and i think instance type in ci also got updated recently). The issue is legit and reproducible. I have addressed all the other comments. |
@codope have you fixed the issue yet? |
I have a few questions about this
|
I am not sure but this is obviously a sign of timing-dependent issues. The current fix uses reply.worker_exiting(), which travels atomically with the RPC response so there is no race condition.
I think between the premerge and postmerge, the CI env must have changed. |
…rwise, add the task to wait_for_death_info_tasks_ to wait for the death notification from GCS within a grace period Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
|
Test failures |
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
israbbani
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| auto queue_pair = client_queues_.find(actor_id); | ||
| if (queue_pair != client_queues_.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a RAY_CHECK failure in the case where the actor is not dead. I don't like adding more RAY_CHECKs, but I don't see how we can recover if we're handling a PushTaskReply and the actor doesn't have a client_queue_.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ath (ray-project#57095) Check if task cancellation is due to actor shutdown or explicit user cancellation. Actor shutdown should raise RayActorError, not TaskCancelledError. Closes ray-project#57092 --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…ath (ray-project#57095) Check if task cancellation is due to actor shutdown or explicit user cancellation. Actor shutdown should raise RayActorError, not TaskCancelledError. Closes ray-project#57092 --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
…ath (ray-project#57095) Check if task cancellation is due to actor shutdown or explicit user cancellation. Actor shutdown should raise RayActorError, not TaskCancelledError. Closes ray-project#57092 --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…ath (ray-project#57095) Check if task cancellation is due to actor shutdown or explicit user cancellation. Actor shutdown should raise RayActorError, not TaskCancelledError. Closes ray-project#57092 --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: xgui <xgui@anyscale.com>
…ath (ray-project#57095) Check if task cancellation is due to actor shutdown or explicit user cancellation. Actor shutdown should raise RayActorError, not TaskCancelledError. Closes ray-project#57092 --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…ath (ray-project#57095) Check if task cancellation is due to actor shutdown or explicit user cancellation. Actor shutdown should raise RayActorError, not TaskCancelledError. Closes ray-project#57092 --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Check if task cancellation is due to actor shutdown or explicit user cancellation. Actor shutdown should raise RayActorError, not TaskCancelledError.
Closes #57092
Note
Ensure cancellations caused by actor death return RayActorError (with death cause) instead of TaskCancelledError; update tests accordingly.
src/ray/core_worker/task_submission/actor_task_submitter.cc, refine handling ofstatus.IsSchedulingCancelled()inHandlePushTaskReply:DEADviaclient_queues_and propagate death-cause-derived error (rpc::ErrorTypefromgcs::GetErrorInfoFromActorDeathCause), yieldingRayActorErrorsemantics.rpc::ErrorType::TASK_CANCELLEDwith a concise message.python/ray/tests/test_actor_failures.py,test_exit_actor_queuednow expects onlyray.exceptions.RayActorErrorfor both async and sync cases.Written by Cursor Bugbot for commit 0474ea9. This will update automatically on new commits. Configure here.