Conversation

@codope
Contributor

@codope codope commented Oct 1, 2025

Check if task cancellation is due to actor shutdown or explicit user cancellation. Actor shutdown should raise RayActorError, not TaskCancelledError.

Closes #57092


Note

Ensure cancellations caused by actor death return RayActorError (with death cause) instead of TaskCancelledError; update tests accordingly.

  • Core (task submission):
    • In src/ray/core_worker/task_submission/actor_task_submitter.cc, refine handling of status.IsSchedulingCancelled() in HandlePushTaskReply:
      • Detect if the actor is DEAD via client_queues_ and propagate a death-cause-derived error (rpc::ErrorType from gcs::GetErrorInfoFromActorDeathCause), yielding RayActorError semantics (see the sketch below).
      • Otherwise, keep reporting rpc::ErrorType::TASK_CANCELLED with a concise message.
    • Minor logging and error_info construction adjustments.
  • Tests:
    • In python/ray/tests/test_actor_failures.py, test_exit_actor_queued now expects only ray.exceptions.RayActorError for both async and sync cases.

Written by Cursor Bugbot for commit 0474ea9.
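
For illustration, a minimal self-contained sketch of the decision described in the first bullet above; it is not Ray's implementation. ErrorType, ClientQueue, and ClassifyCancellation are made-up, simplified stand-ins for rpc::ErrorType, the per-actor state kept in client_queues_, and the logic in HandlePushTaskReply (which derives the real error info via gcs::GetErrorInfoFromActorDeathCause).

#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

// Simplified stand-ins; the real types live inside Ray's core worker.
enum class ErrorType { TASK_CANCELLED, ACTOR_DIED };

struct ClientQueue {
  bool is_dead = false;                    // actor state as seen by the submitter
  std::optional<std::string> death_cause;  // populated once the actor is DEAD
};

// A cancelled task whose actor is already DEAD should surface the death cause
// (RayActorError semantics); otherwise it stays a plain TASK_CANCELLED.
ErrorType ClassifyCancellation(
    const std::string &actor_id,
    const std::unordered_map<std::string, ClientQueue> &queues) {
  auto it = queues.find(actor_id);
  if (it != queues.end() && it->second.is_dead && it->second.death_cause) {
    return ErrorType::ACTOR_DIED;  // error type derived from the death cause
  }
  return ErrorType::TASK_CANCELLED;  // explicit user cancellation
}

int main() {
  std::unordered_map<std::string, ClientQueue> queues;
  queues["a1"] = ClientQueue{true, std::string("actor exited via exit_actor()")};
  queues["a2"] = ClientQueue{false, std::nullopt};
  std::cout << (ClassifyCancellation("a1", queues) == ErrorType::ACTOR_DIED) << "\n";      // 1
  std::cout << (ClassifyCancellation("a2", queues) == ErrorType::TASK_CANCELLED) << "\n";  // 1
}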

@codope codope requested a review from a team as a code owner October 1, 2025 16:59
@codope codope added the go (add ONLY when ready to merge, run all tests) label Oct 1, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly adjusts the expected exceptions in the Python tests to align with the intended behavior of raising RayActorError on actor death. The core logic change in actor_task_submitter.cc is a good step towards distinguishing task cancellation reasons. However, I've identified a race condition that could still lead to the wrong exception type being raised. I've also suggested a minor refactoring to improve code maintainability.

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) label Oct 1, 2025
@codope
Contributor Author

codope commented Oct 2, 2025

Let's hold off on merging this PR. I just noticed that test_actor_failures ran successfully in premerge for #56159 -- https://buildkite.com/ray-project/premerge/builds/50170#01999b55-e7a4-4f25-98bc-46c736f9c623/186-4343. I also ran the test on my devbox after reverting #57083, and it passed. So the original premise of #57083 might not be correct.

@Sparks0219 can you link the postmerge failure where test_actor_failures::test_exit_actor_queued failed? Or if you were able to repro locally, then the logs would be super helpful.

@Sparks0219
Contributor

That's super weird... is it due to differing instance types, maybe? On my devbox it's failing on master without the pytest fix:

(myenv) ubuntu@devbox:~/ray$ python -m pytest python/ray/tests/test_actor_failures.py::test_exit_actor_queued -v -s
======================================================================================== test session starts =========================================================================================
platform linux -- Python 3.9.23, pytest-8.4.1, pluggy-1.6.0 -- /home/ubuntu/.conda/envs/myenv/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/ray
configfile: pytest.ini
plugins: repeat-0.9.4, asyncio-1.2.0, anyio-4.9.0, retry-1.7.0
asyncio: mode=strict, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                                                                                                                                     

python/ray/tests/test_actor_failures.py::test_exit_actor_queued 2025-10-02 02:41:12,951	INFO worker.py:2004 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(raylet) Warning: More than 5000 tasks are pending submission to actor eea8dc9a62b8cb8de3bdff5f01000000. To reduce memory usage, wait for these tasks to finish before sending more.
FAILED(raylet) Warning: More than 10000 tasks are pending submission to actor c87d9a0c6623219fd5c695fc01000000. To reduce memory usage, wait for these tasks to finish before sending more.


============================================================================================== FAILURES ==============================================================================================
_______________________________________________________________________________________ test_exit_actor_queued _______________________________________________________________________________________

shutdown_only = None

    def test_exit_actor_queued(shutdown_only):
        """Verify after exit_actor is called the queued tasks won't execute."""
    
        @ray.remote
        class RegressionSync:
            def f(self):
                import time
    
                time.sleep(1)
                exit_actor()
    
            def ping(self):
                pass
    
        @ray.remote
        class RegressionAsync:
            async def f(self):
                await asyncio.sleep(1)
                exit_actor()
    
            def ping(self):
                pass
    
        # Test async case.
        # https://github.com/ray-project/ray/issues/32376
        # If we didn't fix this issue, this will segfault.
        a = RegressionAsync.remote()
        a.f.remote()
        refs = [a.ping.remote() for _ in range(10000)]
        with pytest.raises(ray.exceptions.RayActorError
        ) as exc_info:
            ray.get(refs)
        assert " Worker unexpectedly exits" not in str(exc_info.value)
    
        # Test a sync case.
        a = RegressionSync.remote()
        a.f.remote()
        with pytest.raises(ray.exceptions.RayActorError
        ) as exc_info:
>           ray.get([a.ping.remote() for _ in range(10000)])

python/ray/tests/test_actor_failures.py:1196: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
python/ray/_private/auto_init_hook.py:22: in auto_init_wrapper
    return fn(*args, **kwargs)
python/ray/_private/client_mode_hook.py:104: in wrapper
    return func(*args, **kwargs)
python/ray/_private/worker.py:2962: in get
    values, debugger_breakpoint = worker.get_objects(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ray._private.worker.Worker object at 0x7f77e2a43730>
object_refs = [ObjectRef(9078e25b3cd9207ac87d9a0c6623219fd5c695fc0100000001000000), ObjectRef(f8570aca6e94fb7ac87d9a0c6623219fd5c695...4f4c87d9a0c6623219fd5c695fc0100000001000000), ObjectRef(852ddd9a96c844dcc87d9a0c6623219fd5c695fc0100000001000000), ...]
timeout = None, return_exceptions = False, skip_deserialization = False, _tensor_transport = None

    def get_objects(
        self,
        object_refs: list,
        timeout: Optional[float] = None,
        return_exceptions: bool = False,
        skip_deserialization: bool = False,
        _tensor_transport: Optional[str] = None,
    ) -> Tuple[List[serialization.SerializedRayObject], bytes]:
        """Get the values in the object store associated with the IDs.
    
        Return the values from the local object store for object_refs. This
        will block until all the values for object_refs have been written to
        the local object store.
    
        Args:
            object_refs: A list of the object refs
                whose values should be retrieved.
            timeout: The maximum amount of time in
                seconds to wait before returning.
            return_exceptions: If any of the objects deserialize to an
                Exception object, whether to return them as values in the
                returned list. If False, then the first found exception will be
                raised.
            skip_deserialization: If true, only the buffer will be released and
                the object associated with the buffer will not be deserialized.
            _tensor_transport: [Alpha] The tensor transport to use to fetch `torch.Tensors` found in the Ray Direct Transport object. Currently, this supports "object_store" and "nixl".
        Returns:
            list: List of deserialized objects or None if skip_deserialization is True.
            bytes: UUID of the debugger breakpoint we should drop
                into or b"" if there is no breakpoint.
        """
        # Make sure that the values are object refs.
        for object_ref in object_refs:
            if not isinstance(object_ref, ObjectRef):
                raise TypeError(
                    f"Attempting to call `get` on the value {object_ref}, "
                    "which is not an ray.ObjectRef."
                )
        tensor_transport: TensorTransportEnum = (
            TensorTransportEnum.from_str(_tensor_transport)
            if _tensor_transport is not None
            else None
        )
        assert tensor_transport in [
            TensorTransportEnum.OBJECT_STORE,
            TensorTransportEnum.NIXL,
            None,
        ], "Currently, RDT only supports 'object_store' and 'nixl' for tensor transport in ray.get()."
        timeout_ms = (
            int(timeout * 1000) if timeout is not None and timeout != -1 else -1
        )
        serialized_objects: List[
            serialization.SerializedRayObject
        ] = self.core_worker.get_objects(
            object_refs,
            timeout_ms,
        )
    
        debugger_breakpoint = b""
        for data, metadata, _ in serialized_objects:
            if metadata:
                metadata_fields = metadata.split(b",")
                if len(metadata_fields) >= 2 and metadata_fields[1].startswith(
                    ray_constants.OBJECT_METADATA_DEBUG_PREFIX
                ):
                    debugger_breakpoint = metadata_fields[1][
                        len(ray_constants.OBJECT_METADATA_DEBUG_PREFIX) :
                    ]
        if skip_deserialization:
            return None, debugger_breakpoint
    
        values = self.deserialize_objects(
            serialized_objects, object_refs, tensor_transport_hint=tensor_transport
        )
        if not return_exceptions:
            # Raise exceptions instead of returning them to the user.
            for i, value in enumerate(values):
                if isinstance(value, RayError):
                    if isinstance(value, ray.exceptions.ObjectLostError):
                        global_worker.core_worker.log_plasma_usage()
                    if isinstance(value, RayTaskError):
                        raise value.as_instanceof_cause()
                    else:
>                       raise value
E                       ray.exceptions.TaskCancelledError: The task aa9c16e4326c9c3ec87d9a0c6623219fd5c695fc01000000 is canceled from an actor c87d9a0c6623219fd5c695fc01000000 before it executes.

python/ray/_private/worker.py:1028: TaskCancelledError
====================================================================================== short test summary info =======================================================================================
FAILED python/ray/tests/test_actor_failures.py::test_exit_actor_queued - ray.exceptions.TaskCancelledError: The task aa9c16e4326c9c3ec87d9a0c6623219fd5c695fc01000000 is canceled from an actor c87d9a0c6623219fd5c695fc01000000 before it executes.

And here is the automated issue that flagged some failing postmerge tests: #44058 (comment)

codope added 2 commits October 3, 2025 06:33
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…n the RPC response itself

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@codope codope force-pushed the task-cancellation-exception branch from 0474ea9 to 8806b54 Compare October 3, 2025 06:34
@codope
Contributor Author

codope commented Oct 3, 2025

Let's hold off on merging this PR. I just noticed that test_actor_failures ran successfully in premerge for #56159 -- https://buildkite.com/ray-project/premerge/builds/50170#01999b55-e7a4-4f25-98bc-46c736f9c623/186-4343. I also ran the test on my devbox after reverting #57083, and it passed. So the original premise of #57083 might not be correct.

@Sparks0219 can you link the postmerge failure where test_actor_failures::test_exit_actor_queued failed? Or if you were able to repro locally, then the logs would be super helpful.

Discussed offline. There was a difference in my test env (the devbox uses a different instance type, and I think the instance type in CI was also updated recently). The issue is legit and reproducible. I have addressed all the other comments.

@jjyao jjyao requested a review from edoakes October 3, 2025 17:23
@jjyao
Collaborator

jjyao commented Oct 3, 2025

The issue is legit and reproducible.

@codope have you fixed the issue yet?

@israbbani
Contributor

Let's hold off on merging this PR. I just noticed that test_actor_failures ran successfully in premerge for #56159 -- https://buildkite.com/ray-project/premerge/builds/50170#01999b55-e7a4-4f25-98bc-46c736f9c623/186-4343. I also ran the test on my devbox after reverting #57083, and it passed. So the original premise of #57083 might not be correct.
@Sparks0219 can you link the postmerge failure where test_actor_failures::test_exit_actor_queued failed? Or if you were able to repro locally, then the logs would be super helpful.

Discussed offline. There was a difference in my test env (the devbox uses a different instance type, and I think the instance type in CI was also updated recently). The issue is legit and reproducible. I have addressed all the other comments.

I have a few questions about this:

  • Why did the difference in instance type cause the test results to be flaky?
  • If this test runs on premerge, how did it pass in the first place?

@codope
Contributor Author

codope commented Oct 4, 2025

@codope have you fixed the issue yet?

@jjyao Yes, this PR fixes the issue.

@codope
Contributor Author

codope commented Oct 4, 2025

Why did the difference in instance type cause the test results to be flaky?

I am not sure, but it is clearly a sign of a timing-dependent issue. The current fix uses reply.worker_exiting(), which travels atomically with the RPC response, so there is no race condition.
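
For illustration only, a tiny self-contained sketch of that idea; PushTaskReplyLike and ClassifyReply are made-up stand-ins (the real field is reply.worker_exiting() on the push-task reply). The point is that the "worker is exiting" signal arrives in the same reply as the cancellation status, so the submitter never has to race a separate death notification to choose the right error type.

#include <iostream>

// Made-up stand-in for the real push-task reply; only the two fields relevant
// to this discussion are modeled.
struct PushTaskReplyLike {
  bool scheduling_cancelled = false;
  bool worker_exiting = false;  // analogous to reply.worker_exiting()
};

enum class ErrorType { TASK_CANCELLED, ACTOR_DIED };

// The signal rides on the reply itself, so no separate (racy) lookup of the
// actor's death state is needed to classify the cancellation.
ErrorType ClassifyReply(const PushTaskReplyLike &reply) {
  if (reply.scheduling_cancelled && reply.worker_exiting) {
    return ErrorType::ACTOR_DIED;  // actor shutdown, surface RayActorError
  }
  return ErrorType::TASK_CANCELLED;  // user-initiated cancellation
}

int main() {
  std::cout << (ClassifyReply({true, true}) == ErrorType::ACTOR_DIED) << "\n";       // 1
  std::cout << (ClassifyReply({true, false}) == ErrorType::TASK_CANCELLED) << "\n";  // 1
}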

If this test runs on premerge, how did it pass in the first place?

I think the CI env must have changed between the premerge and the postmerge runs.

…rwise, add the task to wait_for_death_info_tasks_ to wait for the death notification from GCS within a grace period

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@jjyao
Collaborator

jjyao commented Oct 6, 2025

Test failures

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@jjyao jjyao requested a review from israbbani October 7, 2025 04:17
Contributor

@israbbani israbbani left a comment

LGTM. I left one comment. If we need to unblock v2.50, we can merge this and do that separately. @edoakes / @jjyao

Comment on lines +788 to +789
auto queue_pair = client_queues_.find(actor_id);
if (queue_pair != client_queues_.end()) {

This is a RAY_CHECK failure in the case where the actor is not dead. I don't like adding more RAY_CHECKs, but I don't see how we can recover if we're handling a PushTaskReply and the actor doesn't have a client_queue_.
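
To make the trade-off concrete, a small self-contained illustration (not the merged code): ActorIsDeadBranch mirrors the tolerant lookup in the quoted diff, while ActorIsDeadChecked treats a missing client queue as an unrecoverable invariant violation, with assert() standing in for Ray's RAY_CHECK.

#include <cassert>
#include <string>
#include <unordered_map>

struct ClientQueue {
  bool is_dead = false;  // simplified stand-in for the per-actor queue state
};

// Option A (what the quoted diff does): tolerate a missing queue and simply
// skip the dead-actor handling.
bool ActorIsDeadBranch(const std::unordered_map<std::string, ClientQueue> &queues,
                       const std::string &actor_id) {
  auto it = queues.find(actor_id);
  return it != queues.end() && it->second.is_dead;
}

// Option B (the suggestion in spirit): a PushTaskReply implies the submitter
// already created a queue for this actor, so a missing queue is a bug we
// cannot recover from. assert() stands in for RAY_CHECK here.
bool ActorIsDeadChecked(const std::unordered_map<std::string, ClientQueue> &queues,
                        const std::string &actor_id) {
  auto it = queues.find(actor_id);
  assert(it != queues.end() && "no client queue while handling a PushTaskReply");
  return it->second.is_dead;
}

int main() {
  std::unordered_map<std::string, ClientQueue> queues{{"a1", ClientQueue{true}}};
  return (ActorIsDeadBranch(queues, "a1") && ActorIsDeadChecked(queues, "a1")) ? 0 : 1;
}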

@edoakes
Collaborator

edoakes commented Oct 7, 2025

LGTM. I left one comment. If we need to unblock v2.50, we can merge this and do that separately. @edoakes / @jjyao

will do this to keep in @aslonnie's good graces

@edoakes edoakes merged commit f9cb400 into ray-project:master Oct 7, 2025
6 checks passed
aslonnie added a commit that referenced this pull request Oct 8, 2025
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
aslonnie added a commit that referenced this pull request Oct 8, 2025
…ath (#57538)

cherrypick #57095

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes #57092

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Josh Kodi <joshkodi@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: xgui <xgui@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>

Labels

  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests

Development

Successfully merging this pull request may close these issues.

[core] Tasks rejected during graceful shutdown should raise RayActorError

5 participants