Conversation

@codope
Contributor

@codope codope commented Oct 1, 2025

Check if task cancellation is due to actor shutdown or explicit user cancellation. Actor shutdown should raise RayActorError, not TaskCancelledError.

Closes #57092


Note

Ensure cancellations caused by actor death return RayActorError (with death cause) instead of TaskCancelledError; update tests accordingly.

  • Core (task submission):
    • In src/ray/core_worker/task_submission/actor_task_submitter.cc, refine handling of status.IsSchedulingCancelled() in HandlePushTaskReply:
      • Detect if the actor is DEAD via client_queues_ and propagate a death-cause-derived error (rpc::ErrorType from gcs::GetErrorInfoFromActorDeathCause), yielding RayActorError semantics (see the sketch below).
      • Otherwise, keep reporting rpc::ErrorType::TASK_CANCELLED with a concise message.
    • Minor logging and error_info construction adjustments.
  • Tests:
    • In python/ray/tests/test_actor_failures.py, test_exit_actor_queued now expects only ray.exceptions.RayActorError for both async and sync cases.

Written by Cursor Bugbot for commit 0474ea9.
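
For illustration, a minimal self-contained sketch of the decision described in the first bullet above; it is not Ray's implementation. ErrorType, ClientQueue, and ClassifyCancellation are made-up, simplified stand-ins for rpc::ErrorType, the per-actor state kept in client_queues_, and the logic in HandlePushTaskReply (which derives the real error info via gcs::GetErrorInfoFromActorDeathCause).

#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

// Simplified stand-ins; the real types live inside Ray's core worker.
enum class ErrorType { TASK_CANCELLED, ACTOR_DIED };

struct ClientQueue {
  bool is_dead = false;                    // actor state as seen by the submitter
  std::optional<std::string> death_cause;  // populated once the actor is DEAD
};

// A cancelled task whose actor is already DEAD should surface the death cause
// (RayActorError semantics); otherwise it stays a plain TASK_CANCELLED.
ErrorType ClassifyCancellation(
    const std::string &actor_id,
    const std::unordered_map<std::string, ClientQueue> &queues) {
  auto it = queues.find(actor_id);
  if (it != queues.end() && it->second.is_dead && it->second.death_cause) {
    return ErrorType::ACTOR_DIED;  // error type derived from the death cause
  }
  return ErrorType::TASK_CANCELLED;  // explicit user cancellation
}

int main() {
  std::unordered_map<std::string, ClientQueue> queues;
  queues["a1"] = ClientQueue{true, std::string("actor exited via exit_actor()")};
  queues["a2"] = ClientQueue{false, std::nullopt};
  std::cout << (ClassifyCancellation("a1", queues) == ErrorType::ACTOR_DIED) << "\n";      // 1
  std::cout << (ClassifyCancellation("a2", queues) == ErrorType::TASK_CANCELLED) << "\n";  // 1
}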

@codope codope requested a review from a team as a code owner October 1, 2025 16:59
@codope codope added the go (add ONLY when ready to merge, run all tests) label Oct 1, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly adjusts the expected exceptions in the Python tests to align with the intended behavior of raising RayActorError on actor death. The core logic change in actor_task_submitter.cc is a good step towards distinguishing task cancellation reasons. However, I've identified a race condition that could still lead to the wrong exception type being raised. I've also suggested a minor refactoring to improve code maintainability.

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) label Oct 1, 2025
@codope
Contributor Author

codope commented Oct 2, 2025

Let's hold off on merging this PR. I just noticed that test_actor_failures ran successfully in premerge for #56159 -- https://buildkite.com/ray-project/premerge/builds/50170#01999b55-e7a4-4f25-98bc-46c736f9c623/186-4343. I also ran the test on my devbox after reverting #57083, and it passed. So the original premise of #57083 might not be correct.

@Sparks0219 can you link the postmerge failure where test_actor_failures::test_exit_actor_queued failed? Or if you were able to repro locally, then the logs would be super helpful.

@Sparks0219
Contributor

That's super weird... is it due to differing instance types, maybe? On my devbox it's failing on master without the pytest fix:

(myenv) ubuntu@devbox:~/ray$ python -m pytest python/ray/tests/test_actor_failures.py::test_exit_actor_queued -v -s
======================================================================================== test session starts =========================================================================================
platform linux -- Python 3.9.23, pytest-8.4.1, pluggy-1.6.0 -- /home/ubuntu/.conda/envs/myenv/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/ray
configfile: pytest.ini
plugins: repeat-0.9.4, asyncio-1.2.0, anyio-4.9.0, retry-1.7.0
asyncio: mode=strict, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                                                                                                                                     

python/ray/tests/test_actor_failures.py::test_exit_actor_queued 2025-10-02 02:41:12,951	INFO worker.py:2004 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(raylet) Warning: More than 5000 tasks are pending submission to actor eea8dc9a62b8cb8de3bdff5f01000000. To reduce memory usage, wait for these tasks to finish before sending more.
FAILED(raylet) Warning: More than 10000 tasks are pending submission to actor c87d9a0c6623219fd5c695fc01000000. To reduce memory usage, wait for these tasks to finish before sending more.


============================================================================================== FAILURES ==============================================================================================
_______________________________________________________________________________________ test_exit_actor_queued _______________________________________________________________________________________

shutdown_only = None

    def test_exit_actor_queued(shutdown_only):
        """Verify after exit_actor is called the queued tasks won't execute."""
    
        @ray.remote
        class RegressionSync:
            def f(self):
                import time
    
                time.sleep(1)
                exit_actor()
    
            def ping(self):
                pass
    
        @ray.remote
        class RegressionAsync:
            async def f(self):
                await asyncio.sleep(1)
                exit_actor()
    
            def ping(self):
                pass
    
        # Test async case.
        # https://github.com/ray-project/ray/issues/32376
        # If we didn't fix this issue, this will segfault.
        a = RegressionAsync.remote()
        a.f.remote()
        refs = [a.ping.remote() for _ in range(10000)]
        with pytest.raises(ray.exceptions.RayActorError
        ) as exc_info:
            ray.get(refs)
        assert " Worker unexpectedly exits" not in str(exc_info.value)
    
        # Test a sync case.
        a = RegressionSync.remote()
        a.f.remote()
        with pytest.raises(ray.exceptions.RayActorError
        ) as exc_info:
>           ray.get([a.ping.remote() for _ in range(10000)])

python/ray/tests/test_actor_failures.py:1196: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
python/ray/_private/auto_init_hook.py:22: in auto_init_wrapper
    return fn(*args, **kwargs)
python/ray/_private/client_mode_hook.py:104: in wrapper
    return func(*args, **kwargs)
python/ray/_private/worker.py:2962: in get
    values, debugger_breakpoint = worker.get_objects(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ray._private.worker.Worker object at 0x7f77e2a43730>
object_refs = [ObjectRef(9078e25b3cd9207ac87d9a0c6623219fd5c695fc0100000001000000), ObjectRef(f8570aca6e94fb7ac87d9a0c6623219fd5c695...4f4c87d9a0c6623219fd5c695fc0100000001000000), ObjectRef(852ddd9a96c844dcc87d9a0c6623219fd5c695fc0100000001000000), ...]
timeout = None, return_exceptions = False, skip_deserialization = False, _tensor_transport = None

    def get_objects(
        self,
        object_refs: list,
        timeout: Optional[float] = None,
        return_exceptions: bool = False,
        skip_deserialization: bool = False,
        _tensor_transport: Optional[str] = None,
    ) -> Tuple[List[serialization.SerializedRayObject], bytes]:
        """Get the values in the object store associated with the IDs.
    
        Return the values from the local object store for object_refs. This
        will block until all the values for object_refs have been written to
        the local object store.
    
        Args:
            object_refs: A list of the object refs
                whose values should be retrieved.
            timeout: The maximum amount of time in
                seconds to wait before returning.
            return_exceptions: If any of the objects deserialize to an
                Exception object, whether to return them as values in the
                returned list. If False, then the first found exception will be
                raised.
            skip_deserialization: If true, only the buffer will be released and
                the object associated with the buffer will not be deserialized.
            _tensor_transport: [Alpha] The tensor transport to use to fetch `torch.Tensors` found in the Ray Direct Transport object. Currently, this supports "object_store" and "nixl".
        Returns:
            list: List of deserialized objects or None if skip_deserialization is True.
            bytes: UUID of the debugger breakpoint we should drop
                into or b"" if there is no breakpoint.
        """
        # Make sure that the values are object refs.
        for object_ref in object_refs:
            if not isinstance(object_ref, ObjectRef):
                raise TypeError(
                    f"Attempting to call `get` on the value {object_ref}, "
                    "which is not an ray.ObjectRef."
                )
        tensor_transport: TensorTransportEnum = (
            TensorTransportEnum.from_str(_tensor_transport)
            if _tensor_transport is not None
            else None
        )
        assert tensor_transport in [
            TensorTransportEnum.OBJECT_STORE,
            TensorTransportEnum.NIXL,
            None,
        ], "Currently, RDT only supports 'object_store' and 'nixl' for tensor transport in ray.get()."
        timeout_ms = (
            int(timeout * 1000) if timeout is not None and timeout != -1 else -1
        )
        serialized_objects: List[
            serialization.SerializedRayObject
        ] = self.core_worker.get_objects(
            object_refs,
            timeout_ms,
        )
    
        debugger_breakpoint = b""
        for data, metadata, _ in serialized_objects:
            if metadata:
                metadata_fields = metadata.split(b",")
                if len(metadata_fields) >= 2 and metadata_fields[1].startswith(
                    ray_constants.OBJECT_METADATA_DEBUG_PREFIX
                ):
                    debugger_breakpoint = metadata_fields[1][
                        len(ray_constants.OBJECT_METADATA_DEBUG_PREFIX) :
                    ]
        if skip_deserialization:
            return None, debugger_breakpoint
    
        values = self.deserialize_objects(
            serialized_objects, object_refs, tensor_transport_hint=tensor_transport
        )
        if not return_exceptions:
            # Raise exceptions instead of returning them to the user.
            for i, value in enumerate(values):
                if isinstance(value, RayError):
                    if isinstance(value, ray.exceptions.ObjectLostError):
                        global_worker.core_worker.log_plasma_usage()
                    if isinstance(value, RayTaskError):
                        raise value.as_instanceof_cause()
                    else:
>                       raise value
E                       ray.exceptions.TaskCancelledError: The task aa9c16e4326c9c3ec87d9a0c6623219fd5c695fc01000000 is canceled from an actor c87d9a0c6623219fd5c695fc01000000 before it executes.

python/ray/_private/worker.py:1028: TaskCancelledError
====================================================================================== short test summary info =======================================================================================
FAILED python/ray/tests/test_actor_failures.py::test_exit_actor_queued - ray.exceptions.TaskCancelledError: The task aa9c16e4326c9c3ec87d9a0c6623219fd5c695fc01000000 is canceled from an actor c87d9a0c6623219fd5c695fc01000000 before it executes.

And here is the automated issue that flagged some failing postmerge tests: #44058 (comment)

codope added 2 commits October 3, 2025 06:33
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…n the RPC response itself

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@codope codope force-pushed the task-cancellation-exception branch from 0474ea9 to 8806b54 Compare October 3, 2025 06:34
@codope
Contributor Author

codope commented Oct 3, 2025

Let's hold off on merging this PR. I just noticed that test_actor_failures ran successfully in premerge for #56159 -- https://buildkite.com/ray-project/premerge/builds/50170#01999b55-e7a4-4f25-98bc-46c736f9c623/186-4343. I also ran the test on my devbox after reverting #57083, and it passed. So the original premise of #57083 might not be correct.

@Sparks0219 can you link the postmerge failure where test_actor_failures::test_exit_actor_queued failed? Or if you were able to repro locally, then the logs would be super helpful.

Discussed offline. There was a difference in my test env (the devbox uses a different instance type, and I think the instance type in CI was also updated recently). The issue is legit and reproducible. I have addressed all the other comments.

@jjyao jjyao requested a review from edoakes October 3, 2025 17:23
@jjyao
Collaborator

jjyao commented Oct 3, 2025

The issue is legit and reproducible.

@codope have you fixed the issue yet?

@israbbani
Contributor

Let's hold off on merging this PR. I just noticed that test_actor_failures ran successfully in premerge for #56159 -- https://buildkite.com/ray-project/premerge/builds/50170#01999b55-e7a4-4f25-98bc-46c736f9c623/186-4343. I also ran the test on my devbox after reverting #57083, and it passed. So the original premise of #57083 might not be correct.
@Sparks0219 can you link the postmerge failure where test_actor_failures::test_exit_actor_queued failed? Or if you were able to repro locally, then the logs would be super helpful.

Discussed offline. There was a difference in my test env (the devbox uses a different instance type, and I think the instance type in CI was also updated recently). The issue is legit and reproducible. I have addressed all the other comments.

I have a few questions about this:

  • Why did the difference in instance type cause the test results to be flaky?
  • If this test runs on premerge, how did it pass in the first place?

@codope
Contributor Author

codope commented Oct 4, 2025

@codope have you fixed the issue yet?

@jjyao Yes, this PR fixes the issue.

@codope
Contributor Author

codope commented Oct 4, 2025

Why did the difference in instance type cause the test results to be flaky?

I am not sure, but it is clearly a sign of a timing-dependent issue. The current fix uses reply.worker_exiting(), which travels atomically with the RPC response, so there is no race condition.
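
For illustration only, a tiny self-contained sketch of that idea; PushTaskReplyLike and ClassifyReply are made-up stand-ins (the real field is reply.worker_exiting() on the push-task reply). The point is that the "worker is exiting" signal arrives in the same reply as the cancellation status, so the submitter never has to race a separate death notification to choose the right error type.

#include <iostream>

// Made-up stand-in for the real push-task reply; only the two fields relevant
// to this discussion are modeled.
struct PushTaskReplyLike {
  bool scheduling_cancelled = false;
  bool worker_exiting = false;  // analogous to reply.worker_exiting()
};

enum class ErrorType { TASK_CANCELLED, ACTOR_DIED };

// The signal rides on the reply itself, so no separate (racy) lookup of the
// actor's death state is needed to classify the cancellation.
ErrorType ClassifyReply(const PushTaskReplyLike &reply) {
  if (reply.scheduling_cancelled && reply.worker_exiting) {
    return ErrorType::ACTOR_DIED;  // actor shutdown, surface RayActorError
  }
  return ErrorType::TASK_CANCELLED;  // user-initiated cancellation
}

int main() {
  std::cout << (ClassifyReply({true, true}) == ErrorType::ACTOR_DIED) << "\n";       // 1
  std::cout << (ClassifyReply({true, false}) == ErrorType::TASK_CANCELLED) << "\n";  // 1
}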

If this test runs on premerge, how did it pass in the first place?

I think the CI env must have changed between the premerge and the postmerge runs.

…rwise, add the task to wait_for_death_info_tasks_ to wait for the death notification from GCS within a grace period

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@jjyao
Collaborator

jjyao commented Oct 6, 2025

Test failures

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@jjyao jjyao requested a review from israbbani October 7, 2025 04:17
Contributor

@israbbani israbbani left a comment

LGTM. I left one comment. If we need to unblock v2.50, we can merge this and do that separately. @edoakes / @jjyao

Comment on lines +788 to +789
auto queue_pair = client_queues_.find(actor_id);
if (queue_pair != client_queues_.end()) {

This is a RAY_CHECK failure in the case where the actor is not dead. I don't like adding more RAY_CHECKs, but I don't see how we can recover if we're handling a PushTaskReply and the actor doesn't have a client_queue_.
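
To make the trade-off concrete, a small self-contained illustration (not the merged code): ActorIsDeadBranch mirrors the tolerant lookup in the quoted diff, while ActorIsDeadChecked treats a missing client queue as an unrecoverable invariant violation, with assert() standing in for Ray's RAY_CHECK.

#include <cassert>
#include <string>
#include <unordered_map>

struct ClientQueue {
  bool is_dead = false;  // simplified stand-in for the per-actor queue state
};

// Option A (what the quoted diff does): tolerate a missing queue and simply
// skip the dead-actor handling.
bool ActorIsDeadBranch(const std::unordered_map<std::string, ClientQueue> &queues,
                       const std::string &actor_id) {
  auto it = queues.find(actor_id);
  return it != queues.end() && it->second.is_dead;
}

// Option B (the suggestion in spirit): a PushTaskReply implies the submitter
// already created a queue for this actor, so a missing queue is a bug we
// cannot recover from. assert() stands in for RAY_CHECK here.
bool ActorIsDeadChecked(const std::unordered_map<std::string, ClientQueue> &queues,
                        const std::string &actor_id) {
  auto it = queues.find(actor_id);
  assert(it != queues.end() && "no client queue while handling a PushTaskReply");
  return it->second.is_dead;
}

int main() {
  std::unordered_map<std::string, ClientQueue> queues{{"a1", ClientQueue{true}}};
  return (ActorIsDeadBranch(queues, "a1") && ActorIsDeadChecked(queues, "a1")) ? 0 : 1;
}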

@edoakes
Collaborator

edoakes commented Oct 7, 2025

LGTM. I left one comment. If we need to unblock v2.50, we can merge this and do that separately. @edoakes / @jjyao

will do this to keep in @aslonnie's good graces

@edoakes edoakes merged commit f9cb400 into ray-project:master Oct 7, 2025
6 checks passed
aslonnie added a commit that referenced this pull request Oct 8, 2025
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
aslonnie added a commit that referenced this pull request Oct 8, 2025
…ath (#57538)

cherrypick #57095

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes #57092

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Josh Kodi <joshkodi@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: xgui <xgui@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ath (ray-project#57095)

Check if task cancellation is due to actor shutdown or explicit user
cancellation. Actor shutdown should raise RayActorError, not
TaskCancelledError.

Closes ray-project#57092

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>

Labels

  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests

Development

Successfully merging this pull request may close these issues.

[core] Tasks rejected during graceful shutdown should raise RayActorError

5 participants