[core][RDT] Fix data race when using async gpu to gpu transfer #57112
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
Code Review
This pull request refactors the mechanism for freeing GPU objects. It replaces the FreeActorObject C++ RPC with a Python-level RPC (__ray_free__) initiated from the object owner's process. This change centralizes the logic for freeing GPU objects in the Python GPUObjectManager.
While the overall direction of the refactoring is sound, the current implementation appears to be a work in progress. There are several debugging print statements that should be replaced with proper logging. More importantly, the core logic in __ray_free__ is commented out, which would prevent GPU objects from being freed. Additionally, there's an overly broad exception handler that could mask potential bugs.
I've left specific comments on these points. Please address them to ensure the new mechanism is robust and production-ready.
python/ray/experimental/gpu_object_manager/gpu_object_manager.py (outdated thread; resolved)
dayshah left a comment
Awesome, some nits but generally LGTM. You can't repro the crash anymore, right?
I'm wondering if we could write a test for this, but I'm not really sure how...
python/ray/experimental/gpu_object_manager/gpu_object_manager.py (two outdated threads; resolved)
```python
p2p_fn(tensor, comms[i], streams[i], peer_p2p_rank)
# Record the stream to avoid tensor being freed before the send/recv is completed.
torch_stream = torch.cuda.ExternalStream(streams[i].ptr)
tensor.record_stream(torch_stream)
```
I don't love that this is so deep in the stack, but I guess it doesn't apply to gloo / nixl, so it needs to be here?
Maybe we could do something like have a `tensor_transport_manager.wait_for_stream_on_send` and then wait on it until we finish.
I think the issue is that we can't get the stream at the `tensor_transport_manager` level, since the stream is maintained in this file.
Yes, I think this would also be fixed if we implement @dayshah's proposed change of implementing an NCCL backend directly on cupy instead of going through ray.util.collective.
Let's keep it for now and add a NOTE?
Actually on second thought, I think we can just keep this. The same bug can appear just in ray.util.collective code too.
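For context, here is a minimal standalone sketch of why `record_stream` is needed when work runs on a side stream. This is plain PyTorch, not Ray code, and it assumes a CUDA device is available; the real path wraps a cupy stream pointer in `torch.cuda.ExternalStream` as shown in the diff above.

```python
import torch

# The caching allocator only tracks work on a tensor's allocation stream, so
# work enqueued on a side stream must be recorded explicitly before the last
# Python reference is dropped.
send_stream = torch.cuda.Stream()
x = torch.randn(1 << 20, device="cuda")

with torch.cuda.stream(send_stream):
    y = x * 2  # stand-in for the async send kernel reading `x` on send_stream

# Tell the allocator that `x` is in use on send_stream; its memory block will
# not be handed out again until the work queued on send_stream (as of this
# call) has completed.
x.record_stream(send_stream)
del x  # without record_stream, this could let the block be reused too early
```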
```python
    src_actor.__ray_call__.options(concurrency_group="_ray_system").remote(
        __ray_free__, object_id
    )
except Exception:
```
Do we need this `except Exception`? In what cases will it get hit?
I think it will be useful after we add GC for GPU object metadata. In that case, `managed_gpu_object_metadata[object_id]` may raise a KeyError if the object metadata has already been cleaned up?
Ya, that makes sense. Can you just do something where you use `self.managed_gpu_object_metadata.get` instead?
Let's log the error for now. It's easy for these kinds of codepaths to fail in unexpected ways and we don't want to bring down the whole application from a bad assert.
Makes sense. We can log a "something went wrong while freeing" error plus the exception for now, and should also do the `get(..., None)` for the metadata when we implement metadata cleanup.
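To make the agreed behavior concrete, here is a small self-contained sketch of the "log, don't raise" pattern together with the `.get()` lookup discussed above. The function shape and the metadata table are illustrative, not the actual `GPUObjectManager` code.

```python
import logging

logger = logging.getLogger(__name__)


def free_object_primary_copy(gpu_object_metadata: dict, object_id: str) -> None:
    """Best-effort free of a GPU object's primary copy (illustrative sketch)."""
    try:
        # .get() instead of [] so this becomes a no-op once metadata GC has
        # already removed the entry, rather than raising KeyError.
        metadata = gpu_object_metadata.get(object_id)
        if metadata is None:
            return
        # In the real code this is where the __ray_free__ task would be
        # submitted to the source actor's _ray_system concurrency group.
        metadata["free"](object_id)
    except Exception:
        # Log and swallow: cleanup paths can fail in unexpected ways, and a
        # failed free should not bring down the whole application.
        logger.exception("Something went wrong while freeing GPU object %s", object_id)


# Usage with a fake metadata table.
free_object_primary_copy({"obj-1": {"free": lambda oid: None}}, "obj-1")
free_object_primary_copy({}, "obj-2")  # entry already cleaned up: no-op
```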
Yes, I can't reproduce the issue with the previous script now. @dayshah
stephanie-wang left a comment
Niiice!
Yes, that sounds good. Maybe there is a way to mock it. Otherwise we could try to launch a blocking kernel on the send stream...?
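One possible shape for such a test, as a rough sketch only: `torch.cuda._sleep` is a private PyTorch helper, the cycle count is arbitrary, and a real regression test would exercise Ray's actual send path rather than this stand-in computation.

```python
import torch

send_stream = torch.cuda.Stream()
x = torch.randn(1 << 20, device="cuda")
expected = (x + 1).cpu()  # reference result, computed up front on the default stream

with torch.cuda.stream(send_stream):
    torch.cuda._sleep(100_000_000)  # keep the send stream busy to widen the race window
    y = x + 1                       # stand-in for the async send reading `x`

x.record_stream(send_stream)        # the behavior under test
del x                               # without record_stream, x's block could be reused below

_ = torch.randn(1 << 20, device="cuda")  # allocation that could otherwise grab x's memory
send_stream.synchronize()
assert torch.equal(y.cpu(), expected)    # corrupted results here would indicate the race
```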
Signed-off-by: dayshah <dhyey2019@gmail.com>
dayshah left a comment
Fixed the test build + logging the exception now.
…roject#57112)

In this pr, we aim to do two main changes:
1. Move the gc task for gpu objects to the _ray_system thread (the same thread as ray_send and ray_recv) to control the execution order.
2. Use `torch.Tensor.record_stream` to record the send stream, making sure the tensor will not be freed before finishing the send task.

Signed-off-by: dayshah <dhyey2019@gmail.com>
Co-authored-by: dayshah <dhyey2019@gmail.com>
Why are these changes needed?
Previously, there was a possible data race: the task that frees a GPU object could run on a different thread than the send/recv tasks, so a tensor could be freed (and its memory reused) while the asynchronous GPU-to-GPU send on the send stream was still reading it.
In this PR, we aim to make two main changes (see the sketch after this list for the ordering idea behind the first change):
1. Move the GC task for GPU objects to the `_ray_system` thread (the same thread as `ray_send` and `ray_recv`) to control the execution order.
2. Use `torch.Tensor.record_stream` to record the send stream, making sure the tensor will not be freed before the send task finishes.
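A minimal sketch of the ordering idea behind change 1, using Ray's public concurrency-group API. The actor, method names, and the "system" group are illustrative; the real code routes `__ray_free__` through the internal `_ray_system` group on `__ray_call__`.

```python
import asyncio
import ray

ray.init()


@ray.remote(concurrency_groups={"system": 1})
class GpuActor:
    def __init__(self):
        self.log = []

    async def send(self, obj_id):
        await asyncio.sleep(0.1)  # pretend the send takes a while
        self.log.append(("send", obj_id))

    async def free(self, obj_id):
        self.log.append(("free", obj_id))

    async def get_log(self):
        return self.log


actor = GpuActor.remote()
# Both calls target the same concurrency group with max concurrency 1, so they
# run one at a time in submission order: the free cannot overtake the send.
actor.send.options(concurrency_group="system").remote("obj-0")
actor.free.options(concurrency_group="system").remote("obj-0")
print(ray.get(actor.get_log.options(concurrency_group="system").remote()))
# Expected output: [('send', 'obj-0'), ('free', 'obj-0')]
```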
Related issue number

Checks
- I've signed off every commit (using `git commit -s`) in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.

Note
Plumbs a new FreeActorObject flow: TaskManager triggers a callback to RPC the source actor to free a GPU object’s primary copy, with client/server handlers and a Python GPUObjectManager entry point.
- `TaskManager` free callback (`FreeActorObjectCallback`), passed from `CoreWorkerProcess` to `TaskManager`; `TaskManager` invokes the callback instead of inlined RPC logic.
- `_raylet.pyx` now calls `gpu_object_manager.free_object_primary_copy(...)`.
- `GPUObjectManager.free_object_primary_copy` added; calls the src actor's `__ray_free__` (stub in `gpu_object_store.py`) via the `_ray_system` concurrency group.
- `FreeActorObject` RPC in `core_worker.proto` and the gRPC service (`grpc_service.{h,cc}`), proxy (`core_worker_rpc_proxy.h`).
- `CoreWorkerClient{,Interface}` and fake client; mocks updated to drop old paths.
- (… `pop_object` output)

Written by Cursor Bugbot for commit db5243d.