[core][RDT] Fix nixl garbage collection after the object is freed #57138

Qiaolin-Yu · 2025-10-02T21:00:57Z

Why are these changes needed?

Previously, we didn't deregister the tensor from nixl_agent after the tensor was freed, which causes errors in the NIXL agent.
In this PR, we invalidate the metadata after the tensor is freed.

Note that this PR is based on #57112, so it should be merged after that one.

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: dayshah <dhyey2019@gmail.com>

gemini-code-assist

Code Review

This pull request refactors the GPU object freeing mechanism by replacing the FreeActorObject RPC with a callback-based approach, which is a good simplification. It also introduces a garbage_collect method to the tensor transport interface to handle backend-specific cleanup, specifically for nixl to deregister memory. The changes are well-structured, but I've identified a potential AttributeError for non-NIXL transports and a minor improvement for exception handling. My review includes suggestions to address these points.
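The AttributeError concern in the review above can be illustrated with a minimal sketch. Only the name garbage_collect comes from the review; the classes TensorTransport, NcclTransport, NixlTransport and the helper free_object are hypothetical stand-ins, not the actual Ray classes. The sketch assumes the base interface ships a no-op default, so the free path can call the hook on any transport without checking whether it exists:

```python
from typing import Any, Dict


class TensorTransport:
    """Hypothetical base interface for a tensor transport backend."""

    def garbage_collect(self, obj_id: str) -> None:
        # Default: nothing to clean up. Backends override this only if
        # they hold per-object resources.
        return None


class NcclTransport(TensorTransport):
    """Hypothetical backend with no per-object cleanup; inherits the no-op."""


class NixlTransport(TensorTransport):
    """Hypothetical backend that tracks registered memory per object."""

    def __init__(self) -> None:
        self.registered: Dict[str, Any] = {}

    def register(self, obj_id: str, descriptor: Any) -> None:
        self.registered[obj_id] = descriptor

    def garbage_collect(self, obj_id: str) -> None:
        # Drop the registration once the object has been freed.
        self.registered.pop(obj_id, None)


def free_object(transport: TensorTransport, obj_id: str) -> None:
    # Safe for every transport because the base class defines the hook.
    transport.garbage_collect(obj_id)
```

With this shape, free_object(NcclTransport(), "x") simply does nothing, while the NIXL-like backend drops its registration.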

python/ray/experimental/gpu_object_manager/gpu_object_store.py

python/ray/experimental/collective/nixl_tensor_transport.py

python/ray/experimental/gpu_object_manager/gpu_object_manager.py

dayshah · 2025-10-04T08:20:35Z

python/ray/util/collective/collective_group/nixl_backend.py


        nixl_agent.release_xfer_handle(xfer_handle)
-        nixl_agent.deregister_memory(local_descs)
+        nixl_agent.remove_remote_agent(remote_name)


I'm a little confused about why we would remove the agent but not deregister the memory if the send is sync.

We'll re-register the memory on every send anyway.

Qiaolin-Yu · 2025-10-06 (edited)

deregister_memory should be called by the same agent that calls register_memory. In our case, that is the sender, so I added the call to the gc function.
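The ownership rule described in this reply can be sketched as follows. FakeAgent is a stand-in that only records which regions it registered; it borrows the method names register_memory and deregister_memory from the thread but does not reproduce the real NIXL signatures. Sender, prepare_send and the integer "region" handles are hypothetical names for illustration:

```python
from typing import Dict, Set


class FakeAgent:
    """Stand-in for a NIXL agent; only tracks regions it registered itself."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.regions: Set[int] = set()

    def register_memory(self, region: int) -> int:
        self.regions.add(region)
        return region  # acts as the descriptor handle in this sketch

    def deregister_memory(self, region: int) -> None:
        if region not in self.regions:
            raise RuntimeError(f"{self.name} never registered region {region:#x}")
        self.regions.remove(region)


class Sender:
    """Hypothetical sender that owns both registration and deregistration."""

    def __init__(self) -> None:
        self.agent = FakeAgent("sender")
        self.descs: Dict[str, int] = {}  # obj_id -> descriptor handle

    def prepare_send(self, obj_id: str, region: int) -> None:
        self.descs[obj_id] = self.agent.register_memory(region)

    def garbage_collect(self, obj_id: str) -> None:
        # Called after the object is freed; deregisters with the same agent
        # that registered the memory. A second call is a harmless no-op.
        desc = self.descs.pop(obj_id, None)
        if desc is not None:
            self.agent.deregister_memory(desc)
```

In this toy model, a different agent (for example a receiver) that tries to deregister the region raises an error, which mirrors why the call moved out of the receive path and into the sender's GC hook.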

stephanie-wang

This looks good assuming it passes manual testing for now :s

I just realized, though - I think there can still be a race condition since the GC runs asynchronously. Could try to fix it with a wait_tensor_free call when the user returns a tensor with NIXL transport?

dayshah · 2025-10-07T08:18:34Z

> I just realized, though - I think there can still be a race condition since the GC runs asynchronously. Could try to fix it with a wait_tensor_free call when the user returns a tensor with NIXL transport?

Confused on what the race is, the NIXL recv is sync right, so as long as the borrower / consumer isn't done with the ref we won't GC/deregister on the sender.

stephanie-wang · 2025-10-07T16:47:28Z

> > I just realized, though - I think there can still be a race condition since the GC runs asynchronously. Could try to fix it with a wait_tensor_free call when the user returns a tensor with NIXL transport?
>
> Confused on what the race is, the NIXL recv is sync right, so as long as the borrower / consumer isn't done with the ref we won't GC/deregister on the sender.

Ah, as I understood it, the bug here is a bit different from the NCCL one. It happens because NIXL doesn't allow the same memory region to be registered twice. It occurred when we allocated a tensor on the sender and registered it, and then, before we could GC it, torch allocated the same physical memory to another tensor.

But actually I think I was wrong. It seems like this should work, because torch shouldn't allocate the memory to another tensor until GC runs. So ignore me :D
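The failure mode described in this comment can be reproduced in a toy model. Registry and ToyAllocator below are hypothetical stand-ins: the registry rejects a second registration of the same address (as the comment says NIXL does), and the allocator hands back the most recently freed address first. Neither models the real NIXL agent or the PyTorch caching allocator; the sketch only shows why deregistering before the memory is reused avoids the error:

```python
from typing import List, Set


class Registry:
    """Toy registry that refuses to register the same address twice."""

    def __init__(self) -> None:
        self._regions: Set[int] = set()

    def register(self, addr: int) -> None:
        if addr in self._regions:
            raise ValueError(f"region {addr:#x} is already registered")
        self._regions.add(addr)

    def deregister(self, addr: int) -> None:
        self._regions.discard(addr)


class ToyAllocator:
    """Toy allocator that reuses the most recently freed address first."""

    def __init__(self) -> None:
        self._free: List[int] = []
        self._next = 0x1000

    def alloc(self) -> int:
        if self._free:
            return self._free.pop()
        addr = self._next
        self._next += 0x100
        return addr

    def free(self, addr: int) -> None:
        self._free.append(addr)


def send_twice(deregister_on_free: bool) -> bool:
    """Return True if a second tensor at the reused address registers cleanly."""
    registry, allocator = Registry(), ToyAllocator()
    first = allocator.alloc()
    registry.register(first)
    if deregister_on_free:
        registry.deregister(first)  # the cleanup step this PR adds to GC
    allocator.free(first)
    second = allocator.alloc()  # the toy allocator returns the same address
    try:
        registry.register(second)
    except ValueError:
        return False
    return True
```

Here send_twice(False) models the behavior before the fix, and send_twice(True) models deregistration happening in GC before the memory can be handed to another tensor.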

…y-project#57138) Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>

…y-project#57138) Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com> Signed-off-by: xgui <xgui@anyscale.com>

…y-project#57138) Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>

Qiaolin-Yu and others added 8 commits October 1, 2025 19:20

- temp (db5243d)
- fix deadlock (2963e99), Signed-off-by: dayshah <dhyey2019@gmail.com>
- upd (b96f79f)
- temp (2d04c30)
- upd (57a15e0)
- refine (8b8ce0d)
- Merge branch 'fix_rdt' into fix_nixl (63394b8)
- refine (defe92b)

Qiaolin-Yu requested a review from a team as a code owner October 2, 2025 21:00

Qiaolin-Yu assigned dayshah, Qiaolin-Yu and stephanie-wang Oct 2, 2025

Qiaolin-Yu requested review from dayshah and stephanie-wang October 2, 2025 21:01

Qiaolin-Yu added the rdt Ray Direct Transport label Oct 2, 2025

gemini-code-assist bot reviewed Oct 2, 2025

python/ray/experimental/gpu_object_manager/gpu_object_store.py (outdated)

python/ray/experimental/collective/nixl_tensor_transport.py (outdated)

python/ray/experimental/gpu_object_manager/gpu_object_manager.py (outdated)

Qiaolin-Yu added 3 commits October 2, 2025 14:19

- Merge branch 'master' into fix_nixl (fbae7b6)
- fix (63f80dd)
- fix (099652c)

refine

631b56a

Qiaolin-Yu changed the title ~~Fix nixl garbage collection in RDT~~ [core][RDT] Fix nixl garbage collection after the object is freed Oct 2, 2025

ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 3, 2025

temp

950c536

Signed-off-by: dayshah <dhyey2019@gmail.com>

dayshah reviewed Oct 4, 2025

dayshah added the go add ONLY when ready to merge, run all tests label Oct 5, 2025

stephanie-wang approved these changes Oct 7, 2025

Merge branch 'master' into fix_nixl

3fa8c22

Signed-off-by: dayshah <dhyey2019@gmail.com>

dayshah approved these changes Oct 7, 2025

dayshah merged commit f8732a1 into ray-project:master Oct 7, 2025
6 checks passed

aslonnie added a commit that referenced this pull request Oct 8, 2025

core code cherrypicks

8e5b203

cherrypick #57247 #57253 #57138 Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

aslonnie mentioned this pull request Oct 8, 2025

more core code cherrypicks #57557

Merged

aslonnie added a commit that referenced this pull request Oct 8, 2025

more core code cherrypicks (#57557)

e46fe7e

cherrypick #57247 #57253 #57138 Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

Qiaolin-Yu deleted the fix_nixl branch October 9, 2025 16:36

liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025

[core][RDT] Fix nixl garbage collection after the object is freed (ra…

242eb99

…y-project#57138) Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com>

justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025

[core][RDT] Fix nixl garbage collection after the object is freed (ra…

fee3107

…y-project#57138) Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com>
