Skip to content

Conversation

@Qiaolin-Yu
Copy link
Member

Why are these changes needed?

Previously, we didn't deregister the tensor from nixl_agent after the tensor has been freed, this will cause error in nixl agent.
In this pr, we invalidate the metadata after the tensor is freed.

Note that this pr is based on #57112, so it should be merged after that.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Qiaolin-Yu Qiaolin-Yu requested a review from a team as a code owner October 2, 2025 21:00
@Qiaolin-Yu Qiaolin-Yu added the rdt Ray Direct Transport label Oct 2, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the GPU object freeing mechanism by replacing the FreeActorObject RPC with a callback-based approach, which is a good simplification. It also introduces a garbage_collect method to the tensor transport interface to handle backend-specific cleanup, specifically for nixl to deregister memory. The changes are well-structured, but I've identified a potential AttributeError for non-NIXL transports and a minor improvement for exception handling. My review includes suggestions to address these points.

cursor[bot]

This comment was marked as outdated.

cursor[bot]

This comment was marked as outdated.

cursor[bot]

This comment was marked as outdated.

@Qiaolin-Yu Qiaolin-Yu changed the title Fix nixl garbage collection in RDT [core][RDT] Fix nixl garbage collection after the object is freed Oct 2, 2025
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 3, 2025
Signed-off-by: dayshah <dhyey2019@gmail.com>

nixl_agent.release_xfer_handle(xfer_handle)
nixl_agent.deregister_memory(local_descs)
nixl_agent.remove_remote_agent(remote_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm a little confused on why we would remove the agent but not deregister the memory if the send is sync

we'll re-register the memory on every send anyways

Copy link
Member Author

@Qiaolin-Yu Qiaolin-Yu Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deregister_memory should be called by the same agent which calls register_memory. In our case, it should be called by the sender, I added it in the gc function.

@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Oct 5, 2025
Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good assuming it passes manual testing for now :s

I just realized, though - I think there can still be a race condition since the GC runs asynchronously. Could try to fix it with a wait_tensor_free call when the user returns a tensor with NIXL transport?

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah
Copy link
Contributor

dayshah commented Oct 7, 2025

I just realized, though - I think there can still be a race condition since the GC runs asynchronously. Could try to fix it with a wait_tensor_free call when the user returns a tensor with NIXL transport?

Confused on what the race is, the NIXL recv is sync right, so as long as the borrower / consumer isn't done with the ref we won't GC/deregister on the sender.

@stephanie-wang
Copy link
Contributor

I just realized, though - I think there can still be a race condition since the GC runs asynchronously. Could try to fix it with a wait_tensor_free call when the user returns a tensor with NIXL transport?

Confused on what the race is, the NIXL recv is sync right, so as long as the borrower / consumer isn't done with the ref we won't GC/deregister on the sender.

Ah as I understood it, the bug here is a bit different from the NCCL one. It happens because NIXL doesn't allow the same memory region to be registered twice. It was happening if we allocate a tensor on the sender, register it, then before we can GC it, torch allocates the same physical memory to another tensor.

But actually I think I was wrong, seems like this should work because torch shouldn't allocate the memory to another tensor until GC runs. So ignore me :D

@dayshah dayshah merged commit f8732a1 into ray-project:master Oct 7, 2025
6 checks passed
aslonnie added a commit that referenced this pull request Oct 8, 2025
cherrypick #57247 #57253 #57138

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
aslonnie added a commit that referenced this pull request Oct 8, 2025
cherrypick #57247 #57253 #57138

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
@Qiaolin-Yu Qiaolin-Yu deleted the fix_nixl branch October 9, 2025 16:36
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
…y-project#57138)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Co-authored-by: dayshah <dhyey2019@gmail.com>
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
…y-project#57138)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Co-authored-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Josh Kodi <joshkodi@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…y-project#57138)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Co-authored-by: dayshah <dhyey2019@gmail.com>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
…y-project#57138)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Co-authored-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: xgui <xgui@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…y-project#57138)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Co-authored-by: dayshah <dhyey2019@gmail.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…y-project#57138)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Co-authored-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests rdt Ray Direct Transport

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants