-
Notifications
You must be signed in to change notification settings - Fork 6.9k
[core][RDT] Fix nixl garbage collection after the object is freed #57138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors the GPU object freeing mechanism by replacing the FreeActorObject RPC with a callback-based approach, which is a good simplification. It also introduces a garbage_collect method to the tensor transport interface to handle backend-specific cleanup, specifically for nixl to deregister memory. The changes are well-structured, but I've identified a potential AttributeError for non-NIXL transports and a minor improvement for exception handling. My review includes suggestions to address these points.
python/ray/experimental/gpu_object_manager/gpu_object_manager.py
Outdated
Show resolved
Hide resolved
|
|
||
| nixl_agent.release_xfer_handle(xfer_handle) | ||
| nixl_agent.deregister_memory(local_descs) | ||
| nixl_agent.remove_remote_agent(remote_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm a little confused on why we would remove the agent but not deregister the memory if the send is sync
we'll re-register the memory on every send anyways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deregister_memory should be called by the same agent which calls register_memory. In our case, it should be called by the sender, I added it in the gc function.
stephanie-wang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good assuming it passes manual testing for now :s
I just realized, though - I think there can still be a race condition since the GC runs asynchronously. Could try to fix it with a wait_tensor_free call when the user returns a tensor with NIXL transport?
Signed-off-by: dayshah <dhyey2019@gmail.com>
Confused on what the race is, the NIXL recv is sync right, so as long as the borrower / consumer isn't done with the ref we won't GC/deregister on the sender. |
Ah as I understood it, the bug here is a bit different from the NCCL one. It happens because NIXL doesn't allow the same memory region to be registered twice. It was happening if we allocate a tensor on the sender, register it, then before we can GC it, torch allocates the same physical memory to another tensor. But actually I think I was wrong, seems like this should work because torch shouldn't allocate the memory to another tensor until GC runs. So ignore me :D |
…y-project#57138) Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com>
…y-project#57138) Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
…y-project#57138) Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com>
…y-project#57138) Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com> Signed-off-by: xgui <xgui@anyscale.com>
…y-project#57138) Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com>
…y-project#57138) Signed-off-by: dayshah <dhyey2019@gmail.com> Co-authored-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Why are these changes needed?
Previously, we didn't deregister the tensor from nixl_agent after the tensor has been freed, this will cause error in nixl agent.
In this pr, we invalidate the metadata after the tensor is freed.
Note that this pr is based on #57112, so it should be merged after that.
Related issue number
Checks
git commit -s) in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.