[core][RDT] Fix data race when using async gpu to gpu transfer #57112
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
Code Review
This pull request refactors the mechanism for freeing GPU objects. It replaces the FreeActorObject C++ RPC with a Python-level RPC (__ray_free__) initiated from the object owner's process. This change centralizes the logic for freeing GPU objects in the Python GPUObjectManager.
While the overall direction of the refactoring is sound, the current implementation appears to be a work in progress. There are several debugging print statements that should be replaced with proper logging. More importantly, the core logic in __ray_free__ is commented out, which would prevent GPU objects from being freed. Additionally, there's an overly broad exception handler that could mask potential bugs.
I've left specific comments on these points. Please address them to ensure the new mechanism is robust and production-ready.
python/ray/experimental/gpu_object_manager/gpu_object_manager.py (outdated thread; resolved)
dayshah left a comment
Awesome, some nits but generally LGTM. You can't repro the crash anymore, right?
I'm wondering if we could write a test for this, but I'm not really sure how...
python/ray/experimental/gpu_object_manager/gpu_object_manager.py (two outdated threads; resolved)
```python
p2p_fn(tensor, comms[i], streams[i], peer_p2p_rank)
# Record the stream to avoid tensor being freed before the send/recv is completed.
torch_stream = torch.cuda.ExternalStream(streams[i].ptr)
tensor.record_stream(torch_stream)
```
I don't love that this is so deep in the stack, but I guess it doesn't apply to gloo / nixl, so it needs to be here?
Maybe we could do something like have a `tensor_transport_manager.wait_for_stream_on_send` and then wait on it until we finish.
I think the issue is that we can't get the stream at the `tensor_transport_manager` level, since the stream is maintained in this file.
Yes, I think this would also be fixed if we implement @dayshah's proposed change of implementing an NCCL backend directly on cupy instead of going through ray.util.collective.
Let's keep it for now and add a NOTE?
Actually on second thought, I think we can just keep this. The same bug can appear just in ray.util.collective code too.
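For context, here is a minimal standalone sketch of why `record_stream` is needed when work runs on a side stream. This is plain PyTorch, not Ray code, and it assumes a CUDA device is available; the real path wraps a cupy stream pointer in `torch.cuda.ExternalStream` as shown in the diff above.

```python
import torch

# The caching allocator only tracks work on a tensor's allocation stream, so
# work enqueued on a side stream must be recorded explicitly before the last
# Python reference is dropped.
send_stream = torch.cuda.Stream()
x = torch.randn(1 << 20, device="cuda")

with torch.cuda.stream(send_stream):
    y = x * 2  # stand-in for the async send kernel reading `x` on send_stream

# Tell the allocator that `x` is in use on send_stream; its memory block will
# not be handed out again until the work queued on send_stream (as of this
# call) has completed.
x.record_stream(send_stream)
del x  # without record_stream, this could let the block be reused too early
```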
```python
    src_actor.__ray_call__.options(concurrency_group="_ray_system").remote(
        __ray_free__, object_id
    )
except Exception:
```
Do we need this `except Exception`? In what cases will it get hit?
I think it will be useful after we add GC for GPU object metadata. In that case, `managed_gpu_object_metadata[object_id]` may raise a KeyError if the object metadata has already been cleaned up?
Ya, that makes sense. Can you just do something where you use `self.managed_gpu_object_metadata.get` instead?
Let's log the error for now. It's easy for these kinds of codepaths to fail in unexpected ways and we don't want to bring down the whole application from a bad assert.
Makes sense. We can log a "something went wrong while freeing" error plus the exception for now, and should also do the `get(..., None)` for the metadata when we implement metadata cleanup.
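To make the agreed behavior concrete, here is a small self-contained sketch of the "log, don't raise" pattern together with the `.get()` lookup discussed above. The function shape and the metadata table are illustrative, not the actual `GPUObjectManager` code.

```python
import logging

logger = logging.getLogger(__name__)


def free_object_primary_copy(gpu_object_metadata: dict, object_id: str) -> None:
    """Best-effort free of a GPU object's primary copy (illustrative sketch)."""
    try:
        # .get() instead of [] so this becomes a no-op once metadata GC has
        # already removed the entry, rather than raising KeyError.
        metadata = gpu_object_metadata.get(object_id)
        if metadata is None:
            return
        # In the real code this is where the __ray_free__ task would be
        # submitted to the source actor's _ray_system concurrency group.
        metadata["free"](object_id)
    except Exception:
        # Log and swallow: cleanup paths can fail in unexpected ways, and a
        # failed free should not bring down the whole application.
        logger.exception("Something went wrong while freeing GPU object %s", object_id)


# Usage with a fake metadata table.
free_object_primary_copy({"obj-1": {"free": lambda oid: None}}, "obj-1")
free_object_primary_copy({}, "obj-2")  # entry already cleaned up: no-op
```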
Yes, I can't reproduce the issue with the previous script now. @dayshah
stephanie-wang left a comment
Niiice!
Yes, that sounds good. Maybe there is a way to mock it. Otherwise we could try to launch a blocking kernel on the send stream...?
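One possible shape for such a test, as a rough sketch only: `torch.cuda._sleep` is a private PyTorch helper, the cycle count is arbitrary, and a real regression test would exercise Ray's actual send path rather than this stand-in computation.

```python
import torch

send_stream = torch.cuda.Stream()
x = torch.randn(1 << 20, device="cuda")
expected = (x + 1).cpu()  # reference result, computed up front on the default stream

with torch.cuda.stream(send_stream):
    torch.cuda._sleep(100_000_000)  # keep the send stream busy to widen the race window
    y = x + 1                       # stand-in for the async send reading `x`

x.record_stream(send_stream)        # the behavior under test
del x                               # without record_stream, x's block could be reused below

_ = torch.randn(1 << 20, device="cuda")  # allocation that could otherwise grab x's memory
send_stream.synchronize()
assert torch.equal(y.cpu(), expected)    # corrupted results here would indicate the race
```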
Signed-off-by: dayshah <dhyey2019@gmail.com>
dayshah left a comment
Fixed the test build + logging the exception now.
…roject#57112)

In this pr, we aim to do two main changes:
1. Move the gc task for gpu objects to the _ray_system thread (the same thread as ray_send and ray_recv) to control the execution order.
2. Use `torch.Tensor.record_stream` to record the send stream, making sure the tensor will not be freed before finishing the send task.

Signed-off-by: dayshah <dhyey2019@gmail.com>
Co-authored-by: dayshah <dhyey2019@gmail.com>
Why are these changes needed?
Previously, there was a possible data race: the task that frees a GPU object could run on a different thread than the send/recv tasks, so a tensor could be freed (and its memory reused) while the asynchronous GPU-to-GPU send on the send stream was still reading it.
In this PR, we aim to make two main changes (see the sketch after this list for the ordering idea behind the first change):
1. Move the GC task for GPU objects to the `_ray_system` thread (the same thread as `ray_send` and `ray_recv`) to control the execution order.
2. Use `torch.Tensor.record_stream` to record the send stream, making sure the tensor will not be freed before the send task finishes.
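A minimal sketch of the ordering idea behind change 1, using Ray's public concurrency-group API. The actor, method names, and the "system" group are illustrative; the real code routes `__ray_free__` through the internal `_ray_system` group on `__ray_call__`.

```python
import asyncio
import ray

ray.init()


@ray.remote(concurrency_groups={"system": 1})
class GpuActor:
    def __init__(self):
        self.log = []

    async def send(self, obj_id):
        await asyncio.sleep(0.1)  # pretend the send takes a while
        self.log.append(("send", obj_id))

    async def free(self, obj_id):
        self.log.append(("free", obj_id))

    async def get_log(self):
        return self.log


actor = GpuActor.remote()
# Both calls target the same concurrency group with max concurrency 1, so they
# run one at a time in submission order: the free cannot overtake the send.
actor.send.options(concurrency_group="system").remote("obj-0")
actor.free.options(concurrency_group="system").remote("obj-0")
print(ray.get(actor.get_log.options(concurrency_group="system").remote()))
# Expected output: [('send', 'obj-0'), ('free', 'obj-0')]
```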
Related issue number

Checks
- I've signed off every commit (using `git commit -s`) in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.

Note
Plumbs a new FreeActorObject flow: TaskManager triggers a callback to RPC the source actor to free a GPU object’s primary copy, with client/server handlers and a Python GPUObjectManager entry point.
- `TaskManager` free callback (`FreeActorObjectCallback`), passed from `CoreWorkerProcess` to `TaskManager`; `TaskManager` invokes the callback instead of inlined RPC logic.
- `_raylet.pyx` now calls `gpu_object_manager.free_object_primary_copy(...)`.
- `GPUObjectManager.free_object_primary_copy` added; calls the src actor's `__ray_free__` (stub in `gpu_object_store.py`) via the `_ray_system` concurrency group.
- `FreeActorObject` RPC in `core_worker.proto` and the gRPC service (`grpc_service.{h,cc}`), proxy (`core_worker_rpc_proxy.h`).
- `CoreWorkerClient{,Interface}` and fake client; mocks updated to drop old paths.
- (… `pop_object` output)

Written by Cursor Bugbot for commit db5243d.