Store and set the correct CUDA device in device_buffer #1370
Conversation
Maybe just get rid of
A few non-blocking questions.
```cpp
cuda_set_device_raii dev{_device};
allocate_async(size);
```
question: Should the setting of the current device live inside `allocate`/`deallocate`, rather than it being the responsibility of the caller to ensure the device is correct? Or is this deliberate because we might want more than just the allocate call to occur with the same device active, and this approach avoids excessive device switching?
Yeah, it's deliberate. I wanted to put it in `allocate_async`/`deallocate_async`, but those calls are often made in places where the correct device is also needed for other operations, and we don't want to `cuda_set_device_raii` multiple times. There are also places such as `resize`/`shrink_to_fit` where a new `device_buffer` is created and we want that to happen with the original device active, but inside it we call `allocate_async` and that would cause redundant current-device checking.
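The layering described here can be sketched in plain C++ (stub `get_device`/`set_device` stand in for `cudaGetDevice`/`cudaSetDevice`, and `buffer`, `device_guard` are hypothetical names modeled on `device_buffer` and `cuda_set_device_raii`, not RMM's actual code):

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for cudaGetDevice/cudaSetDevice so the sketch runs anywhere.
static int g_current_device = 0;
int get_device() { return g_current_device; }
void set_device(int id) { g_current_device = id; }

// Modeled on cuda_set_device_raii: activate a device, restore the old one on exit.
struct device_guard {
  int old_id;
  explicit device_guard(int id) : old_id{get_device()} { set_device(id); }
  ~device_guard() { set_device(old_id); }
};

// Hypothetical buffer following the PR's layering: the device is recorded at
// construction, and only the *public* entry points switch devices, so internal
// helpers like allocate_async() can assume the right device is already active.
struct buffer {
  int device_;
  std::size_t size_{0};

  explicit buffer(std::size_t size) : device_{get_device()} { resize(size); }

  void resize(std::size_t new_size) {
    device_guard guard{device_};  // switch once, at the API boundary
    allocate_async(new_size);     // runs with device_ active, no extra switch
  }

 private:
  void allocate_async(std::size_t n) {
    assert(get_device() == device_);  // guaranteed by the public caller
    size_ = n;
  }
};
```

Hoisting the guard into the public `resize` means an operation that calls `allocate_async` more than once internally still pays for at most one device switch.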
I think what I've arrived at is that, in order to minimize device switching, we want to do it at the highest level in `device_buffer` possible, which means the public API functions (when necessary). For the same reason, we assume the user has set the device before constructing the `device_buffer`, and we just store the ID at that stage.
```cpp
cuda_set_device_raii dev{_device};
auto tmp = device_buffer{new_capacity, stream, _mr};
```
question: Does the appearance of this pattern suggest that the `device_buffer` constructor should have an (optional) `device` argument that one can provide, rather than relying on the implicit current CUDA device (which is then managed by this RAII object here)?
If we did that, and we eliminated the `cuda_set_device_raii` here, then the constructor would have to first call `cudaSetDevice(device)`, and I assume it would do so using `cuda_set_device_raii`, which means on exiting the ctor the previous device would be reset (if different). So then we would need to call `cuda_set_device_raii` again after calling the constructor with the optional `device` argument, because of the subsequent `cudaMemcpyAsync`. That could mean two calls to `cudaGetDevice` and four calls to `cudaSetDevice` in the worst case. The way it is now, there is at most one `cudaGetDevice` and at most two `cudaSetDevice`.
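The call accounting above can be checked with a small model, where counters stand in for `cudaGetDevice`/`cudaSetDevice` (the guard type here is a hypothetical stand-in for `cuda_set_device_raii`):

```cpp
#include <cassert>

static int g_device = 0;
static int g_gets = 0;  // calls to the stand-in cudaGetDevice
static int g_sets = 0;  // calls to the stand-in cudaSetDevice

int get_device() { ++g_gets; return g_device; }
void set_device(int id) { ++g_sets; g_device = id; }

// Modeled on cuda_set_device_raii: one "get", and a "set" only when switching
// (plus one more "set" on destruction to restore the previous device).
struct device_guard {
  int old_id;
  bool switched;
  explicit device_guard(int id) : old_id{get_device()}, switched{id != old_id} {
    if (switched) set_device(id);
  }
  ~device_guard() {
    if (switched) set_device(old_id);
  }
};
```

A single guard spanning both the allocation and the copy costs at most one get and two sets; taking a fresh guard inside the constructor and another around the `cudaMemcpyAsync` would double both counts in the worst case.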
Hmm, my understanding from the docs was that runtime calls (excepting [some] of those to do with events, where the call has to happen with the live device matching the event's stream) don't care about the current device, and hence allocation/deallocation (which, with a pool MR, record events) are the only places we need to handle it.
@jrhemstad @wence- do either of you want to opine on the existential question I asked in the description of this PR?
I presume you mean:
I think my preference is to mark as

Edit: I think Jake is also in favour of removal: #1370 (comment)
Yes, I did a lot of that eradication work. But we didn't eliminate all `device_vector` from RAPIDS, especially in tests. In fact a search shows that cuGraph still uses `device_vector`.

I actually think an initialized vector is useful, as long as you know about its synchronizing behavior. So I don't really want to remove `rmm::device_vector`.

But I guess what you are saying is that you think it's OK for `rmm::device_uvector` and `rmm::device_vector` to have different semantics. I agree.
Could you please edit the PR description to summarise the outcome around `device_vector` (rather than mentioning it as an issue to resolve)?
Done.
/merge
This PR removes static checks for serialization size. Upstream changes like rapidsai/rmm#1370 have altered these sizes and break RAFT CI. An alternative approach to verifying serialization will be developed.

Authors:
- Corey J. Nolet (https://github.com/cjnolet)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Divye Gala (https://github.com/divyegala)
- Mark Harris (https://github.com/harrism)

URL: #1997
Since rapidsai#1370, the dtor for device_buffer ensures that the correct device is active when the deallocation occurs. We therefore update the example to discuss this. Since device_vector still requires the user to manage the active device correctly by hand, call this out explicitly in the documentation. - Closes rapidsai#1523
…#1524) Since #1370, the dtor for device_buffer ensures that the correct device is active when the deallocation occurs. We therefore update the example to discuss this. Since device_vector still requires the user to manage the active device correctly by hand, call this out explicitly in the documentation.

- Closes #1523

Authors:
- Lawrence Mitchell (https://github.com/wence-)

Approvers:
- Mark Harris (https://github.com/harrism)

URL: #1524
Description

This changes `device_buffer` to store the active CUDA device ID on creation, and (possibly temporarily) set the active device to that ID before allocating or freeing memory. It also adds tests for containers built on `device_buffer` (`device_buffer`, `device_uvector` and `device_scalar`) that ensure correct operation when the device is changed before doing things that alloc/dealloc memory for those containers.

This fixes #1342. HOWEVER, there is an important question yet to answer: `rmm::device_vector` is just an alias for `thrust::device_vector`, which does not use `rmm::device_buffer` for storage. However, users may be surprised after this PR because the multidevice semantics of RMM containers will be different from `thrust::device_vector` (and therefore `rmm::device_vector`).

Update: opinion is that it's probably OK to diverge from `device_vector`, and some think we should remove `rmm::device_vector`. While we discuss this I have set the DO NOT MERGE label.

Checklist
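The core behaviour this PR adds — the destructor reactivating the device that was current at construction, even if the caller has since switched — can be modelled in plain C++ (stub `get_device`/`set_device` stand in for the CUDA runtime; `buffer` and `device_guard` are hypothetical names, not RMM's actual types):

```cpp
#include <cassert>

static int g_device = 0;
static int g_free_device = -1;  // records which device a free ran on

int get_device() { return g_device; }
void set_device(int id) { g_device = id; }
void free_on_current_device() { g_free_device = g_device; }

// Modeled on cuda_set_device_raii: activate a device, restore the old one on exit.
struct device_guard {
  int old_id;
  explicit device_guard(int id) : old_id{get_device()} { set_device(id); }
  ~device_guard() { set_device(old_id); }
};

// Hypothetical model of the PR's device_buffer: remember the device that was
// active at construction and reactivate it for deallocation in the destructor.
struct buffer {
  int device_;
  buffer() : device_{get_device()} {}
  ~buffer() {
    device_guard g{device_};
    free_on_current_device();  // the real code would call deallocate_async here
  }
};
```

Note the caller's current device is left untouched after the destructor runs, which is what lets code on another device destroy the buffer safely.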