Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store and set the correct CUDA device in device_buffer #1370

Merged

Conversation

harrism
Copy link
Member

@harrism harrism commented Nov 2, 2023

Description

This changes device_buffer to store the active CUDA device ID on creation, and (possibly temporarily) set the active device to that ID before allocating or freeing memory. It also adds tests for containers built on device_buffer (device_buffer, device_uvector and device_scalar) that ensure correct operation when the device is changed before doing things that alloc/dealloc memory for those containers.

This fixes #1342 . HOWEVER, there is an important question yet to answer:

rmm::device_vector is just an alias for thrust::device_vector, which does not use rmm::device_buffer for storage. However users may be surprised after this PR because the multidevice semantics of RMM containers will be different from thrust::device_vector (and therefore rmm::device_vector).

Update: opinion is that it's probably OK to diverge from device_vector, and some think we should remove rmm::device_vector.

While we discuss this I have set the DO NOT MERGE label.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@harrism harrism added feature request New feature or request non-breaking Non-breaking change 5 - DO NOT MERGE Hold off on merging; see PR for details cpp Pertains to C++ code labels Nov 2, 2023
@harrism harrism requested a review from jrhemstad November 2, 2023 03:08
@harrism harrism self-assigned this Nov 2, 2023
@harrism harrism requested review from a team as code owners November 2, 2023 03:08
@harrism harrism requested a review from vyasr November 2, 2023 03:08
@github-actions github-actions bot added the CMake label Nov 2, 2023
@jrhemstad
Copy link
Contributor

Maybe just get rid of rmm::device_vector? We went out of our way to get rid of all uses of that in favor of device_uvector anyways.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few non-blocking questions.

include/rmm/cuda_device.hpp Outdated Show resolved Hide resolved
include/rmm/device_buffer.hpp Show resolved Hide resolved
Comment on lines +108 to 109
cuda_set_device_raii dev{_device};
allocate_async(size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Should the setting of the current device live inside allocate/deallocate rather than it being the responsibility of the caller to ensure the device is correct? Or, is this deliberate because we might want more than just the allocate call to occur with the same device active and this approach avoids excessive device switching?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's deliberate. I wanted to put it in allocate_async/deallocate_async, but those calls are often made in places where the correct device is also needed for other operations, and we don't want to cuda_set_device_raii multiple times. There are also places such as resize / shrink_to_fit where a new device_buffer is created and we want that to happen with the original device active, but inside it we call allocate_async and that would cause redundant current device checking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what I've arrived at is that in order to minimize device switching, we want to do it at the highest level in device_buffer possible, which means the public API functions (when necessary). For the same reason, we assume the user has set the device before constructing the device_buffer, and we just store the ID at that stage.

Comment on lines +270 to 271
cuda_set_device_raii dev{_device};
auto tmp = device_buffer{new_capacity, stream, _mr};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Does the appearance of this pattern suggest that the device_buffer constructor should have an (optional) device argument that one can provide, rather than relying on the implicit current cuda device (which is then managed by this raii object here)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did that, then if we are eliminating the cuda_set_device_raii here, then the constructor would have to first call cudaSetDevice(device), and I assume it would do so using cuda_set_device_raii, which means on exiting the ctor the previous device would be reset (if different).

So then we would need to call cuda_set_device_raii again after calling the constructor with the optional device argument because of the subsequent cudaMemcpyAsync. That could mean two calls to cudaGetDevice and four calls to cudaSetDevice, worst case. The way it is now, there is at most 1 cudaGetDevice and at most 2 cudaSetDevice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, my understanding from the docs was that runtime calls (excepting [some] of those to do with events, where the call has to happen with the live device matching the event's stream) don't care about the current device and hence allocation/deallocation (which, with a pool mr record events) are the only places we need to handle it.

@harrism
Copy link
Member Author

harrism commented Nov 7, 2023

@jrhemstad @wence- do either of you want to opine on the existential question I asked in the description of this PR?

@wence-
Copy link
Contributor

wence- commented Nov 7, 2023

@jrhemstad @wence- do either of you want to opine on the existential question I asked in the description of this PR?

I presume you mean:

rmm::device_vector is just an alias for thrust::device_vector, which does not use rmm::device_buffer for storage. However users may be surprised after this PR because the multidevice semantics of RMM containers will be different from thrust::device_vector (and therefore rmm::device_vector).

I think my preference is to mark as [[deprecated(...)]] for 23.12 and then remove in 24.02. We can add an example about how to use thrust vectors with an RMM memory resource.

Edit: I think Jake is also in favour of removal: #1370 (comment)

@harrism
Copy link
Member Author

harrism commented Nov 8, 2023

Maybe just get rid of rmm::device_vector? We went out of our way to get rid of all uses of that in favor of device_uvector anyways.

Yes, I did a lot of that eradication work. But we didn't eliminate all device_vector from rapids, especially in tests. In fact a search shows that cuGraph still uses thrust::device_vector, not just rmm::device_vector.

I actually think an initialized vector is useful, as long as you know about its synchronizing behavior. So I don't really want to remove rmm::device_vector.

But I guess what you are saying is that you think it's OK for rmm::device_uvector and rmm::device_vector to have different semantics. I agree.

@harrism harrism removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Nov 8, 2023
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please edit the PR to description to summarise the outcome around device_vector (rather than mentioning it as an issue to resolve).

@harrism
Copy link
Member Author

harrism commented Nov 8, 2023

Could you please edit the PR to description to summarise the outcome around device_vector (rather than mentioning it as an issue to resolve).

Done.

include/rmm/device_buffer.hpp Outdated Show resolved Hide resolved
include/rmm/cuda_device.hpp Outdated Show resolved Hide resolved
include/rmm/cuda_device.hpp Outdated Show resolved Hide resolved
include/rmm/cuda_device.hpp Outdated Show resolved Hide resolved
include/rmm/cuda_device.hpp Outdated Show resolved Hide resolved
@harrism
Copy link
Member Author

harrism commented Nov 15, 2023

/merge

@rapids-bot rapids-bot bot merged commit ba99ff4 into rapidsai:branch-23.12 Nov 15, 2023
44 checks passed
@bdice bdice added breaking Breaking change and removed non-breaking Non-breaking change labels Nov 15, 2023
rapids-bot bot pushed a commit to rapidsai/raft that referenced this pull request Nov 15, 2023
This PR removes static checks for serialization size. Upstream changes like rapidsai/rmm#1370 have altered these sizes and break RAFT CI. An alternative approach to verifying serialization will be developed.

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Divye Gala (https://github.com/divyegala)
  - Mark Harris (https://github.com/harrism)

URL: #1997
wence- added a commit to wence-/rmm that referenced this pull request Apr 10, 2024
Since rapidsai#1370, the dtor for device_buffer ensures that the correct
device is active when the deallocation occurs. We therefore update the
example to discuss this. Since device_vector still requires the user
to manage the active device correctly by hand, call this out
explicitly in the documentation.

- Closes rapidsai#1523
wence- added a commit to wence-/rmm that referenced this pull request Apr 10, 2024
Since rapidsai#1370, the dtor for device_buffer ensures that the correct
device is active when the deallocation occurs. We therefore update the
example to discuss this. Since device_vector still requires the user
to manage the active device correctly by hand, call this out
explicitly in the documentation.

- Closes rapidsai#1523
rapids-bot bot pushed a commit that referenced this pull request Apr 11, 2024
…#1524)

Since #1370, the dtor for device_buffer ensures that the correct device is active when the deallocation occurs. We therefore update the example to discuss this. Since device_vector still requires the user to manage the active device correctly by hand, call this out explicitly in the documentation.

- Closes #1523

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Mark Harris (https://github.com/harrism)

URL: #1524
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change CMake cpp Pertains to C++ code feature request New feature or request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[FEA] Relax restrictions for device_uvector destructor in multi-gpu code
4 participants