Update multi-gpu discussion for device_buffer and device_vector dtors #1524

Merged 2 commits on Apr 11, 2024
README.md (+55 −8)

Note that the CUDA device that is current when creating a `device_memory_resource` must also be
current any time that `device_memory_resource` is used to deallocate memory, including in a
destructor. The RAII class `rmm::device_buffer` and classes that use
it as a backing store (`rmm::device_scalar` and `rmm::device_uvector`)
handle this by storing the active device when the constructor is
called, and then ensuring that the stored device is active whenever an
allocation or deallocation is performed (including in the destructor).
The user therefore need only ensure that the device that is active
during _creation_ of an `rmm::device_buffer` matches the active device
of the memory resource being used.

Here is an incorrect example that creates a memory resource on device
zero and then uses it to allocate a `device_buffer` on device one:

```c++
{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  {
    RMM_CUDA_TRY(cudaSetDevice(1));
    // Invalid, current device is 1, but MR is only valid for device 0
    rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
  }
}
```

A correct example creates the device buffer with device zero active.
After that, it is safe to switch devices and let the buffer go out of
scope and be destroyed while a different device is active. For example,
this code is correct:
```c++
{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // No need to switch back to device 0 before ~buf runs
}
```
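
Because each `rmm::device_buffer` records its device at construction, buffers
created on different devices can even be destroyed together from a single
scope. Here is a minimal sketch of that pattern, assuming at least two visible
devices; the container types, loop bound, and sizes are illustrative:

```c++
{
  std::vector<std::unique_ptr<rmm::mr::cuda_memory_resource>> mrs;
  std::vector<rmm::device_buffer> buffers;
  for (int i = 0; i < 2; ++i) {
    RMM_CUDA_TRY(cudaSetDevice(i));
    // Create each resource, and each buffer, with device i active
    mrs.push_back(std::make_unique<rmm::mr::cuda_memory_resource>());
    buffers.emplace_back(16, rmm::cuda_stream_default, mrs.back().get());
  }
  RMM_CUDA_TRY(cudaSetDevice(0));
  // OK: each ~device_buffer makes its recorded device active before
  // deallocating, so both buffers are freed correctly even though
  // device 0 is active here.
}
```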

#### Use of `rmm::device_vector` with multiple devices

> [!CAUTION]
> In contrast to the uninitialized `rmm::device_uvector`,
> `rmm::device_vector` **DOES NOT** store the active device during
> construction, and therefore cannot arrange for it to be active when
> the destructor runs. It is therefore the responsibility of the user
> to ensure the currently active device is correct.

`rmm::device_vector` is therefore slightly less ergonomic to use in a
multi-device setting, since the caller must ensure that the active
device at allocation time matches the active device at deallocation
time. Recapitulating the previous example using `rmm::device_vector`:

```c++
{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // ERROR: ~vec runs with device 1 active, but needs device 0 to be active
}
```

A correct example adds a call to `cudaSetDevice(0)` at the line of the
error comment, before the destructor of `vec` runs.
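Following that instruction, a minimal corrected sketch looks like this:

```c++
{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // Switch back to device 0 so that ~vec deallocates on the correct device
  RMM_CUDA_TRY(cudaSetDevice(0));
}
```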

## `cuda_stream_view` and `cuda_stream`
