From 7d7d65ab115b3e96b0ebc51d45a90020d8037439 Mon Sep 17 00:00:00 2001
From: Lawrence Mitchell
Date: Thu, 11 Apr 2024 09:12:56 +0100
Subject: [PATCH] Update multi-gpu discussion for device_buffer and device_vector dtors (#1524)

Since #1370, the dtor for device_buffer ensures that the correct device is active when the deallocation occurs. We therefore update the example to discuss this. Since device_vector still requires the user to manage the active device correctly by hand, call this out explicitly in the documentation.

- Closes #1523

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Mark Harris (https://github.com/harrism)

URL: https://github.com/rapidsai/rmm/pull/1524
---
 README.md | 59 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 51 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 0fe848fea..5b7dc69c0 100644
--- a/README.md
+++ b/README.md
@@ -336,25 +336,68 @@ for(int i = 0; i < N; ++i) {
 
 Note that the CUDA device that is current when creating a `device_memory_resource` must also be
 current any time that `device_memory_resource` is used to deallocate memory, including in a
-destructor. This affects RAII classes like `rmm::device_buffer` and `rmm::device_uvector`. Here's an
-(incorrect) example that assumes the above example loop has been run to create a
-`pool_memory_resource` for each device. A correct example adds a call to `cudaSetDevice(0)` on the
-line of the error comment.
+destructor. The RAII class `rmm::device_buffer` and classes that use it as a backing store
+(`rmm::device_scalar` and `rmm::device_uvector`) handle this by storing the active device when the
+constructor is called, and then ensuring that the stored device is active whenever an allocation or
+deallocation is performed (including in the destructor). The user must therefore only ensure that
+the device active during _creation_ of an `rmm::device_buffer` matches the active device of the
+memory resource being used.
+
+Here is an incorrect example that creates a memory resource on device zero and then uses it to
+allocate a `device_buffer` on device one:
 
 ```c++
 {
   RMM_CUDA_TRY(cudaSetDevice(0));
-  rmm::device_buffer buf_a(16);
-
+  auto mr = rmm::mr::cuda_memory_resource{};
   {
     RMM_CUDA_TRY(cudaSetDevice(1));
-    rmm::device_buffer buf_b(16);
+    // Invalid, current device is 1, but MR is only valid for device 0
+    rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
   }
+}
+```
+
+A correct example creates the device buffer with device zero active. After that it is safe to switch
+devices and let the buffer go out of scope and destruct with a different device active. For example,
+this code is correct:
+
+```c++
+{
+  RMM_CUDA_TRY(cudaSetDevice(0));
+  auto mr = rmm::mr::cuda_memory_resource{};
+  rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
+  RMM_CUDA_TRY(cudaSetDevice(1));
+  ...
+  // No need to switch back to device 0 before ~buf runs
+}
+```
+
+#### Use of `rmm::device_vector` with multiple devices
+
+> [!CAUTION] In contrast to the uninitialized `rmm::device_uvector`, `rmm::device_vector` **DOES
+> NOT** store the active device during construction, and therefore cannot arrange for it to be
+> active when the destructor runs. It is therefore the responsibility of the user to ensure the
+> currently active device is correct.
+
+`rmm::device_vector` is therefore slightly less ergonomic to use in a multiple device setting since
+the caller must arrange that active devices on allocation and deallocation match. Recapitulating the
+previous example using `rmm::device_vector`:
 
-  // Error: when buf_a is destroyed, the current device must be 0, but it is 1
+```c++
+{
+  RMM_CUDA_TRY(cudaSetDevice(0));
+  auto mr = rmm::mr::cuda_memory_resource{};
+  rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));
+  RMM_CUDA_TRY(cudaSetDevice(1));
+  ...
+  // ERROR: ~vec runs with device 1 active, but needs device 0 to be active
 }
 ```
 
+A correct example adds a call to `cudaSetDevice(0)` on the line of the error comment, before the
+destructor `~vec` runs.
+
 ## `cuda_stream_view` and `cuda_stream`
 
 `rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's
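
Review note: the destructor behaviour this patch documents for `rmm::device_buffer` (remember the device that was active at construction, then make it active again for deallocation) can be illustrated with a small host-only sketch. Everything in the `sketch` namespace is hypothetical and is not RMM's implementation: `current_device` and `set_device` stand in for `cudaGetDevice` and `cudaSetDevice`, and `tracked_buffer` only records which device was active when it was released. The sketch also assumes the previously active device is restored after the release.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// Hypothetical stand-in for the CUDA runtime's notion of a current device.
inline int& current_device()
{
  static int device = 0;
  return device;
}

// Hypothetical stand-in for cudaSetDevice.
inline void set_device(int id) { current_device() = id; }

// Activates `target` for its lifetime and restores the previous device afterwards.
class device_guard {
 public:
  explicit device_guard(int target) : previous_{current_device()} { set_device(target); }
  ~device_guard() { set_device(previous_); }
  device_guard(device_guard const&)            = delete;
  device_guard& operator=(device_guard const&) = delete;

 private:
  int previous_;
};

// Remembers the device active at construction and re-activates it while releasing.
class tracked_buffer {
 public:
  tracked_buffer(std::size_t bytes, int* released_on)
    : bytes_{bytes}, owner_device_{current_device()}, released_on_{released_on}
  {
  }
  ~tracked_buffer()
  {
    device_guard guard{owner_device_};
    *released_on_ = current_device();  // a real buffer would deallocate here
  }
  tracked_buffer(tracked_buffer const&)            = delete;
  tracked_buffer& operator=(tracked_buffer const&) = delete;

  std::size_t size() const { return bytes_; }

 private:
  std::size_t bytes_;
  int owner_device_;
  int* released_on_;
};

// Mirrors the corrected README example: construct on device 0, destruct with device 1 active.
inline int release_device_after_switch()
{
  int released_on = -1;
  set_device(0);
  {
    tracked_buffer buf{16, &released_on};
    set_device(1);
  }  // ~tracked_buffer runs here while device 1 is active
  return released_on;
}

}  // namespace sketch
```

Calling `sketch::release_device_after_switch()` reports that the release happened with device 0 active, while device 1 remains the current device for the caller afterwards. A type without the stored device, which is the situation the patch describes for `rmm::device_vector`, would release with device 1 active unless the caller switched back by hand.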