From 7d7d65ab115b3e96b0ebc51d45a90020d8037439 Mon Sep 17 00:00:00 2001
From: Lawrence Mitchell <lmitchell@nvidia.com>
Date: Thu, 11 Apr 2024 09:12:56 +0100
Subject: [PATCH] Update multi-gpu discussion for device_buffer and
 device_vector dtors (#1524)

Since #1370, the dtor for device_buffer ensures that the correct device is active when the deallocation occurs. We therefore update the example to discuss this. Since device_vector still requires the user to manage the active device correctly by hand, call this out explicitly in the documentation.

- Closes #1523

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Mark Harris (https://github.com/harrism)

URL: https://github.com/rapidsai/rmm/pull/1524
---
 README.md | 59 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 51 insertions(+), 8 deletions(-)
diff --git a/README.md b/README.md
index 0fe848fea..5b7dc69c0 100644
--- a/README.md
+++ b/README.md
@@ -336,25 +336,68 @@ for(int i = 0; i < N; ++i) {
 
 Note that the CUDA device that is current when creating a `device_memory_resource` must also be
 current any time that `device_memory_resource` is used to deallocate memory, including in a
-destructor. This affects RAII classes like `rmm::device_buffer` and `rmm::device_uvector`. Here's an
-(incorrect) example that assumes the above example loop has been run to create a
-`pool_memory_resource` for each device. A correct example adds a call to `cudaSetDevice(0)` on the
-line of the error comment.
+destructor. The RAII class `rmm::device_buffer` and classes that use it as a backing store
+(`rmm::device_scalar` and `rmm::device_uvector`) handle this by storing the active device when the
+constructor is called, and then ensuring that the stored device is active whenever an allocation or
+deallocation is performed (including in the destructor). The user must therefore only ensure that
+the device active during _creation_ of an `rmm::device_buffer` matches the active device of the
+memory resource being used.
+
+Here is an incorrect example that creates a memory resource on device zero and then uses it to
+allocate a `device_buffer` on device one:
 
 ```c++
 {
   RMM_CUDA_TRY(cudaSetDevice(0));
-  rmm::device_buffer buf_a(16);
-
+  auto mr = rmm::mr::cuda_memory_resource{};
   {
     RMM_CUDA_TRY(cudaSetDevice(1));
-    rmm::device_buffer buf_b(16);
+    // Invalid, current device is 1, but MR is only valid for device 0
+    rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
   }
+}
+```
+
+A correct example creates the device buffer with device zero active. After that it is safe to switch
+devices and let the buffer go out of scope and destruct with a different device active. For example,
+this code is correct:
+
+```c++
+{
+  RMM_CUDA_TRY(cudaSetDevice(0));
+  auto mr = rmm::mr::cuda_memory_resource{};
+  rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
+  RMM_CUDA_TRY(cudaSetDevice(1));
+  ...
+  // No need to switch back to device 0 before ~buf runs
+}
+```
+
+#### Use of `rmm::device_vector` with multiple devices
+
+> [!CAUTION] In contrast to the uninitialized `rmm:device_uvector`, `rmm::device_vector` **DOES
+> NOT** store the active device during construction, and therefore cannot arrange for it to be
+> active when the destructor runs. It is therefore the responsibility of the user to ensure the
+> currently active device is correct.
+
+`rmm::device_vector` is therefore slightly less ergonomic to use in a multiple device setting since
+the caller must arrange that active devices on allocation and deallocation match. Recapitulating the
+previous example using `rmm::device_vector`:
 
-  // Error: when buf_a is destroyed, the current device must be 0, but it is 1
+```c++
+{
+  RMM_CUDA_TRY(cudaSetDevice(0));
+  auto mr = rmm::mr::cuda_memory_resource{};
+  rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));
+  RMM_CUDA_TRY(cudaSetDevice(1));
+  ...
+  // ERROR: ~vec runs with device 1 active, but needs device 0 to be active
 }
 ```
 
+A correct example adds a call to `cudaSetDevice(0)` on the line of the error comment before the dtor
+for `~vec` runs.
+
 ## `cuda_stream_view` and `cuda_stream`
 
 `rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's