Update multi-gpu discussion for device_buffer and device_vector dtors #1524

Merged 2 commits on Apr 11, 2024
README.md (+55 −8)

Note that the CUDA device that is current when creating a `device_memory_resource` must also be
current any time that `device_memory_resource` is used to deallocate memory, including in a
destructor. The RAII class `rmm::device_buffer` and classes that use
it as a backing store (`rmm::device_scalar` and `rmm::device_uvector`)
handle this by storing the active device when the constructor is
called, and then ensuring that the stored device is active whenever an
allocation or deallocation is performed (including in the destructor).
The user therefore need only ensure that the device that is active
during _creation_ of an `rmm::device_buffer` matches the active device
of the memory resource being used.

Here is an incorrect example that creates a memory resource on device
zero and then uses it to allocate a `device_buffer` on device one:

```c++
{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  {
    RMM_CUDA_TRY(cudaSetDevice(1));
    // Invalid, current device is 1, but MR is only valid for device 0
    rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
  }
}
```

A correct example creates the device buffer with device zero active.
After that, it is safe to switch devices and let the buffer go out of
scope and be destroyed while a different device is active. For example,
this code is correct:
```c++
{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // No need to switch back to device 0 before ~buf runs
}
```
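
Because each `rmm::device_buffer` records its device at construction, buffers
created on different devices can even be destroyed together from a single
scope. Here is a minimal sketch of that pattern, assuming at least two visible
devices; the container types, loop bound, and sizes are illustrative:

```c++
{
  std::vector<std::unique_ptr<rmm::mr::cuda_memory_resource>> mrs;
  std::vector<rmm::device_buffer> buffers;
  for (int i = 0; i < 2; ++i) {
    RMM_CUDA_TRY(cudaSetDevice(i));
    // Create each resource, and each buffer, with device i active
    mrs.push_back(std::make_unique<rmm::mr::cuda_memory_resource>());
    buffers.emplace_back(16, rmm::cuda_stream_default, mrs.back().get());
  }
  RMM_CUDA_TRY(cudaSetDevice(0));
  // OK: each ~device_buffer makes its recorded device active before
  // deallocating, so both buffers are freed correctly even though
  // device 0 is active here.
}
```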

#### Use of `rmm::device_vector` with multiple devices

> [!CAUTION]
> In contrast to the uninitialized `rmm::device_uvector`,
> `rmm::device_vector` **DOES NOT** store the active device during
> construction, and therefore cannot arrange for it to be active when
> the destructor runs. It is therefore the responsibility of the user
> to ensure the currently active device is correct.

`rmm::device_vector` is therefore slightly less ergonomic to use in a
multi-device setting, since the caller must ensure that the active
device at allocation time matches the active device at deallocation
time. Recapitulating the previous example using `rmm::device_vector`:

```c++
{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // ERROR: ~vec runs with device 1 active, but needs device 0 to be active
}
```

A correct example adds a call to `cudaSetDevice(0)` at the line of the
error comment, before the destructor of `vec` runs.
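Following that instruction, a minimal corrected sketch looks like this:

```c++
{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // Switch back to device 0 so that ~vec deallocates on the correct device
  RMM_CUDA_TRY(cudaSetDevice(0));
}
```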

## `cuda_stream_view` and `cuda_stream`
