Description
Motivation
Launching NVPTX global kernels is unsafe: they are unsafe fn, and this requires the Rust program that launches them to use an unsafe block. For most examples below, this program has undefined behavior because the unsafe code it contains is incorrect.
However, most of the kernels below are never correct, so it would be very helpful for the compiler to reject them, or at least to warn about their issues.
Examples
These are some examples of code that's accepted today. Most of these examples are always UB.
Launching these global kernels
#[no_mangle]
pub unsafe extern "ptx-kernel" fn foo(a: &mut f32) {} // UB: multiple &mut
pub struct Bar<'a>(&'a mut f32);
#[no_mangle]
pub unsafe extern "ptx-kernel" fn bar<'a>(a: Bar<'a>) {} // UB: multiple &mut
is always undefined behavior: these kernels are spawned in multiple threads of execution, each containing a copy of the same &mut T to the same data. On the other hand:
#[no_mangle]
pub unsafe extern "ptx-kernel" fn foo(a: &mut f32) {}
#[no_mangle]
pub unsafe extern "ptx-kernel" fn bar(mut a: f32) {
    foo(&mut a); // OK - global kernels can be called from other kernels
}
Global kernels that are called from other kernels are executed in the same thread of execution. The same holds for device kernels:
fn device(a: &mut i32) { *a += 1; } // OK
#[no_mangle]
pub unsafe extern "ptx-kernel" fn global() {
    let mut a = 0;
    device(&mut a); // OK: each a is local to each thread of execution
}
We don't support static and dynamic shared arrays in kernels yet, but NVPTX does, and we'd like to support them at some point. These arrays are shared across all threads of execution without any synchronization:
fn device(a: &mut [i32; 32]) {
    a[0] += 1; // UB: data-race
}
#[no_mangle]
pub unsafe extern "ptx-kernel" fn global() {
    let mut a = UnsyncShared::<[i32; 32]>::new(); // OK: create unsynchronized shared memory array
    device(&mut a); // UB: multiple &mut to same object
}
Note that there are two issues with these. When a device function creates them, these are shared across all execution threads of that device function. That is, taking a &mut T to the whole array creates many copies, one on each execution thread, of the same &mut T to the exact same data. This is already undefined behavior, and can be used to introduce data-races.
We might want to support synchronized (e.g. atomic) versions of the shared memory arrays as well. While they might avoid the data-race, taking a &mut T to the array still creates multiple &mut T to the same data, which is undefined behavior. That is, just adding synchronization does not solve the problem (this is also not desirable for performance).
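As a rough illustration of the point above, the following CPU-only sketch simulates device threads with std::thread and shows one way a synchronized shared array could be used without ever forming a &mut T to it: every thread only holds a shared reference to an array of atomics. The function names and the use of host threads are assumptions made for illustration; nothing here is an existing NVPTX API.

```rust
use std::sync::atomic::{AtomicI32, Ordering};

// Device-style function: takes a shared reference, never a `&mut`,
// so many threads may hold it at once without aliasing violations.
fn device(a: &[AtomicI32; 4]) {
    a[0].fetch_add(1, Ordering::Relaxed);
}

/// Hypothetical CPU simulation of a kernel launch with `threads`
/// threads of execution that all see the same "shared memory" array.
pub fn simulate_atomic_launch(threads: usize) -> i32 {
    let shared: [AtomicI32; 4] = std::array::from_fn(|_| AtomicI32::new(0));
    std::thread::scope(|s| {
        for _ in 0..threads {
            // Each simulated thread borrows the array immutably.
            s.spawn(|| device(&shared));
        }
    });
    // All simulated threads have finished once the scope returns.
    shared[0].load(Ordering::Relaxed)
}
```

This avoids both the data-race and the aliased &mut T, but it pays for an atomic operation on every access, which is the performance cost mentioned above.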
We'd like to accept this code:
fn device(a: &mut i32) { *a += 1; }
#[no_mangle]
pub unsafe extern "ptx-kernel" fn global() {
    let mut a = UnsyncShared::<[i32; 32]>::new(); // OK: create unsynchronized shared memory array
    device(&mut a[nvptx::_thread_idx_x() as usize]); // UB
}
but note that IndexMut::index_mut(&mut self) would create multiple &mut T to the shared array, one on each thread, which results in UB as well. The following example should work, but is not very nice:
fn device(a: &mut i32) { *a += 1; }
#[no_mangle]
pub unsafe extern "ptx-kernel" fn global() {
    let mut a = UnsyncShared::<[i32; 32]>::new(); // OK: create unsynchronized shared memory array
    let p = &mut a as *mut _ as *mut i32; // OK: &mut T as *mut T does not create a &mut T
    device(unsafe { &mut *p.add(nvptx::_thread_idx_x() as usize) }); // OK
}
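The raw-pointer pattern above could perhaps be hidden behind a small wrapper. The sketch below is a CPU-only simulation under stated assumptions: SharedSlots, slot_mut and simulate_launch are hypothetical names, host threads stand in for device threads, and the loop index stands in for the thread index. It only illustrates handing each thread a &mut to its own element without ever creating a &mut to the whole array.

```rust
use std::cell::UnsafeCell;

/// Hypothetical stand-in for an unsynchronized shared-memory array.
pub struct SharedSlots<T, const N: usize>(UnsafeCell<[T; N]>);

// SAFETY: sharing across threads is only sound if every caller of
// `slot_mut` upholds its contract (no two live borrows of one index).
unsafe impl<T: Send, const N: usize> Sync for SharedSlots<T, N> {}

impl<T, const N: usize> SharedSlots<T, N> {
    pub const fn new(init: [T; N]) -> Self {
        Self(UnsafeCell::new(init))
    }

    /// # Safety
    /// At most one live reference per index may exist at any time.
    #[allow(clippy::mut_from_ref)]
    pub unsafe fn slot_mut(&self, idx: usize) -> &mut T {
        assert!(idx < N);
        // Go through a raw element pointer; no `&mut [T; N]` is formed.
        unsafe { &mut *(self.0.get() as *mut T).add(idx) }
    }

    pub fn into_inner(self) -> [T; N] {
        self.0.into_inner()
    }
}

fn device(a: &mut i32) {
    *a += 1;
}

/// Simulates a launch with 4 threads, each touching only its own slot.
pub fn simulate_launch() -> [i32; 4] {
    let shared = SharedSlots::<i32, 4>::new([0; 4]);
    std::thread::scope(|s| {
        for tid in 0..4 {
            let shared = &shared;
            s.spawn(move || {
                // SAFETY: every simulated thread uses a distinct index.
                device(unsafe { shared.slot_mut(tid) });
            });
        }
    });
    shared.into_inner()
}
```

The unsafe contract has not gone away: the wrapper only moves it to a single documented method, and the caller still has to guarantee that indices are not shared between threads.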
Questions
What general approaches do we have to make these examples sound?
- trivial: reject global kernels (abi ptx-kernel) that are not unsafe fn
Should we also pursue an approach that lints on "improper global/device kernel arguments"? E.g.
- global and device kernel arguments require Sync. This is probably too hard a constraint, since it does not allow raw pointers. Also, we technically only require Sync for mutable references to shared memory; mutable references that do not point to shared memory are fine.
- launching a global kernel requires the types passed to the kernel to be SendGPU or similar (DeviceCopy, as @bheisler put it below), since these arguments need to be sendable from the host to the device, and copyable to the multiple execution threads of the device.
It might get tricky to propagate these lints through generic code, e.g., when calling Index::index as a device function. Also, Sync prevents raw pointers. A simple wrapper solves this, but we might want to allow raw pointers for convenience here.
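To make the second bullet more concrete, here is a minimal sketch of what a DeviceCopy-style marker trait might look like. The trait definition, its impls and the copies_per_thread helper are all assumptions for illustration; no such trait exists in the compiler or the standard library today, and the helper merely simulates on the host the "one copy per execution thread" step.

```rust
/// Hypothetical marker trait for types that may be passed to a global
/// kernel: they must be copyable to every thread of execution.
///
/// # Safety
/// Implementors must be plain data that is valid to copy bitwise from
/// the host to the device.
pub unsafe trait DeviceCopy: Copy {}

unsafe impl DeviceCopy for i32 {}
unsafe impl DeviceCopy for f32 {}
// Raw pointers are allowed here, unlike with a `Sync` bound.
unsafe impl<T> DeviceCopy for *const T {}
unsafe impl<T> DeviceCopy for *mut T {}

/// Host-side simulation of what a launch does with one argument:
/// each of the `threads` execution threads receives its own copy.
pub fn copies_per_thread<A: DeviceCopy>(threads: usize, arg: A) -> Vec<A> {
    (0..threads).map(|_| arg).collect()
}
```

Because &mut T is not Copy, a Copy supertrait would reject kernel arguments such as &mut f32, or a struct wrapping one, at the launch site. It does nothing for shared memory arrays created inside a kernel, so the question about those remains open.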
What do we do about shared memory device arrays? Taking a &mut to them is always undefined behavior, which makes them extremely easy to use incorrectly, and very hard and unergonomic to use correctly.
Are there any other ways of tackling this problem?