
Safety soundness of mut borrows in wavefront/warp/superslice invoked functions #10

Open
DiamondLovesYou opened this issue Nov 16, 2018 · 12 comments

Comments

@DiamondLovesYou
Contributor

Consider this valid Rust statement:

some_mut_slice[0] = dispatch_packet().global_id_x();

where the right-hand value is basically the equivalent of get_global_id(0) in OpenCL.

On AMDGPU at least, the value of some_mut_slice[0] after the kernel returns is undefined. I'll bet Nvidia behaves similarly. IBM's POWER9 (as used in Summit, currently the fastest supercomputer in the world) features SMT; for example, SMT1 mode on SMT4 hardware would be four slices (threads, basically) running with a single instruction pointer. It would be the same in SMT1 or SMT2 (though I don't have documentation to back this up).
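
To illustrate, here's a pseudo-expansion of what that single line does under SIMT (WAVEFRONT_SIZE and global_id_x_of are made-up names, and real hardware gives no ordering guarantee at all):

// Every lane in the wavefront executes the same store through the
// same `&mut` slice; which lane's value survives is undefined.
for lane in 0..WAVEFRONT_SIZE {
    some_mut_slice[0] = global_id_x_of(lane);
}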

Essentially, I think the issue is that the borrow model assumes a single thread really is a single thread, not a wavefront/warp/superslice, and therefore that a mut borrow is unique.

I have no idea how to go about solving this.

@gnzlbg
Contributor

gnzlbg commented Nov 16, 2018

Consider this valid Rust statement:

If what you are saying is that you are somehow constructing more than one mutable reference to some memory, then this is anything but valid. Since you haven't provided a full example, I can only speculate about what you mean.

I have no idea how to go about solving this.

Many libraries solve this problem (e.g. rayon).
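
For example, rayon's parallel iterators hand each closure invocation a disjoint &mut, so no two threads can ever alias the same element (a minimal sketch, assuming the rayon crate):

use rayon::prelude::*;

fn fill_ids(out: &mut [u64]) {
    // Each invocation of the closure receives a `&mut u64` to a
    // *different* element; aliasing two of them is impossible by
    // construction.
    out.par_iter_mut().enumerate().for_each(|(i, slot)| {
        *slot = i as u64;
    });
}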

I'll bet Nvidia behaves similarly.

I don't think the nvptx backend currently has this problem (EDIT: as in, it doesn't have any sort of implicit arrays yet, but I think we currently allow passing references/slices as arguments, which might be incorrect).

@DiamondLovesYou
Contributor Author

I mean that Rust will compile it without error. And that statement actually doesn't construct more than one mut ref on SIMT machines; instead, it's conceptually multiple threads having access to a single mut borrow. On AMDGPU at least, the mut ref will live in scalar registers; the work items will all have access to it (and the hardware will scalarize the writes, in an undefined order). Rust is unaware of the SIMT-y nature.

I recognize that the statement itself is a bit contrived and obviously invalid (for me, this is the big reason why I didn't think of this as an issue before). The problem is that Rust doesn't think it's invalid.

@bheisler

I think I see what you mean, yeah. The SIMT model is inconsistent with the safety guarantees of Rust.

Rayon doesn't solve this problem. Rayon (or, more accurately, std) prevents this problem by using the type system and the Send and Sync traits to guarantee that no two CPU threads can share a reference to the same non-Sync value, and that no two CPU threads can share a mutable reference to any value. Thus, Rust can prevent data races at compile time. Working around this limitation requires the use of unsafe.
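
For example, the analogous CPU code is rejected at compile time (a sketch using std::thread::scope; this is expected not to compile):

fn main() {
    let mut value = 0u64;
    std::thread::scope(|s| {
        s.spawn(|| value += 1);
        // error[E0499]: cannot borrow `value` as mutable more than
        // once at a time; the second thread is rejected statically.
        s.spawn(|| value += 1);
    });
}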

On the GPU, it's trivial to have multiple threads with mutable references to the same value, or with shared references to non-Sync values. That makes it possible to have memory unsafety and data races in Rust code without explicitly using unsafe.

It's unfortunate, but I think we just have to accept that GPGPU doesn't fit easily into Rust's safety guarantees. In addition, there are several ways to create data races using the CUDA API - for instance, launching a kernel that writes to Unified Memory, then reading that memory on the host without waiting for the kernel launch to complete. At worst, it just means that kernel entry functions and asynchronous launches must be marked as unsafe. Perhaps with time we'll think of better solutions.

@gnzlbg
Contributor

gnzlbg commented Nov 16, 2018

On the GPU, it's trivial to have multiple threads with mutable references to the same value, or with shared references to non-Sync values.

Kernels don't have shared memory arrays (today), so one would need to pass a kernel a mutable reference to create undefined behavior in safe code. We just need to add a check to the ptx-abi to forbid any type that transitively contains a &mut to anything.

@gnzlbg
Contributor

gnzlbg commented Nov 16, 2018

In particular, device functions can accept &mut T as arguments; only global functions (kernel entry points) cannot.
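
To illustrate the distinction (a sketch; #[global] stands in for however kernel entry points are actually marked):

// Fine as a device function: the `&mut` is created per thread from
// data that thread already uniquely owns, so it really is unique.
fn device_helper(x: &mut u64) {
    *x += 1;
}

// Not fine as a global entry point: every thread in the grid gets a
// copy of the *same* `&mut [u64]`, violating uniqueness. This is the
// kind of signature the proposed ABI check would reject.
#[global]
unsafe fn kernel(out: &mut [u64]) {
    out[0] = 42;
}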

@gnzlbg
Contributor

gnzlbg commented Nov 16, 2018

Technically, we don't even have to do that. Launching a global kernel is unsafe (like calling an extern function). If a global kernel accepts a &mut T, the unsafe code on the host that launches the kernel is incorrect, because it copies that &mut T into each SIMT thread.

@DiamondLovesYou
Contributor Author

@gnzlbg You requested a complete example on Zulip. I've pasted it below:

#![feature(allocator_api)]

extern crate hsa_rt;
extern crate runtime;
#[macro_use]
extern crate log;
extern crate env_logger;
extern crate legionella_std;

use std::alloc::{Global, Alloc, Layout, };
use std::mem::{size_of, };

use hsa_rt::agent::{DeviceType, };
use hsa_rt::queue::{FenceScope, };

use runtime::context::{Context, };
use runtime::module::{Invoc, };

use legionella_std::{dispatch_packet, mem::*, mem::slice::*, };

const WORKITEM_SIZE: usize = 8;
const X_SIZE: usize = 1000;

/// This is our kernel.
fn obviously_undefined_behaviour(mut out: SliceMut<u64>) {
  let dispatch = dispatch_packet();
  out[0] = dispatch.global_id_x() as u64;
}

pub fn main() {
  env_logger::init();
  let ctxt = Context::new()
    .expect("create context");

  let accel = ctxt
    .find_accel(|accel| {
      match accel.agent().device_type() {
        Ok(DeviceType::Gpu) => true,
        _ => false,
      }
    })
    .expect("lock failure")
    .expect("no suitable accelerators");

  let workitems = accel.isa()
    .map(|isa| {
      isa.workgroup_max_size()
    })
    .unwrap_or(Ok(1))
    .expect("get max workgroup size") as usize;
  assert!(workitems >= WORKITEM_SIZE,
          "not enough workitems per workgroup for this example");
  info!("using workgroup size of {}", WORKITEM_SIZE);

  let mut invoc = Invoc::new(ctxt.clone(), obviously_undefined_behaviour)
    .expect("Invoc::new");

  invoc.workgroup_dims((WORKITEM_SIZE, ))
    .expect("Invoc::workgroup_dims");
  invoc.grid_dims((X_SIZE, ))
    .expect("Invoc::grid_dims");
  invoc.begin_fence = Some(FenceScope::System);
  invoc.end_fence = Some(FenceScope::System);

  let kernel_props = invoc.precompile(&accel)
    .expect("kernel compilation");

  let agent = accel.agent();

  let data_bytes = X_SIZE * size_of::<u64>();

  let group_size = Some(kernel_props.group_segment_size());
  let private_size = Some(kernel_props.private_segment_size());

  let host_data = unsafe {
    // allocate the host frame data w/ page alignment. This isn't
    // *required*, but I'm betting it's nicer for the driver.
    // XXX hardcoded page size.
    let layout =
      Layout::from_size_align(data_bytes, 4096)
        .unwrap();
    let data = Global.alloc(layout)
      .expect("allocate kernel data");
    Vec::from_raw_parts(data.as_ptr() as *mut u64,
                        X_SIZE, X_SIZE)
  };
  let mut data = host_data.lock_memory(&[agent])
    .expect("lock host memory to GPU");

  let queue = agent.new_kernel_multi_queue(4, group_size, private_size)
    .expect("Agent::new_kernel_multi_queue");

  invoc.call_accel_sync(&accel, &queue, (data.as_slice_mut(), ))
    .expect("Invoc::call_accel_sync");

  println!("the winning global id is: {}", data.as_slice()[0]);
}

I ran this a few times and the results were 463, 815, and 495. I think you get the idea.

@gnzlbg
Contributor

gnzlbg commented Nov 16, 2018

I think you get the idea.

Yes, that's what I suspected you were trying to say. In that example, your Invoc::new API is unsound. Potentially, the language extension you have implemented in your fork of Rust is unsound as well (unless you give us access to it, or a specification of it, it's hard to tell).

The NVPTX backend that's part of Rust does not have these problems. The only way to introduce undefined behavior is by using unsafe to launch a global kernel, and if you launch a global kernel in a way that results in multiple &mut T to the same memory, then that is the error that introduces the undefined behavior: don't do that. We should definitely consider rejecting global kernels that accept mutable references.

@DiamondLovesYou
Contributor Author

My fork (not really a fork, as it tracks master; I just apply a few patches on top of master which haven't made it into Rust proper, nothing exotic) doesn't extend the Rust language at all (which is the point). The NVPTX target 100% has this issue too; the issue is completely target independent. My runtime doesn't support CUDA/NVPTX (I'd have to refactor a couple of things to abstract over the HSA/CUDA libraries first, so it's not a change I could finish today), but the crate is agnostic of the specific target machine.

I'd be willing to give working group members free access (possibly including source), but I haven't set up the infrastructure to support that yet (or infra to support commercial customers, for that matter; WIP).

@gnzlbg
Contributor

gnzlbg commented Nov 16, 2018

Can you show an example that exhibits this issue with the NVPTX target? I don't think one can be constructed without unsafe code.

@gnzlbg
Contributor

gnzlbg commented Nov 17, 2018

In particular, only global NVPTX kernels can be launched from the host, and these are always unsafe fn. If they don't uphold Rust's safety guarantees, then they are incorrect. I've elaborated a little more on this in issue #11.

@DiamondLovesYou
Contributor Author

I believe I've at least partially solved this SIMT issue (or rather, found a sort of workaround for it).

A rough description of the solution:

Require SIMT entry points to be Fn(), that is, to take no arguments and have no return value. Instead, data is passed to the kernel via externally defined globals; the runtime programmatically defines those globals with the resources used in the kernel just before launch.
This way, if an argument is to be mutated device side, it has to be declared static mut (or otherwise marked with interior mutability, like the atomics), and thus Rust will demand unsafe blocks to access it.

(I suppose I'm writing this from my framework's POV, which transforms things so that such globals are left undefined, as expected. Otherwise, you'd have to use an extern block for such globals, which I'm pretty sure requires unsafe unconditionally, so the effect should be similar, if a bit overzealous for this case.)
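
A rough sketch of the resulting shape (the #[kernel] attribute and the names here are illustrative, not my real API):

// Defined (or relocated) by the host runtime just before launch.
static mut OUT: *mut u64 = core::ptr::null_mut();

#[kernel] // hypothetical marker for a no-argument SIMT entry point
fn entry() {
    // The SIMT aliasing hazard is now explicit: touching the shared
    // global requires an `unsafe` block at the access site.
    unsafe {
        *OUT = dispatch_packet().global_id_x() as u64;
    }
}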

On its face, this won't allow implicitly safe usage, and it runs afoul of device-side enqueue/kernel launch: the to-be-launched kernel would have to be copied into a new kernel so the requested parameter relocations could be written (possibly including to the .text section), which I'm pretty sure isn't something you'd want the GPU to do anyway, even if it technically could. Thus you'd have to round-trip through the CPU first, which defeats the whole point.

Granted, this workaround flies in the face of what is pretty much universal practice in GPGPU space. Though it is curious that OpenGL/GLSL/Vulkan/SPIR-V use globals in a similar way for resources provided to shaders and even kernels.
