Description
Proposal
Add the amdgpu target to rustc, which allows generating code for AMD GPUs.
LLVM has good support for AMD GPUs through its amdgpu backend. The goal is to expose this to Rust, enabling Rust as another language on these GPUs. The main target is compute capabilities. This is in contrast to the rust-gpu project, which targets graphics capabilities through SPIR-V (the graphics variant). The base runtime to run compute programs on AMD GPUs is HSA (Heterogeneous System Architecture), which is implemented in ROCR-Runtime. Therefore, the Rust backend should target the amdhsa OS. To support the target in rustc, LLVM needs to be compiled with the amdgpu backend enabled. On Windows (and Linux), HIP can be used to load the same compiled amdgpu programs.
There are two points in which the amdgpu target is different from other (mainstream/x86) targets.
Address spaces
Address spaces can be thought of as denoting different physical memory areas (this is a mental model; it can be that way in hardware, but it does not need to be). In LLVM IR, each pointer has an address space, defaulting to addrspace(0) (this is implicit in textual IR, which is why you won’t see it there). Different address spaces have different properties, e.g. they can have a different pointer size, the null pointer can be different (e.g. 0 vs -1) and the machine instructions used to access them can be different.
The amdgpu LLVM backend makes heavy use of address spaces. This is also a problem for other targets that want to support Rust (though mostly more exotic ones). The use of address spaces leads to situations where a pointer in one address space needs to be cast to a pointer in a different address space. In LLVM IR, bitcast is invalid for this case; addrspacecast needs to be used.
The changes to rustc code should be mostly about fixing problems in a rather contained way. (I don’t know for sure how contained it will be, but compiling core required surprisingly few changes, as can be seen in rust-lang/rust#134740; disclaimer: I only tried running a very simple program with -Zbuild-std=core so far.) The changes will bring the Rust LLVM backend closer to how LLVM envisions address spaces, which should make it easier to support other future targets that use more address spaces (there was already some work for other targets, which in turn made it easier for amdgpu).
To get a feeling for what LLVM address spaces are used for, here is the list of the important amdgpu address spaces (from https://llvm.org/docs/AMDGPUUsage.html#address-spaces):
- 0 (generic/flat): This is a “catch all”. Loads and stores to addrspace(0) may go to any of the below address spaces. The hardware switches at runtime, depending on the pointer. This works for all pointers, but is the slowest.
- 1 (global): This is basically VRAM (i.e. the most “normal” memory).
- 3 (local/LDS): This is different memory in hardware, basically a software-defined cache. Loads and stores are a lot faster than to VRAM, but not globally visible to all other threads. (Known as groupshared in HLSL or shared in GLSL.)
- 4 (constant): Same memory region as global (1), but guaranteed to be constant throughout the program. This allows using different instructions that are faster by going through a different cache.
- 5 (private/scratch): This is used for the stack, all allocas need to be in addrspace(5). A thread can only access its own private memory.
Basic support for the amdgpu target means using address spaces 1 (incoming pointers), 5 (allocas) and 0 (if we don’t know which one it is). Support for groupshared memory in the language requires its own RFC (something similar to thread_local probably makes sense).
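To make the list more concrete, here is a small Rust model of these address spaces. It is only a sketch and not rustc or LLVM API: the names AddrSpace, PointerOrigin and basic_addr_space are made up for illustration. The 64-bit flat and 32-bit private pointer widths and the 0 vs -1 null values follow this proposal; the values for global, local and constant are as I read the LLVM AMDGPU usage documentation and should be double-checked there.

```rust
/// Toy model of the amdgpu address spaces (hypothetical type, not rustc/LLVM API).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AddrSpace {
    Flat = 0,
    Global = 1,
    Local = 3,
    Constant = 4,
    Private = 5,
}

impl AddrSpace {
    /// Pointer width in bits (assumption for Global/Local/Constant, see lead-in).
    pub fn pointer_bits(self) -> u32 {
        match self {
            AddrSpace::Flat | AddrSpace::Global | AddrSpace::Constant => 64,
            AddrSpace::Local | AddrSpace::Private => 32,
        }
    }

    /// Bit pattern of the null pointer: 0 for the 64-bit spaces,
    /// all ones (-1) for the 32-bit spaces in this model.
    pub fn null_value(self) -> u64 {
        match self {
            AddrSpace::Flat | AddrSpace::Global | AddrSpace::Constant => 0,
            AddrSpace::Local | AddrSpace::Private => 0xFFFF_FFFF,
        }
    }
}

/// Where a pointer comes from, as far as basic support is concerned.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PointerOrigin {
    Incoming,
    Alloca,
    Unknown,
}

/// Address space used for each origin under basic support:
/// 1 for incoming pointers, 5 for allocas, 0 when unknown.
pub fn basic_addr_space(origin: PointerOrigin) -> AddrSpace {
    match origin {
        PointerOrigin::Incoming => AddrSpace::Global,
        PointerOrigin::Alloca => AddrSpace::Private,
        PointerOrigin::Unknown => AddrSpace::Flat,
    }
}
```

For example, basic_addr_space(PointerOrigin::Alloca).pointer_bits() is 32 in this model, while a flat pointer is 64 bits wide, which is exactly the mismatch that makes address space casts necessary.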
Casting pointers to addrspace(0) before use
I experimented more and came to the conclusion that all pointers need to be cast to addrspace(0) before they are used (this affects alloca and global variables). If we don’t do that, things go wrong with the code below:
fn f(cond: bool, p: *const i8 /* addrspace(0) */) -> *const i8 /* addrspace(0) */ {
    let local = 0i8; /* addrspace(5) */
    let res = if cond { p } else { &raw const local };
    res
}
results in
%local = alloca addrspace(5) i8
%res = alloca addrspace(5) ptr
if:
; Store 64-bit flat pointer
store ptr %p, ptr addrspace(5) %res
else:
; Store 32-bit scratch pointer
store ptr addrspace(5) %local, ptr addrspace(5) %res
ret:
; Load and return 64-bit flat pointer
%res.load = load ptr, ptr addrspace(5) %res
ret ptr %res.load
This may store a 32-bit pointer and read it back as a 64-bit pointer, which is obviously wrong and cannot work. Instead, we need to addrspacecast %local to ptr addrspace(0), then we store and load the correct type.
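The width mismatch can be simulated on the host in plain Rust. This is only an illustration of the problem, not how the backend works: Slot, without_cast, with_cast and cast_private_to_flat are hypothetical names, and PRIVATE_APERTURE_BASE is a made-up value standing in for wherever the hardware maps scratch memory in the flat address space. The sketch also ignores that a real addrspacecast has to map the null pointer specially.

```rust
/// Made-up base of the private aperture in the flat address space (assumption).
const PRIVATE_APERTURE_BASE: u64 = 0x0000_7000_0000_0000;

/// Models `addrspacecast ptr addrspace(5) ... to ptr`: widens a 32-bit
/// scratch pointer into a 64-bit flat pointer.
pub fn cast_private_to_flat(private_ptr: u32) -> u64 {
    PRIVATE_APERTURE_BASE | u64::from(private_ptr)
}

/// An 8-byte stack slot, like `%res = alloca addrspace(5) ptr`.
pub struct Slot(pub [u8; 8]);

impl Slot {
    /// Store of a 32-bit pointer: only the low four bytes are written.
    pub fn store_u32(&mut self, value: u32) {
        self.0[..4].copy_from_slice(&value.to_le_bytes());
    }

    /// Store of a 64-bit pointer: the whole slot is written.
    pub fn store_u64(&mut self, value: u64) {
        self.0 = value.to_le_bytes();
    }

    /// Load of a 64-bit pointer: the whole slot is read.
    pub fn load_u64(&self) -> u64 {
        u64::from_le_bytes(self.0)
    }
}

/// The broken variant: store 32 bits, load 64 bits.
/// `stale` is whatever happened to be in the slot before.
pub fn without_cast(stale: u64, local: u32) -> u64 {
    let mut slot = Slot(stale.to_le_bytes());
    slot.store_u32(local);
    slot.load_u64()
}

/// The fixed variant: cast to a flat pointer first, then store and load 64 bits.
pub fn with_cast(stale: u64, local: u32) -> u64 {
    let mut slot = Slot(stale.to_le_bytes());
    slot.store_u64(cast_private_to_flat(local));
    slot.load_u64()
}
```

In the broken variant the upper half of the loaded pointer is leftover data from the slot, so the result depends on stale. In the fixed variant the stored and loaded types agree and the result is independent of the previous slot contents.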
So, I think the way to go is casting every pointer to addrspace(0) immediately after creating an alloca or a global. For alloca, the change is just 2 lines; for globals it is a bit more involved due to vtables, where the global variable is modified after it is created and therefore we need to look through the addrspacecast constexpr when adding attributes.
Many processors / target-cpus
Every generation of GPUs uses different machine code (to some extent). LLVM supports them as different “cpu”s or processors (-Ctarget-cpu= argument for rustc).
There are two challenges for this regarding Rust support:
1. A single processor needs to be the default in the target description.
2. If Rust were to distribute compiled libraries (e.g. core), they would need to be built for all processors to be useful.
There is no obvious choice for the default processor. If some processor is the default and a user tries to use the amdgpu target without overriding the target-cpu, it likely results in an unusable binary. We could use a non-existing “cpu” as the default, resulting in compiler errors, to make users aware of the need to set a target-cpu.
E.g. "please specify -Ctarget-cpu" results in failing compilations and warnings:
'please specify -Ctarget-cpu' is not a recognized processor for this target (ignoring processor)
Alternatively, some choice can be made, like gfx900.
There is a PR that fails compilation if no cpu is specified explicitly: rust-lang/rust#135030
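The "fail if no cpu is given" behaviour can be sketched in a few lines of Rust. This is not the code from that PR; target_cpu_from_args is a hypothetical helper that only recognizes the -Ctarget-cpu=<name> spelling, and letting the last occurrence win is a choice made for this sketch.

```rust
/// Hypothetical helper: extracts the processor from rustc-style arguments
/// and fails when none is given, instead of falling back to a default.
pub fn target_cpu_from_args<'a>(args: &[&'a str]) -> Result<&'a str, String> {
    args.iter()
        .copied()
        // Keep only arguments of the form `-Ctarget-cpu=<name>`.
        .filter_map(|arg| arg.strip_prefix("-Ctarget-cpu="))
        // If the flag is repeated, this sketch uses the last one.
        .last()
        // An empty name counts as "not specified".
        .filter(|cpu| !cpu.is_empty())
        .ok_or_else(|| "no processor specified, please pass -Ctarget-cpu=<cpu>".to_string())
}
```

With this helper, ["-Ctarget-cpu=gfx900"] yields gfx900, while an argument list without the flag produces an error rather than silently compiling for a processor the user may not have.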
Regarding 2., there may be a generic backend for amdgpu in the future, relying on SPIR-V (the compute variant), which would solve both these issues, but that is a long way off and it is unclear whether it will eventually happen. For the reason of binary size alone, it does not make sense for Rust to distribute pre-compiled code for the amdgpu backend. Users should instead specify their processor via -Ctarget-cpu= and compile core via -Zbuild-std=core or similar means.
The list of processors supported by LLVM is here: https://llvm.org/docs/AMDGPUUsage.html#processors
Related issues
- The PR to add the amdgpu target: "Add amdgpu target" (rust-lang/rust#134740)
- New tracking issue: "Tracking Issue for amdgcn target" (rust-lang/rust#135024)
- The original tracking issue for amdgpu (has been closed for a while): "Tracking issue for targeting AMDGPU devices" (rust-lang/rust#51575)
Mentors or Reviewers
None yet, this is my first Rust contribution :)
Process
The main points of the Major Change Process are as follows:
- File an issue describing the proposal.
- A compiler team member or contributor who is knowledgeable in the area can second by writing @rustbot second.
  - Finding a "second" suffices for internal changes. If, however, you are proposing a new public-facing feature, such as a -C flag, then full team check-off is required.
  - Compiler team members can initiate a check-off via @rfcbot fcp merge on either the MCP or the PR.
- Once an MCP is seconded, the Final Comment Period begins. If no objections are raised after 10 days, the MCP is considered approved.
You can read more about Major Change Proposals on forge.
Comments
This issue is not meant to be used for technical discussion. There is a Zulip stream for that. Use this issue to leave procedural comments, such as volunteering to review, indicating that you second the proposal (or third, etc), or raising a concern that you would like to be addressed.