-
Notifications
You must be signed in to change notification settings - Fork 180
Description
Hi, I'm continuing to investigate the strange correctness issue I described in #275 (one of two DPAS calls is returning a zero matrix, i.e. the values of C matrix in A * B + C). I found that the issue arises when all of the followings are true:
(1) DPAS is used
(2) local memory is used
(3) Multiple DPAS are executed by a single workgroup, either via a single subgroup doing multiple DPAS or multiple subgroups (say, in a workgroup with 2 x 8 threads) doing one DPAS each.
Any combination of the two conditions above works fine, so for example if I don't use local memory or DPAS at all, everything is fine. Or using DPAS + local memory to compute a single fp16 m8n8k16 matmul is also fine.
I wrote a SYCL program that demonstrates what I'm trying to do with a custom DSL + SPIRV code generator (TVM).
https://gist.github.com/masahi/92f45c2f9ab456259fbb8903c864b2d3
Interestingly, the SYCL program above works correctly. So I tried to compare the difference in the generated SPIRV between clang and the custom code generator in TVM, and also the final ISA generated for the two cases.
Here are the two IGC dumps:
dump_sycl.zip (result is correct)
dump_tvm.zip (result is incorrect)
The two SPIRV look very similar, the only big difference is how the local memory is declared. While the SYCL one passes local memory as a argument to the kernel, the TVM one declares it as a global variable with a static allocation size. So for the latter case, LLVM IR begins with:
@main_kernel0.X_shared = addrspace(3) global [256 x float] undef, section "localSLM", align 4
@main_kernel0.W_shared = addrspace(3) global [128 x float] undef, section "localSLM", align 4
...
define spir_kernel void @main_kernel0(float addrspace(1)* noalias %X, float addrspace(1)* noalias %W, float addrspace(1)* noalias %compute, <8 x i32> %r0, <8 x i32> %payloadHeader, i16 %localIdX, i16 %localIdY, i16 %localIdZ, i32 %bufferOffset, i32 %bufferOffset1, i32 %bufferOffset2) #0 {
entry:
..
And comparing the two ISA, I found that the SYCL one has sync.allrd instruction before DPAS:
sync.nop null {Compacted,A@1} // $202
sync.allrd ($0,$1) // $202
dpas.8x8 (8|M0) r31:f null:f r23:hf r63.0:hf {$8} // $202
while the TVM doesn't:
sync.nop null {Compacted,F@2} // $74
dpas.8x8 (8|M0) r77:f null:f r69:hf r45.0:hf {$8} // $74
If I remove the use of local memory from the SYCL program, the sync.allrd is not generated. So I'm assuming that the presence of sync.allrd is unique to the usage of local memory, and if the absence of a necessary sync.allrd can cause a trouble for DPAS, that would explain the correctness issue I'm observing. I spent a lot of time debugging this issue, and I'm increasingly leaning toward thinking that this is another IGC miscompilation bug, similar to #275.
So here are my questions:
- Does the difference in how local memory is declared (global variable vs passed as a parameter from host) matter at all?
- Can an absence of
sync.allrdinstruction completely break the use of DPAS? - Why
sync.allrdis not generated by IGC from the SPIRV produced by TVM? (see dump_tvm.zip above)