Skip to content

Strange result when using DPAS with local memory #279

@masahi

Description

@masahi

Hi, I'm continuing to investigate the strange correctness issue I described in #275 (one of two DPAS calls is returning a zero matrix, i.e. the values of C matrix in A * B + C). I found that the issue arises when all of the followings are true:

(1) DPAS is used
(2) local memory is used
(3) Multiple DPAS are executed by a single workgroup, either via a single subgroup doing multiple DPAS or multiple subgroups (say, in a workgroup with 2 x 8 threads) doing one DPAS each.

Any combination of the two conditions above works fine, so for example if I don't use local memory or DPAS at all, everything is fine. Or using DPAS + local memory to compute a single fp16 m8n8k16 matmul is also fine.

I wrote a SYCL program that demonstrates what I'm trying to do with a custom DSL + SPIRV code generator (TVM).
https://gist.github.com/masahi/92f45c2f9ab456259fbb8903c864b2d3

Interestingly, the SYCL program above works correctly. So I tried to compare the difference in the generated SPIRV between clang and the custom code generator in TVM, and also the final ISA generated for the two cases.

Here are the two IGC dumps:
dump_sycl.zip (result is correct)
dump_tvm.zip (result is incorrect)

The two SPIRV look very similar, the only big difference is how the local memory is declared. While the SYCL one passes local memory as a argument to the kernel, the TVM one declares it as a global variable with a static allocation size. So for the latter case, LLVM IR begins with:

@main_kernel0.X_shared = addrspace(3) global [256 x float] undef, section "localSLM", align 4
@main_kernel0.W_shared = addrspace(3) global [128 x float] undef, section "localSLM", align 4
...
define spir_kernel void @main_kernel0(float addrspace(1)* noalias %X, float addrspace(1)* noalias %W, float addrspace(1)* noalias %compute, <8 x i32> %r0, <8 x i32> %payloadHeader, i16 %localIdX, i16 %localIdY, i16 %localIdZ, i32 %bufferOffset, i32 %bufferOffset1, i32 %bufferOffset2) #0 {
entry:
..

And comparing the two ISA, I found that the SYCL one has sync.allrd instruction before DPAS:

        sync.nop                             null                             {Compacted,A@1}        // $202
        sync.allrd                           ($0,$1)                                                 // $202
        dpas.8x8 (8|M0)          r31:f         null:f            r23:hf            r63.0:hf         {$8} // $202

while the TVM doesn't:

        sync.nop                             null                             {Compacted,F@2}        // $74
        dpas.8x8 (8|M0)          r77:f         null:f            r69:hf            r45.0:hf         {$8} // $74

If I remove the use of local memory from the SYCL program, the sync.allrd is not generated. So I'm assuming that the presence of sync.allrd is unique to the usage of local memory, and if the absence of a necessary sync.allrd can cause a trouble for DPAS, that would explain the correctness issue I'm observing. I spent a lot of time debugging this issue, and I'm increasingly leaning toward thinking that this is another IGC miscompilation bug, similar to #275.

So here are my questions:

  • Does the difference in how local memory is declared (global variable vs passed as a parameter from host) matter at all?
  • Can an absence of sync.allrd instruction completely break the use of DPAS?
  • Why sync.allrd is not generated by IGC from the SPIRV produced by TVM? (see dump_tvm.zip above)

cc @mbelicki @mnaczk

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions