
CUDA: Enable FP16_MMA for RDNA3 with rocWMMA (PoC) #9594

Open
wants to merge 1 commit into base: master

Conversation

Nekotekina

From my testing, this significantly improves batch processing when FlashAttention is enabled. It also slightly improves token generation time on the Gemma2-9B model. I'm not experienced in GPU programming, though, so the naïve port might be suboptimal for AMD.

rocWMMA: https://gpuopen.com/learn/wmma_on_rdna3/
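
For context, rocWMMA mirrors the `nvcuda::wmma` fragment API, so a naïve port mostly comes down to swapping the header and namespace. Below is a minimal illustrative sketch of a 16x16x16 FP16 matrix multiply-accumulate through that shared interface; this is not the actual PR change, and the kernel name and the `half_t` alias are made up for the example:

```cpp
#if defined(__HIP_PLATFORM_AMD__)
#include <rocwmma/rocwmma.hpp>
namespace wmma = rocwmma;
typedef rocwmma::float16_t half_t;   // illustrative FP16 alias
#else
#include <cuda_fp16.h>
#include <mma.h>
namespace wmma = nvcuda::wmma;
typedef half half_t;
#endif

// One warp/wavefront computes a 16x16 tile: D = A*B (FP16 in, FP32 out).
__global__ void wmma_f16_tile(const half_t * A, const half_t * B, float * D) {
    // Fragments: per-thread register slices with an opaque,
    // hardware-defined mapping to matrix elements.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half_t, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half_t, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);  // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);        // the matrix cores do the 16x16x16 MMA
    wmma::store_matrix_sync(D, c, 16, wmma::mem_row_major);
}
```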

@github-actions github-actions bot added the Nvidia GPU Issues specific to Nvidia GPUs label Sep 22, 2024
@JohannesGaessler
Collaborator

I do not own a compatible AMD GPU for testing and debugging. Are you going to be available long-term to fix any issues that may crop up around this? This is particularly relevant because the WMMA kernels are written poorly and I plan to replace them in the coming months with kernels that instead use PTX instructions (the CUDA equivalent of assembly) in order to get a well-defined memory layout. At that point the WMMA kernels would become deprecated and I would not want to invest the effort to maintain them myself.

@Nekotekina
Author

Hmm... if they're going to be rewritten, then it may not be worth it after all. I'll be able to test it and fix simple problems, but I still don't know much about CUDA in general. On the other hand, it sounds like a PTX implementation will make optimizing for AMD even harder in the future, and the gap will only keep growing... unless I'm misunderstanding something. Are the WMMA kernels really written so poorly that they hinder NVIDIA performance?

@JohannesGaessler
Collaborator

The bottom line is this: GPU performance is heavily impacted by memory bandwidth. First and foremost that means the bandwidth for loading data from VRAM, but for the more compute-intensive kernels it also means the bandwidth of shared memory. In both cases the data layout and the access pattern are extremely important, and the HIP conversion of the CUDA code does not seem to handle the shared memory case well. The two notable kernels with heavy shared memory use are MMQ and the FA kernels for large batch sizes (WMMA, tile F16, tile F32).
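
To make the shared memory point concrete, here is the textbook example of how the layout alone changes the access pattern. It is purely illustrative (plain float, not the FP16 tiles MMQ/FA actually use), launched with a 32x32 thread block:

```cpp
// Illustrative only: a 32x32 transpose staged through shared memory.
// Without the +1 padding, every element of a tile column falls into the
// same 4-byte shared memory bank on NVIDIA hardware, so the column-wise
// read below serializes 32-fold; the padding skews each row by one bank
// and makes the same read conflict-free. AMD wavefront width and LDS
// banking differ, which is one reason HIPified access patterns can behave
// differently than on NVIDIA.
__global__ void transpose32(const float * in, float * out) {
    __shared__ float tile[32][32 + 1];  // remove the +1 to see the conflict

    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y*32 + threadIdx.x]; // row-wise write, coalesced
    __syncthreads();
    out[threadIdx.y*32 + threadIdx.x] = tile[threadIdx.x][threadIdx.y]; // column-wise read from shared memory
}
```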

The problem with the CUDA WMMA interface is that in order to present the same interface for different tensor core hardware it has to abstract away the underlying memory layout, which is just really bad. I'm very confident that the performance will improve significantly by instead taking the same approach as with MMQ: define simple primitives in ggml/src/ggml-cuda/mma.cuh that expose the tensor cores in such a way that you get a well-defined memory layout, and use those instead.
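
To illustrate what such a primitive looks like, here is a rough sketch in the spirit of that approach (not the actual code in ggml/src/ggml-cuda/mma.cuh; the m16n8k16 variant shown here requires Ampere or newer):

```cpp
// Sketch of a tensor core primitive with an explicit register layout:
// D (16x8, FP32) += A (16x16, FP16) * B (16x8, FP16), distributed over one
// warp. Each thread holds 8 halves of A, 4 halves of B and 4 floats of D in
// registers whose mapping to matrix elements is fixed by the PTX ISA, so
// the surrounding kernel can choose its shared memory layout freely.
struct tile_a { int   x[4]; };  // 8 halves packed as 4 x 32 bit
struct tile_b { int   x[2]; };  // 4 halves packed as 2 x 32 bit
struct tile_d { float x[4]; };

static __device__ __forceinline__ void mma_m16n8k16(
        tile_d & d, const tile_a & a, const tile_b & b) {
#if __CUDA_ARCH__ >= 800
    asm("mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%0, %1, %2, %3};"
        : "+f"(d.x[0]), "+f"(d.x[1]), "+f"(d.x[2]), "+f"(d.x[3])
        : "r"(a.x[0]), "r"(a.x[1]), "r"(a.x[2]), "r"(a.x[3]),
          "r"(b.x[0]), "r"(b.x[1]));
#endif
}
```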

The best solution to the AMD performance issues would be to write dedicated ROCm kernels, but obviously that would also be the largest amount of work, especially because MMQ and FA took the most time to optimize. What could potentially also work is adapting the code in ggml/src/ggml-cuda/mma.cuh so that AMD tensor cores can be accessed through the same interface (I have not looked into this at all so far, but I would expect AMD performance similar to rocWMMA). Or maybe figure out why the performance of the FA tile F16/F32 kernels is bad on AMD, since those should also be much faster than the vector kernels for large batch sizes.
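
For what it's worth, RDNA3 exposes its WMMA instructions as compiler builtins that operate on explicit register vectors, so an mma.cuh-style interface on the HIP side could in principle look roughly like the following. This is an untested sketch based on the RDNA3 WMMA material linked above; the builtin name, operand packing and lane-to-element mapping should be verified against the ROCm documentation:

```cpp
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
// Untested sketch: unlike the opaque WMMA fragments, the RDNA3 builtin
// takes plain register vectors, so the kernel sees exactly which lane
// holds which elements (wave32; each lane holds 16 halves of A, 16 halves
// of B and 8 floats of the 16x16 FP32 accumulator).
typedef _Float16 half16 __attribute__((ext_vector_type(16)));
typedef float    float8 __attribute__((ext_vector_type(8)));

struct tile_a_rdna3 { half16 x; };
struct tile_b_rdna3 { half16 x; };
struct tile_d_rdna3 { float8 x; };

static __device__ __forceinline__ void wmma_f32_16x16x16(
        tile_d_rdna3 & d, const tile_a_rdna3 & a, const tile_b_rdna3 & b) {
    // D = A*B + D on the RDNA3 WMMA units.
    d.x = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a.x, b.x, d.x);
}
#endif
```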

In case you're not aware, I will happily talk to potential devs and explain the implementation details to them in order to help them get started - my email address can be found on my GitHub page.
