
CUDA: Enable FP16_MMA for RDNA3 with rocWMMA (PoC) #9594

Open
wants to merge 1 commit into base: master

Conversation

Nekotekina

From my testing, this significantly improves batch processing when FlashAttention is enabled. It also slightly improves token generation time on the Gemma2-9B model. I'm not experienced in GPU programming, though, so the naïve port might be suboptimal for AMD.

rocWMMA: https://gpuopen.com/learn/wmma_on_rdna3/
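
For context, rocWMMA mirrors the `nvcuda::wmma` fragment API, so a naïve port mostly comes down to swapping the header and namespace. Below is a minimal illustrative sketch of a 16x16x16 FP16 matrix multiply-accumulate through that shared interface; this is not the actual PR change, and the kernel name and the `half_t` alias are made up for the example:

```cpp
#if defined(__HIP_PLATFORM_AMD__)
#include <rocwmma/rocwmma.hpp>
namespace wmma = rocwmma;
typedef rocwmma::float16_t half_t;   // illustrative FP16 alias
#else
#include <cuda_fp16.h>
#include <mma.h>
namespace wmma = nvcuda::wmma;
typedef half half_t;
#endif

// One warp/wavefront computes a 16x16 tile: D = A*B (FP16 in, FP32 out).
__global__ void wmma_f16_tile(const half_t * A, const half_t * B, float * D) {
    // Fragments: per-thread register slices with an opaque,
    // hardware-defined mapping to matrix elements.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half_t, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half_t, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);  // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);        // the matrix cores do the 16x16x16 MMA
    wmma::store_matrix_sync(D, c, 16, wmma::mem_row_major);
}
```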

@github-actions github-actions bot added the Nvidia GPU Issues specific to Nvidia GPUs label Sep 22, 2024
@JohannesGaessler
Collaborator

I do not own a compatible AMD GPU for testing and debugging. Are you going to be available long-term to fix any issues that may crop up around this? This is particularly relevant because the WMMA kernels are written poorly and I plan to replace them in the coming months with kernels that instead use PTX instructions (the CUDA equivalent of assembly) in order to get a well-defined memory layout. At that point the WMMA kernels would become deprecated and I would not want to invest the effort to maintain them myself.

@Nekotekina
Author

Hmm... if they're going to be rewritten, then it may not be worth it after all. I'll be able to test it and fix simple problems, but I still don't know much about CUDA in general. On the other hand, it sounds like a PTX implementation will make optimizing for AMD even harder in the future, and the gap will only keep growing... unless I'm misunderstanding something. Are the WMMA kernels really written so poorly that they hinder NVIDIA performance?

@JohannesGaessler
Collaborator

The bottom line is this: GPU performance is heavily impacted by memory bandwidth. First and foremost that means the bandwidth for loading data from VRAM, but for the more compute-intensive kernels it also means the bandwidth of shared memory. In both cases the data layout and the access pattern are extremely important, and the HIP conversion of the CUDA code does not seem to handle the shared memory case well. The two notable kernels with heavy shared memory use are MMQ and the FA kernels for large batch sizes (WMMA, tile F16, tile F32).
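
To make the shared memory point concrete, here is the textbook example of how the layout alone changes the access pattern. It is purely illustrative (plain float, not the FP16 tiles MMQ/FA actually use), launched with a 32x32 thread block:

```cpp
// Illustrative only: a 32x32 transpose staged through shared memory.
// Without the +1 padding, every element of a tile column falls into the
// same 4-byte shared memory bank on NVIDIA hardware, so the column-wise
// read below serializes 32-fold; the padding skews each row by one bank
// and makes the same read conflict-free. AMD wavefront width and LDS
// banking differ, which is one reason HIPified access patterns can behave
// differently than on NVIDIA.
__global__ void transpose32(const float * in, float * out) {
    __shared__ float tile[32][32 + 1];  // remove the +1 to see the conflict

    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y*32 + threadIdx.x]; // row-wise write, coalesced
    __syncthreads();
    out[threadIdx.y*32 + threadIdx.x] = tile[threadIdx.x][threadIdx.y]; // column-wise read from shared memory
}
```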

The problem with the CUDA WMMA interface is that in order to present the same interface for different tensor core hardware it has to abstract away the underlying memory layout, which is just really bad. I'm very confident that the performance will improve significantly by instead taking the same approach as with MMQ: define simple primitives in ggml/src/ggml-cuda/mma.cuh that expose the tensor cores in such a way that you get a well-defined memory layout, and use those instead.
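
To illustrate what such a primitive looks like, here is a rough sketch in the spirit of that approach (not the actual code in ggml/src/ggml-cuda/mma.cuh; the m16n8k16 variant shown here requires Ampere or newer):

```cpp
// Sketch of a tensor core primitive with an explicit register layout:
// D (16x8, FP32) += A (16x16, FP16) * B (16x8, FP16), distributed over one
// warp. Each thread holds 8 halves of A, 4 halves of B and 4 floats of D in
// registers whose mapping to matrix elements is fixed by the PTX ISA, so
// the surrounding kernel can choose its shared memory layout freely.
struct tile_a { int   x[4]; };  // 8 halves packed as 4 x 32 bit
struct tile_b { int   x[2]; };  // 4 halves packed as 2 x 32 bit
struct tile_d { float x[4]; };

static __device__ __forceinline__ void mma_m16n8k16(
        tile_d & d, const tile_a & a, const tile_b & b) {
#if __CUDA_ARCH__ >= 800
    asm("mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%0, %1, %2, %3};"
        : "+f"(d.x[0]), "+f"(d.x[1]), "+f"(d.x[2]), "+f"(d.x[3])
        : "r"(a.x[0]), "r"(a.x[1]), "r"(a.x[2]), "r"(a.x[3]),
          "r"(b.x[0]), "r"(b.x[1]));
#endif
}
```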

The best solution to the AMD performance issues would be to write dedicated ROCm kernels, but obviously that would also be the largest amount of work, especially because MMQ and FA took the most time to optimize. What could potentially also work is adapting the code in ggml/src/ggml-cuda/mma.cuh so that AMD tensor cores can be accessed through the same interface (I have not looked into this at all so far, but I would expect AMD performance similar to rocWMMA). Or maybe figure out why the performance of the FA tile F16/F32 kernels is bad on AMD, since those should also be much faster than the vector kernels for large batch sizes.
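
For what it's worth, RDNA3 exposes its WMMA instructions as compiler builtins that operate on explicit register vectors, so an mma.cuh-style interface on the HIP side could in principle look roughly like the following. This is an untested sketch based on the RDNA3 WMMA material linked above; the builtin name, operand packing and lane-to-element mapping should be verified against the ROCm documentation:

```cpp
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
// Untested sketch: unlike the opaque WMMA fragments, the RDNA3 builtin
// takes plain register vectors, so the kernel sees exactly which lane
// holds which elements (wave32; each lane holds 16 halves of A, 16 halves
// of B and 8 floats of the 16x16 FP32 accumulator).
typedef _Float16 half16 __attribute__((ext_vector_type(16)));
typedef float    float8 __attribute__((ext_vector_type(8)));

struct tile_a_rdna3 { half16 x; };
struct tile_b_rdna3 { half16 x; };
struct tile_d_rdna3 { float8 x; };

static __device__ __forceinline__ void wmma_f32_16x16x16(
        tile_d_rdna3 & d, const tile_a_rdna3 & a, const tile_b_rdna3 & b) {
    // D = A*B + D on the RDNA3 WMMA units.
    d.x = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a.x, b.x, d.x);
}
#endif
```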

In case you're not aware, I will happily talk to potential devs and explain the implementation details to them in order to help them get started - my email address can be found on my GitHub page.
