cuda : use amd wave sharing intrinsics for warp_reduce functions #6522
base: master
Conversation
In a quick test I am essentially getting the same results on my RX 6800.
Sorry, what I said was misleading. What I meant to say is that I also only observe a small speedup. But I am not observing a performance regression. These are the specific numbers:
What is the reason for not merging this PR?
@Engininja2 reported an inconsistent and minimal change in performance and there have been no further updates from him since. As such I don't think it makes sense to merge this PR in its current state, especially when I don't have a good understanding of what this code actually does.
@Engininja2 it seems like a recent merge broke compilation on my 7900 XTX on Windows.
Force-pushed from 62e98ad to 7a3f7e9
I rebased the PR on current master. llama.cpp doesn't spend much time synchronizing between GPU threads so a tiny performance increase probably isn't worth the maintenance burden of additional code. Would this be better converted to a draft PR, or closed?
Check performance for
with an actual path to a model. The new FlashAttention kernels (I think) are more sensitive to the changes in this PR.
Unfortunately I get a large performance regression for flash attention.
For gfx1010 I tried replacing the direct `half2` additions with separate additions of the x and y components, and got this result with 75 VGPRs spilled:
It's better, but still a 5% slowdown, and it feels like a bad idea to hope that compiler heuristics happen to go the right way.
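For illustration, here is a minimal sketch of the kind of change being described: splitting the packed `half2` addition in a butterfly warp reduction into separate low/high-half adds. The function name and loop shape are assumptions for the sketch, not the code from this PR, and HIP builds are assumed to map `__shfl_xor_sync` onto `__shfl_xor`.

```cpp
#include <cuda_fp16.h> // hypothetical sketch; HIP builds would use <hip/hip_fp16.h>

// Butterfly sum reduction over a 32-lane warp, but with the packed half2 add
// (__hadd2) replaced by separate adds of the low and high halves.
static __device__ __forceinline__ half2 warp_reduce_sum_split(half2 a) {
#pragma unroll
    for (int mask = 16; mask > 0; mask >>= 1) {
        const half2 b = __shfl_xor_sync(0xffffffff, a, mask, 32);
        a = __halves2half2(__hadd(__low2half(a),  __low2half(b)),
                           __hadd(__high2half(a), __high2half(b)));
    }
    return a;
}
```

Whether the compiler keeps this in packed registers or spills (as the 75 spilled VGPRs above suggest) is entirely up to its heuristics, which is the concern raised here.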
Generally speaking, the current FlashAttention kernels are bad for large batch sizes anyway, so the real question is how the performance will change for kernels that are specifically written for large batch sizes. I think a PR that gives +3% performance in some cases but -23% in others should not be merged.
It does provide a decent speedup for token generation, about 2-3%, which is useful.
FWIW, compared to the current master, I can't replicate the regression on my RX 6800 XT, and in fact for large batches the speedup increases accordingly when using flash attention. I did not observe any change in token generation speed (positive or negative).
So this may be worth revisiting and/or gating behind checks for specific GPU models. A 12% boost is nothing to sneeze at.
The warp_reduce functions use `__shfl_xor_sync()`, which on AMD HIP gets turned into the `ds_bpermute` instruction; that requires some setup instructions and costs local data share (LDS) bandwidth and latency. AMD GPUs also support Data-Parallel Primitive (DPP) instructions, which can perform certain forms of sharing within a warp for free or nearly free.

Using the DPP intrinsic in this PR, my GPU gets 1% faster token generation times, but oddly 0-1% slower prompt processing times. Maybe it has to do with there being 4 fewer waits on LDS memory per reduction and how the GPU schedules warps, but I really have no idea. Other GPU models might behave differently.
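For context, the reduction being discussed is, roughly, a butterfly over XOR lane masks; a sketch of that shape (the exact definitions in llama.cpp may differ slightly):

```cpp
// Sketch of the shfl-based warp reduction the description refers to. On AMD
// HIP, __shfl_xor_sync / __shfl_xor lowers to ds_bpermute, which routes data
// through the LDS crossbar; this PR replaces that path with DPP-based sharing.
static __device__ __forceinline__ float warp_reduce_sum(float x) {
#pragma unroll
    for (int mask = 16; mask > 0; mask >>= 1) {
        x += __shfl_xor_sync(0xffffffff, x, mask, 32);
    }
    return x;
}
```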
Details on the instructions available for sharing within a warp are on AMD's website and within each ISA's documentation.
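As a purely illustrative sketch of the intrinsics involved (not the code in this PR), the clang builtin `__builtin_amdgcn_mov_dpp` can exchange a value with the neighbouring lane (lane ^ 1) via a `quad_perm` DPP control, with no LDS traffic; availability of the builtin and of DPP hardware (roughly GFX8 and newer) is assumed:

```cpp
// Hypothetical example: DPP quad_perm:[1,0,3,2] (control 0xB1) swaps lane pairs
// within each quad (0<->1, 2<->3), i.e. a shuffle by lane ^ 1, without LDS.
// __builtin_amdgcn_mov_dpp operates on 32-bit ints, so bit-cast the float.
__device__ __forceinline__ float dpp_xor1(float x) {
    const int src = __float_as_int(x);
    const int dst = __builtin_amdgcn_mov_dpp(src,
                                             0xB1, // dpp_ctrl: quad_perm:[1,0,3,2]
                                             0xF,  // row_mask: all rows enabled
                                             0xF,  // bank_mask: all banks enabled
                                             true  // bound_ctrl
                                             );
    return __int_as_float(dst);
}
```

Covering larger XOR offsets within a wave needs other controls (e.g. row_mirror / row_half_mirror, or the permlane instructions on GFX10+), and the available instructions differ between GPU generations.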
I'd like to note that if a warp_reduce function is called with some threads inactive, the results will differ between GFX9 GPUs and GFX10+. I don't think this matters, though, since according to the CUDA documentation calling `__shfl_xor_sync()` with inactive threads is undefined on NVIDIA GPUs anyway.

Here are my llama-bench results:
I also tried testing SOFT_MAX and saw a 0-8% speedup, except for the cases with `[1024,1024,1,1]` or `[1023,1023,1,1]` that had a max bias of 8, which were ~5% slower. Again, I'm not sure why.