[SYCL] Minor arithmetic improvement to MMVQ wrapper kernel #7172
Conversation
Context: I have written most of the MMVQ CUDA code, which to my understanding is what the SYCL code is adapted from. I don't understand why this change is faster.
This looks ok to me.
However, I am also confused as to how subtraction beats modulo, as the latter should be efficient. For Intel hardware, is there a significant boost as on Nvidia?
If this change looks promising, then it also needs to be addressed in dpct and icpx.
Also pinging @ggerganov for a look when available.
@JohannesGaessler Agreed that it should be handled by the compiler; in fact, a similar change to equivalent operations in other "hot spots" didn't result in such a leap in performance. So the reasons seem unclear at the moment, but the numbers might justify the patch.
@abhilash1910 Yes, it's odd indeed. Performance leaps were mainly noticed on Nvidia GPUs (some examples below), with little to no improvement on Intel targets. Some Llama 2 13B examples ((patchPerf - masterPerf) / masterPerf):
It's working on Intel iGPU, with about a 3% increase.
Revert "Minor arithmetic improvement to mmvq wrapper kernel (ggerganov#7172)" (ggerganov#7980)
* This reverts commit 8c570c9.
* Update ggml-sycl.cpp
This minor arithmetic change
(a % b = a - b * (a / b))
— apparently missed by the icpx/icx compiler — brings a considerable performance improvement (5% to 35%) to type-1 quantized models (e.g. QK_4* and QK_5_*) in text generation when targeting Nvidia devices, without hurting the performance of the other supported quantizations and devices. Tested setup for this matter:
Compiler: oneAPI 2024.1.0
Thanks for reviewing this @AidanBeltonS @abhilash1910 @airMeng @NeoZhangJianyu