Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] Minor arithmetic improvement to MMVQ wrapper kernel #7172

Merged
merged 1 commit into from
May 10, 2024

Conversation

OuadiElfarouki
Copy link
Contributor

This minor arithmetic change (a%b = a - b*(a/b)) -apparently missed by icpx/icx compiler- brings a considerable performance improvement [5% to 35%] to type-1 quantized models (e.g. QK_4* and QK_5_*)_ in text generation when targetting Nvidia devices, without scratching performance of the other supported quantizations AND devices.

Tested devices for this matter :

  • Nvidia : A100-40GB, A4000, RTX 4060
  • intel : intel Data Center GPU Max 1100, intel Arc A770

Compiler : oneAPI 2024.1.0

Thanks for reviewing this @AidanBeltonS @abhilash1910 @airMeng @NeoZhangJianyu

@JohannesGaessler
Copy link
Collaborator

Context: I have written most of the MMVQ CUDA code which to my understanding is what the SYCL code is adapted from.

I don't understand why this change is faster. qi and vdr are powers of 2 that are both known at compile time so their ratio is also a power of 2 known at compile time. A number modulo a power of 2 should be converted by the compiler to just a binary and which should be faster than a subtraction.

Copy link
Collaborator

@abhilash1910 abhilash1910 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ok to me.
However I am also confused as to how subtraction beats modulo as the later should be efficient . For Intel hardware , is there a significant boost as in Nvidia ?
If this is something which looks promising then such changes need to be addressed in dpct and icpx .
Also pinging @ggerganov for a look when available.

@mofosyne mofosyne added the performance Speed related topics label May 9, 2024
@OuadiElfarouki
Copy link
Contributor Author

Context: I have written most of the MMVQ CUDA code which to my understanding is what the SYCL code is adapted from.

I don't understand why this change is faster. qi and vdr are powers of 2 that are both known at compile time so their ratio is also a power of 2 known at compile time. A number modulo a power of 2 should be converted by the compiler to just a binary and which should be faster than a subtraction.

@JohannesGaessler Agree it should be handled by the compiler, in fact a similar change to equivalent operations in other "hot spots" didn't result in such a leap in performance. So reasons seem unclear atm but the numbers might justify the patch.

@OuadiElfarouki
Copy link
Contributor Author

OuadiElfarouki commented May 9, 2024

This looks ok to me. However I am also confused as to how subtraction beats modulo as the later should be efficient . For Intel hardware , is there a significant boost as in Nvidia ? If this is something which looks promising then such changes need to be addressed in dpct and icpx . Also pinging @ggerganov for a look when available.

@abhilash1910 Yes it's odd indeed, performance leap were mainly noticed on Nvidia GPUs (some examples below), no to very little improvement on intel targets though.

Some LLama2 13B examples (patchPerf-masterPerf)/masterPerf):

  • Nvidia A4000

<style type="text/css"></style>

Q4_K_M 36%
Q4_K_S 52%
Q5_K_M 38%
Q5_K_S 52%
  • Nvidia A100 :

<style type="text/css"></style>

Q4_K_M 62%
Q4_K_S 70%
Q5_K_M 57%
Q5_K_S 43%

@mofosyne mofosyne added the Review Complexity : High Generally require indepth knowledge of LLMs or GPUs label May 9, 2024
@NeoZhangJianyu
Copy link
Collaborator

NeoZhangJianyu commented May 10, 2024

It's working on Intel iGPU. About 3% increase.
It's good!

@NeoZhangJianyu NeoZhangJianyu merged commit 8c570c9 into ggerganov:master May 10, 2024
58 checks passed
joeatodd added a commit that referenced this pull request Jun 17, 2024
joeatodd added a commit that referenced this pull request Jun 18, 2024
#7172)" (#7980)

* Revert "Minor arithmetic improvement to mmvq wrapper kernel (#7172)"

This reverts commit 8c570c9.

* Update ggml-sycl.cpp
Alcpz pushed a commit to Alcpz/llama.cpp that referenced this pull request Jun 20, 2024
ggerganov#7172)" (ggerganov#7980)

* Revert "Minor arithmetic improvement to mmvq wrapper kernel (ggerganov#7172)"

This reverts commit 8c570c9.

* Update ggml-sycl.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Speed related topics Review Complexity : High Generally require indepth knowledge of LLMs or GPUs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants