CUDA: faster non k-quant mul_mat_q kernels #2483
Conversation
Another excellent contribution! Processing 1800 tokens with 13B q5_1 on an RTX 2060 laptop results in prompt processing at 11.7 ms/t with this PR; with cuBLAS it's 14.8 ms/t. A very noticeable and most welcome speed boost!
3090 Ti / WSL2 / 7B q4_0 prompt processing:
Master: 1544 t/s
PR: 1776 t/s
Force-pushed from ef63142 to d6154f5.
There might be a hiccup there.
Sorry, but I don't see how that could in any way be related to the changes I made in this PR.
It's weird for me too, so I redid the test. At 3800 tokens in context with your PR (63/63, batch size 256, max context 5632), VRAM fills up and I get "CUDA error 2 at B:\kobold2\ggml-cuda.cu:4194: out of memory" on the aforementioned 33b K_3S model. Without your PR, at 3800 tokens in context, the VRAM of my 3090 is at 24075/24576 MB occupied. In both cases, the VRAM occupation at zero tokens in context is 23893 MB once the model is loaded. I don't think I picked a wrong PR prior to compilation.
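(As a side note for anyone reproducing these VRAM figures: they can be checked programmatically. Below is a minimal sketch, not from this PR, using only the standard CUDA runtime API; error code 2 is `cudaErrorMemoryAllocation`, the "out of memory" error quoted above.)

```cuda
// Minimal sketch for checking VRAM occupancy with the standard CUDA
// runtime API. Illustrative only; not code from this PR or koboldcpp.
#include <cuda_runtime.h>
#include <cstdio>

int main(void) {
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        // err == 2 (cudaErrorMemoryAllocation) is the "out of memory" case
        fprintf(stderr, "CUDA error %d: %s\n", (int) err, cudaGetErrorString(err));
        return 1;
    }
    printf("VRAM used: %zu / %zu MiB\n",
           (total_bytes - free_bytes) >> 20, total_bytes >> 20);
    return 0;
}
```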
@Nexesenex Just so everyone is on the same page, you are comparing commits 4f6b60c and 468ea24?
@Nexesenex For best results and to avoid any complicating factors, I'd recommend benchmarking directly with code from this repo for comparison - although I use the CUDA code verbatim when I merge downstream, there might be other components in koboldcpp that could influence speed or memory usage. I haven't tested this PR myself yet; I will revisit this when I merge it in the next release.
On llama.cpp I can't reproduce the issue. On my machine VRAM usage is exactly the same.
Then I might have compiled a little Frankenstein, Johannes. Maybe it's about CUDA 11.4, I don't know. I'll compile and test your next experimental build, Lostruins, once it includes this PR, and report here if I still have the issue.
The memory leak problem is solved for me with the latest experimental build of KoboldCPP, which includes the present commit and its later revision (f514d1b) and was also compiled with the additional PR #2529, on the same model with the same settings.
This PR adds performance optimizations for the non k-quant CUDA mul_mat_q kernels. The changes are mostly two things:
These are the results on my system:
For reference: the speed of cuBLAS is ~1500 t/s on my RTX 3090 and ~500 t/s on my P40. So for non k-quants the mul_mat_q kernels now seem to be universally faster than cuBLAS.
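(For readers unfamiliar with the approach: the core idea behind the mul_mat_q kernels is to compute the dot products directly on the quantized integer data with `__dp4a`, instead of dequantizing to floats and calling cuBLAS. The following is a hypothetical, heavily simplified sketch of that idea; the block layout and names are illustrative and are not the actual ggml-cuda.cu definitions.)

```cuda
// Heavily simplified sketch of the mul_mat_q idea: a dot product between
// two quantized vectors computed on the int8 data via __dp4a, with the
// per-block float scales applied afterwards. Names and layout are
// illustrative only, not the real ggml-cuda.cu code.
#include <cuda_runtime.h>
#include <cstdint>

#define QK8 32  // values per quantized block (illustrative)

struct block_q8 {
    float  d;        // per-block scale
    int8_t qs[QK8];  // quantized values
};

// Grid-stride loop: each thread accumulates a partial dot product over
// a subset of the quantized blocks, then adds it to the result.
__global__ void vec_dot_q8(const block_q8 * x, const block_q8 * y,
                           float * dst, int nblocks) {
    float sum = 0.0f;
    for (int ib = blockIdx.x * blockDim.x + threadIdx.x; ib < nblocks;
         ib += gridDim.x * blockDim.x) {
        const int * xi = (const int *) x[ib].qs;
        const int * yi = (const int *) y[ib].qs;
        int sumi = 0;
        // 8 x __dp4a = 32 int8 products per block, accumulated as integers
        #pragma unroll
        for (int k = 0; k < QK8 / 4; ++k) {
            sumi = __dp4a(xi[k], yi[k], sumi);
        }
        sum += x[ib].d * y[ib].d * sumi;  // apply the two block scales
    }
    atomicAdd(dst, sum);
}
```

Doing the arithmetic on packed int8 values keeps the data in its quantized form all the way through the kernel, which is why such kernels can beat a dequantize-then-cuBLAS path on memory-bound matrix multiplications.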