CUDA: Enable K-shift operation for -ctk q8_0 (limited) #9571
Conversation
I cannot reproduce the issue that you reported with memory usage; in my tests the allocator correctly reuses the memory of the previous tensors automatically.

It would also be very desirable to implement this in the CPU backend, and to add a test for it.

For cases where the input and output of the copy are contiguous (as they are here), this could also be implemented using the existing dequantize functions in the CUDA backend, which would allow it to work with any format and likely with better performance.
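As a rough illustration of that last point (not code from this PR): when both tensors are contiguous, the whole copy collapses into a single dequantize call over the total element count. The callback type and function names below are placeholders standing in for a per-type dequantize helper, not actual ggml symbols.

```cpp
#include <cstdint>

// Hypothetical sketch, not the PR's code: a contiguous quantized -> f32 copy
// expressed through a per-type dequantize callback (one call, no row/stride
// handling). The typedef and names are illustrative only.
typedef void (*to_float_t)(const void * x, float * y, int64_t k);

static void cpy_quant_to_f32_contiguous(to_float_t  to_float,
                                        const void * src,
                                        float      * dst,
                                        int64_t      n) {
    // both buffers are contiguous, so dequantizing all n elements at once
    // is equivalent to the element-wise copy
    to_float(src, dst, n);
}
```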
(force-pushed from ea88404 to eec216c)
Right, I simplified it, thanks. But where are the dequantize functions you are talking about? Which ones do you mean?
I also forgot to ask: is there a reason to use f32 over f16? The performance of the q8_0 K-shift doesn't seem great in this PR; I wonder if using f16 could improve it...
Yes, I mean the functions from `convert.cu` in the CUDA backend.
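A minimal sketch of what such a convert-based copy could look like on the CUDA side; `ggml_get_to_fp32_cuda`, `to_fp32_cuda_t`, and the include path are assumptions about the convert helpers' shape, not code taken from this PR.

```cpp
#include "ggml.h"
#include "ggml-cuda/convert.cuh"   // assumed location of the CUDA convert helpers

// Hypothetical sketch, not the PR's code. Assumes a per-type dequantize helper
// ggml_get_to_fp32_cuda(type) returning a function with a
// (const void * src, float * dst, int64_t nelements, cudaStream_t) signature.
static void cpy_q_to_f32_contiguous(const ggml_tensor * src, ggml_tensor * dst,
                                    cudaStream_t stream) {
    GGML_ASSERT(ggml_is_contiguous(src) && ggml_is_contiguous(dst));
    GGML_ASSERT(dst->type == GGML_TYPE_F32);

    // contiguous source and destination: the copy is a single dequantize call
    // over all elements of the source tensor
    const to_fp32_cuda_t to_fp32 = ggml_get_to_fp32_cuda(src->type);
    to_fp32(src->data, (float *) dst->data, ggml_nelements(src), stream);
}
```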
Great, it works well in actual use. Is there a plan to support more quantization types, such as q4_0?
I'd like to improve it more, as @slaren mentioned, but I'm a bit confused by the APIs.
If I use a convert function like this, what should go in the remaining arguments?
Yes, that looks right.
llama: enable K-shift for quantized KV cache. It will fail on unsupported backends or quant types.
This is a reworked version of #5653.
Some CUDA code was adapted from 3d92acf.
The original PR had an explosive GPU memory requirement.
I'm not sure if that is a bug or the intended logic of the allocator. I worked around it by reusing the same tensor, and it seems to work well for me.
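For context, a minimal sketch of the "reuse the same tensor" idea described above; this is not the PR's actual graph code, and all names and shapes are illustrative. `ggml_cpy`, `ggml_new_tensor_1d`, and `ggml_build_forward_expand` are standard ggml calls; everything else is made up for the example.

```cpp
#include "ggml.h"

// Hypothetical sketch of the memory-reuse workaround, not the PR's code:
// one f32 scratch tensor is shared by every layer's K-shift instead of
// allocating a fresh dequantized copy per layer (which blew up memory).
static void build_k_shift_reuse(struct ggml_context * ctx,
                                struct ggml_cgraph  * gf,
                                struct ggml_tensor ** k_l,   // per-layer quantized K cache tensors (illustrative)
                                int                   n_layer,
                                int64_t               n_elem) {
    // single shared scratch tensor instead of n_layer temporaries
    struct ggml_tensor * tmp = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elem);

    for (int il = 0; il < n_layer; ++il) {
        // dequantize: quantized K -> f32 scratch (ggml_cpy converts between types)
        struct ggml_tensor * cur = ggml_cpy(ctx, k_l[il], tmp);
        // ... the RoPE shift would be applied to `cur` here ...
        // quantize back: f32 scratch -> the same quantized K tensor
        cur = ggml_cpy(ctx, cur, k_l[il]);
        ggml_build_forward_expand(gf, cur);
    }
}
```

The point of the pattern is only that the scratch allocation does not scale with the number of layers; whether the allocator is expected to fold the per-layer temporaries on its own is the open question mentioned above.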