AWQ: Up to 2.66x higher throughput #2566
Conversation
Does it affect the accuracy?
@casper-hansen that's really cool, in line with the bench here where indeed cuBLAS (which the exllama kernel uses for longer sequences) is just better than the AWQ GEMM kernel. I think it would make our life easier if we had the same kind of dispatch for Marlin.
This should have no impact on accuracy. The dequantization kernel is strictly equivalent to the dequantization from the GEMM kernel.
Yes, I agree that the Marlin kernels could achieve even higher throughput. The most crucial part is just missing: Marlin only supports symmetric quantization.
Is it that crucial though? Many int4*fp16 models use symmetric weight quantization successfully.
It may turn out to just be an engineering problem, but from my limited experience, the most popular symmetric weight quantization methods suffer from a higher quantization error. |
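For context, the difference these comments are circling around is whether the quantizer keeps a per-group zero-point. Below is a minimal sketch of the two dequantization rules; the tensor names and per-group layout are simplified illustrations, not the actual AWQ or Marlin kernel code:

```python
import torch

def dequant_symmetric(q, scale):
    # Symmetric int4: q in [-8, 7], w ≈ q * s (the scheme Marlin supports).
    return q.to(torch.float16) * scale

def dequant_asymmetric(q, zero, scale):
    # Asymmetric int4: q in [0, 15] with a zero-point, w ≈ (q - z) * s
    # (the scheme AWQ uses). The zero-point gives extra freedom to track
    # skewed weight distributions, at the cost of a subtraction per weight.
    return (q.to(torch.float16) - zero.to(torch.float16)) * scale

# Toy example: one group of 8 weights with one scale / zero-point per group.
q_sym = torch.randint(-8, 8, (8,))
q_asym = torch.randint(0, 16, (8,))
scale = torch.tensor(0.01, dtype=torch.float16)
zero = torch.tensor(8)
w_sym = dequant_symmetric(q_sym, scale)
w_asym = dequant_asymmetric(q_asym, zero, scale)
```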
Hi @casper-hansen, thanks for submitting the PR! Left some minor comments. Please take a look.
@WoosukKwon Thanks for the review. I applied your suggested fixes and tested that throughput is as expected.
LGTM! thanks for the fix!
I tested Llama-13B on A30 with tensor parallel size 4, and I found that AWQ throughput is lower than FP16.
This is as expected. You cannot exceed W16A16 performance with W4A16 when you test for throughput. You would need W4A4 (Atom, lower-quality model) or W8A8 (SmoothQuant, also lower-quality model). This is because W4A16 methods require dequantization, so when you test throughput you become compute bound, and the dequantization overhead limits performance. EDIT: The throughput can also be lower if the TP implementation is not optimized for quantized models; I'm not sure if it is in vLLM.
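To make the compute-bound argument concrete, here is a back-of-the-envelope arithmetic-intensity sketch for a single 4096x4096 linear layer. The layer size and the framing are my own illustration, not measurements from this PR:

```python
# Rough roofline-style estimate (illustrative numbers only).
def bytes_moved(batch, n=4096, k=4096, weight_bits=16):
    acts = batch * k * 2 + batch * n * 2   # FP16 activations in and out
    weights = n * k * weight_bits // 8     # weight bytes read from HBM
    return acts + weights

def flops(batch, n=4096, k=4096):
    return 2 * batch * n * k               # matmul FLOPs, unchanged by W4

for batch in (1, 256):
    ai_fp16 = flops(batch) / bytes_moved(batch, weight_bits=16)
    ai_w4 = flops(batch) / bytes_moved(batch, weight_bits=4)
    print(f"batch={batch}: FP16 AI={ai_fp16:.1f}, W4A16 AI={ai_w4:.1f}")

# At batch 1 the layer is memory bound, so reading 4-bit weights helps latency.
# At large batch (throughput benchmarks) both variants are compute bound on the
# same FP16 FLOPs, and W4A16 additionally pays for dequantization.
```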
out = ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)  # dequantize packed AWQ weights to FP16
out = torch.matmul(reshaped_x, out)                         # plain FP16 matmul on the dequantized weights
Curious to learn: would this copy the dequantized weights back to memory before doing torch.matmul? And is a potential optimization to implement a more efficient mixed-precision matmul that saves one data transfer to memory?
You are probably right that there is potential to eliminate overhead. Exllama runs dequantization and then directly calls cublas for matmul inside the same CUDA kernel. Definitely something to explore!
The strategy is to dequantize and run an FP16 matmul for longer sequences. This could probably be faster if we just used cuBLAS directly instead of torch.matmul. EDIT: It seems throughput can be over 2x in vLLM because context processing is such a crucial part of the framework.
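Putting the pieces together, the dispatch described here amounts to choosing per call between the fused AWQ GEMM kernel (small token counts) and dequantize + FP16 matmul (longer sequences). Below is a hedged sketch of that logic with a made-up threshold and simplified argument handling; the real heuristic, module path, and op signatures in vLLM may differ:

```python
import torch
from vllm._C import ops  # module path assumed; adjust to your vLLM version

def awq_apply_weights(reshaped_x, qweight, scales, qzeros, pack_factor):
    # Heuristic: for large token counts the problem is compute bound, so
    # dequantizing once and running an FP16 matmul (cuBLAS under the hood)
    # beats the AWQ GEMM kernel. The threshold below is illustrative only.
    num_tokens = reshaped_x.shape[0]
    if num_tokens >= 256:
        weight_fp16 = ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)
        return torch.matmul(reshaped_x, weight_fp16)
    # Small batches (e.g. decoding) stay on the fused AWQ GEMM kernel,
    # which avoids materializing the full FP16 weight matrix.
    return ops.awq_gemm(reshaped_x, qweight, scales, qzeros, pack_factor)
```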
Tested on 1x A100 (80GB).