SqueezeLLM Support #1326
Hi @chooper1, thanks for submitting the PR! Before getting into review, could you check the code format? Please run the following and upstream the changes:
    pip install -r requirements-dev.txt
    ./format.sh
This is super interesting work, especially after the release of the quantization code to produce newly quantized models. I am curious whether Woosuk or the author could run benchmarks/benchmark_throughput.py to compare the throughput of FP16 versus SqueezeLLM. EDIT: I am getting very low throughput at low batch sizes, around 14.35 tokens/s. Is this the expected performance?
@chooper1 Sorry for the late review. The PR looks good to me! I've updated it with the latest main branch and modified QuantizationConfig, as I found that the original interface of QuantizationConfig was overfitted to AWQ. Thanks again for the great work!
As the next step, I hope we can see more SqueezeLLM models, especially Mistral and Falcon. Also, please consider optimizing the matmul kernel.
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Hi @casper-hansen, it seems there are some issues in SqueezeLLM-gradients. Have you produced SqueezeLLM-gradients for Llama-2-13B? I modified the ...
This PR adds support for the SqueezeLLM quantization method, which is described in the following preprint: https://arxiv.org/abs/2306.07629. Open-source GPU inference code and quantization code are available at: https://github.com/SqueezeAILab/SqueezeLLM. SqueezeLLM is a post-training quantization framework that allows for high-accuracy and runtime-efficient quantization at low bit precision. It leverages non-uniform quantization to better represent the underlying weight distribution by shifting the quantization signposts to the optimal positions. This PR contains the kernels and quantization configuration files needed to run the 4-bit dense-only non-uniform quantization scheme outlined in the preprint.
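
To make the idea of non-uniform quantization concrete, here is a minimal pure-Python sketch. It is not the code in this PR and not the SqueezeLLM implementation: it uses plain, unweighted 1-D k-means to place 16 signposts (4 bits), whereas the preprint describes a sensitivity-weighted variant, and all function names (`kmeans_1d`, `quantize`, `dequantize`) are hypothetical. It only illustrates how each weight becomes a 4-bit index into a 16-entry lookup table, and how that table compares with an evenly spaced grid on synthetic Gaussian "weights".

```python
import random


def nearest(centroids, v):
    """Index of the centroid closest to the scalar v."""
    return min(range(len(centroids)), key=lambda j: abs(centroids[j] - v))


def kmeans_1d(values, k, iters=20):
    """Plain (unweighted) Lloyd's k-means on scalars; returns k sorted centroids.

    Assumption: this stands in for the signpost search. The preprint's
    version weights each value by a sensitivity term, which is omitted here.
    """
    lo, hi = min(values), max(values)
    # Start from an evenly spaced grid, i.e. the uniform-quantization levels.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        counts = [0] * k
        for v in values:
            j = nearest(centroids, v)
            sums[j] += v
            counts[j] += 1
        # Move each centroid to its cluster mean; leave empty clusters in place.
        centroids = [
            sums[j] / counts[j] if counts[j] else centroids[j] for j in range(k)
        ]
    return sorted(centroids)


def quantize(values, lut):
    """Map each value to the index (0..len(lut)-1) of its nearest signpost."""
    return [nearest(lut, v) for v in values]


def dequantize(codes, lut):
    """Recover approximate values with a table lookup."""
    return [lut[c] for c in codes]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic weights

lo, hi = min(weights), max(weights)
uniform_lut = [lo + (hi - lo) * i / 15 for i in range(16)]  # evenly spaced levels
nonuniform_lut = kmeans_1d(weights, 16)                     # data-dependent levels

codes = quantize(weights, nonuniform_lut)  # 4-bit indices, one per weight
uniform_err = mse(weights, dequantize(quantize(weights, uniform_lut), uniform_lut))
nonuniform_err = mse(weights, dequantize(codes, nonuniform_lut))

print(f"uniform MSE:     {uniform_err:.6f}")
print(f"non-uniform MSE: {nonuniform_err:.6f}")
```

Because the k-means search starts from the uniform grid and each Lloyd iteration cannot increase the squared error, the non-uniform table is never worse than the uniform one on the data it was fitted to. In an actual kernel the lookup would happen per channel on the GPU; this sketch only shows the arithmetic.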