SqueezeLLM Support #1326

Merged
merged 19 commits into from
Oct 22, 2023

Conversation

chooper1
Contributor

This PR adds support for the SqueezeLLM quantization method, described in the preprint https://arxiv.org/abs/2306.07629, with open-source GPU inference and quantization code available at https://github.com/SqueezeAILab/SqueezeLLM. SqueezeLLM is a post-training quantization framework that allows for high-accuracy and runtime-efficient quantization at low bit precision. It uses non-uniform quantization to better represent the underlying weight distribution by shifting the quantization signposts to their optimal positions. This PR contains the kernels and quantization configuration files needed to run the 4-bit dense-only non-uniform quantization scheme outlined in the preprint.
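
To illustrate what "shifting the quantization signposts" means, here is a minimal NumPy sketch, not the code in this PR. It assumes one 16-entry lookup table per weight row and uses plain, unweighted 1-D k-means started from a uniform grid; the preprint's method differs in its details, and the function names below are hypothetical.

```python
import numpy as np

def nearest_codes(row, lut):
    """Index of the nearest lookup-table entry for every weight."""
    return np.abs(row[:, None] - lut[None, :]).argmin(axis=1).astype(np.uint8)

def uniform_lut(row, k=16):
    """Baseline: k evenly spaced levels between the row's min and max."""
    return np.linspace(row.min(), row.max(), k)

def kmeans_lut(row, k=16, iters=20):
    """Plain (unweighted) 1-D k-means, started from the uniform grid.

    Assumption: this stands in for the clustering step only as an
    illustration; it is not the algorithm from the preprint.
    """
    lut = uniform_lut(row, k)
    for _ in range(iters):
        codes = nearest_codes(row, lut)
        for j in range(k):
            members = row[codes == j]
            if members.size:  # empty clusters keep their old position
                lut[j] = members.mean()
    return np.sort(lut)

rng = np.random.default_rng(0)
row = rng.standard_normal(4096).astype(np.float32)  # stand-in for one weight row

u_lut = uniform_lut(row)
lut = kmeans_lut(row)
codes = nearest_codes(row, lut)          # 4-bit codes, values 0..15
u_codes = nearest_codes(row, u_lut)

nonuniform_mse = float(np.mean((row - lut[codes]) ** 2))
uniform_mse = float(np.mean((row - u_lut[u_codes]) ** 2))
print(f"uniform MSE: {uniform_mse:.5f}  non-uniform MSE: {nonuniform_mse:.5f}")
```

Because the clustering starts from the uniform grid and each k-means step cannot increase the squared error, the non-uniform table reconstructs the row at least as well as the uniform one at the same bit width.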

@WoosukKwon
Collaborator

Hi @chooper1, thanks for submitting the PR! Before getting into review, could you check the code format? Please run the following and push the changes:

pip install -r requirements-dev.txt
./format.sh

@WoosukKwon WoosukKwon self-requested a review October 12, 2023 07:22
@casper-hansen
Contributor

casper-hansen commented Oct 12, 2023

This is super interesting work, especially after the release of the quantization code to produce newly quantized models.

I am curious whether Woosuk or the author could run benchmarks/benchmark_throughput.py to compare the throughput of FP16 versus SqueezeLLM?

EDIT: I am getting very low throughput at low batch sizes, around 14.35 tokens/s. Is this the expected performance?

Collaborator

@WoosukKwon WoosukKwon left a comment

@chooper1 Sorry for the late review. The PR looks good to me! I've updated it with the latest main branch and modified QuantizationConfig as I found that the original interface of QuantizationConfig was overfitted to AWQ. Thanks again for the great work!

As the next step, I hope we can see more SqueezeLLM models, especially Mistral and Falcon. Also, please consider optimizing the matmul kernel.
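
For context on what such a kernel has to compute, below is a rough NumPy reference for a 4-bit lookup-table matrix-vector product. It is a sketch under assumptions, not the CUDA kernel merged here: the packing layout (two codes per byte, low nibble first), the per-row lookup table, and all function names are hypothetical.

```python
import numpy as np

def pack_4bit(codes):
    """Pack pairs of 4-bit codes (0..15) into bytes, low nibble first."""
    codes = codes.astype(np.uint8)
    return (codes[:, 0::2] | (codes[:, 1::2] << 4)).astype(np.uint8)

def unpack_4bit(packed):
    """Inverse of pack_4bit: recover one code per weight."""
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    codes[:, 0::2] = packed & 0x0F
    codes[:, 1::2] = packed >> 4
    return codes

def lut_matvec(packed, luts, x):
    """Compute y = W @ x where W[i, j] = luts[i, code[i, j]]."""
    codes = unpack_4bit(packed).astype(np.intp)
    # Dequantize: replace each code by its row's lookup-table entry.
    weights = np.take_along_axis(luts, codes, axis=1)
    return weights @ x

rng = np.random.default_rng(0)
rows, cols = 8, 32                                    # cols must be even
codes = rng.integers(0, 16, size=(rows, cols), dtype=np.uint8)
luts = np.sort(rng.standard_normal((rows, 16)), axis=1).astype(np.float32)
x = rng.standard_normal(cols).astype(np.float32)

packed = pack_4bit(codes)
y = lut_matvec(packed, luts, x)

# Dense reference built directly from the unpacked codes.
dense = luts[np.arange(rows)[:, None], codes]
y_ref = dense @ x
```

A GPU kernel would typically fuse the unpack, table lookup, and multiply-accumulate steps instead of materializing the dense weights as this reference does.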

@WoosukKwon WoosukKwon merged commit 1f24755 into vllm-project:main Oct 22, 2023
2 checks passed
skrider pushed a commit to skrider/vllm that referenced this pull request Oct 27, 2023
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
@gesanqiu
Contributor

gesanqiu commented Nov 8, 2023

Hi @casper-hansen, there seem to be some issues in SqueezeLLM-gradients. Have you produced gradients with it for Llama-2-13B? I changed _model.set_devices() to _model.cuda() and set _model.num_linear_layers to 40, but I still hit an OOM error even with two A40 (48GB) GPUs.

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
sjchoi1 pushed a commit to casys-kaist-internal/vllm that referenced this pull request May 7, 2024
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
4 participants