Support W8A8 inference in vllm #1508
Conversation
Hi, you can try this branch https://github.com/AniZpZ/vllm/tree/vllmq.
Hi @AniZpZ @HandH1998, thanks for your great work. I'm curious about the difference between
@Hongbosherlock,
thanks, I have sent you an email, please check it.
by the way, in AWQ they smooth the attention output projection:

```python
# attn out
# Please refer to https://github.com/mit-han-lab/llm-awq/pull/67#issue-1850622696
if module.self_attn.v_proj.weight.shape == module.self_attn.o_proj.weight.shape:
    scales_list.append(_auto_get_scale(
        prev_op=module.self_attn.v_proj,
        layers=[module.self_attn.o_proj],
        inp=input_feat['self_attn.o_proj'],
    ))
```

but in SmoothQuant (llama-dev) it is not handled the same way. What is the consideration behind this?
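For readers following along, here is a minimal, generic sketch of the scale-migration identity that both AWQ and SmoothQuant build on; the function, tensor names, and the `alpha` default are illustrative assumptions, not code from either project or from this PR. The point is that dividing activations by a per-channel scale and folding the same scale into the following linear layer's weights leaves the product unchanged while making the activations easier to quantize.

```python
import torch

def smooth_pair(x, w, alpha=0.5):
    """Fold per-channel activation scales into the next linear layer's weights.

    x: (tokens, in_features) calibration activations feeding the layer
    w: (out_features, in_features) weight of that layer
    Returns (x_hat, w_hat) such that x_hat @ w_hat.T == x @ w.T.
    """
    act_max = x.abs().amax(dim=0)           # per-input-channel activation range
    w_max = w.abs().amax(dim=0)             # per-input-channel weight range
    s = (act_max.pow(alpha) / w_max.pow(1 - alpha)).clamp(min=1e-5)
    return x / s, w * s                     # outliers migrate from activations into weights
```

Which layer pairs this identity is applied to (e.g. v_proj into o_proj, as in the AWQ snippet above) is exactly the design choice being asked about here.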
This is really nice work! I am trying to reproduce the W8A8 results. After using AutoSmoothQuant to export an int8 llama-2-7b model and modifying the "architectures" field in config.json from "Int8LlamaForCausalLM" to "LlamaForCausalLM", I load the model:
in vLLM. I get this error:
I manually checked the params_dict and cannot find this "quant_scale". Any idea how to fix this? Many thanks ahead!
@MingLin-home What is your quant_config.json when running AutoSmoothQuant? It should be
Thanks for the quick reply! Using this quant config, the error is gone. Great work!
@AniZpZ We really like this PR! Do you have any plan to re-implement it in Triton? We are willing to re-implement a Triton version, as it is more user-friendly. Many thanks ahead!
We currently do not have plans to implement a Triton version. However, we do have a repository at https://github.com/AniZpZ/AutoSmoothQuant that facilitates easy quantization for LLMs. We would greatly appreciate it if you could reimplement a Triton version and contribute to our repo! |
Thanks @AniZpZ! We are using AutoSmoothQuant to quantize the model and will keep you posted on our Triton project. Please let me know if you are aware of similar efforts so that we can join development together.
========== update ==========

========== old context =======

I was able to convert the Llama-2-7b model to int8 via AutoSmoothQuant. Loading and inference in vLLM run without any RuntimeError. However, the generated text is filled with random, meaningless tokens. I further confirmed that the original fp16 model works correctly and generates meaningful text. To verify that the model converted via AutoSmoothQuant is good, I tested it with AutoSmoothQuant's test_model.py, and the output is human-readable. In short, it looks like the vLLM W8A8 model loading and/or inference is not working correctly.
@AniZpZ Hi, when I used dynamic_rope in the w8a8 branch to extend the context length of the model after SmoothQuant, the following error occurred. How should I solve it?
It seems that you should modify dynamic_rope to accept extra args for the dequantization operation.
Hi @AniZpZ, I'm starting to work on activation quantization (with a couple of other engineers at @neuralmagic) and wanted to point you to the RFC that I just posted, #3975. We've been working with this PR as a starting point, so we are interested in getting your feedback and collaborating if you're interested.
Hi @tlrmchlsmth Our team proposed this PR very early last year. Later, for ease of review, the KV cache INT8 and W8A8 parts were split into two PRs. But it seems that the vLLM team's focus is not on this, and the review and merging progress has been very slow. We will continue to make new quantization attempts such as W4A8 on vLLM, but at present our development of LLM serving for production environments has migrated to LMDeploy.
If this bothers anyone, forgive me.
@babaozhouy5 Thank you for addressing the issue. I think I understand how you fixed it. Change the
Sure! |
Hello, I am using your method for int8 quantization on a Mixtral model, but it reports an error. I am using the latest version of vLLM. How can I solve this?
Closing this PR as vLLM has supported INT8 W8A8 with custom CUTLASS kernels for a while now. See the documentation for pointers on how to find or make INT8 models! https://docs.vllm.ai/en/v0.5.5/quantization/int8.html |
We have implemented W8A8 inference in vLLM, which can achieve a 30% improvement in throughput. W4A16 quantization methods require weights to be dequantized to fp16 before compute, which leads to a throughput drop under heavier load. This PR is part of #1112. We have split the huge PR into two independent parts for easier review.
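To make the contrast concrete, here is a toy sketch of the two dataflows described above, written with plain PyTorch ops and per-tensor scales; the real kernels in this PR are fused CUDA kernels with finer-grained scales, so this is only an illustration of why W8A8 avoids the dequantize-before-GEMM step.

```python
import torch

x = torch.randn(8, 64)                      # activations
w = torch.randn(128, 64)                    # linear layer weights

# W4A16 path: int4 weights must be dequantized to fp16 before the matmul,
# so the GEMM itself still runs in fp16.
w_deq = w                                   # stand-in for dequantize(int4_weights)
y_w4a16 = x @ w_deq.T

# W8A8 path: both operands are quantized to int8 and the GEMM accumulates in a
# wide integer type (int32 in real kernels; int64 here so it runs on CPU),
# with the floating-point scales applied once at the end.
s_x = x.abs().max() / 127
s_w = w.abs().max() / 127
x_q = torch.round(x / s_x).clamp(-128, 127).to(torch.int8)
w_q = torch.round(w / s_w).clamp(-128, 127).to(torch.int8)
y_w8a8 = (x_q.long() @ w_q.long().T).float() * (s_x * s_w)
```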
The usage of W8A8 inference is simple (only llama is supported for now):
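The exact invocation for this branch is not reproduced here, so the following is only a hypothetical sketch of loading an AutoSmoothQuant-exported llama checkpoint through the standard vLLM Python API; the model path is a placeholder, and whether the branch needs an explicit quantization flag is an assumption, not something confirmed in this thread.

```python
from vllm import LLM, SamplingParams

# Placeholder path to a llama checkpoint exported by AutoSmoothQuant (int8 weights).
# Depending on the branch, an explicit quantization argument may also be required.
llm = LLM(model="/path/to/llama-2-7b-w8a8")

outputs = llm.generate(
    ["Explain W8A8 quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```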
17th Jan 2024 Updates!!!
We have released a repository for applying SmoothQuant to various models. You can export quantized model weights with it: AutoSmoothQuant
CUDA Graph is not compatible for now~
Updates!!!
We have updated the quant method to per-token quant for o_proj and down_proj of Llama. Please use the latest llama-dev branch of smoothquant and the per_token_quant branch of torch-int to generate int8 models!!!
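For reference, here is a minimal sketch of symmetric per-token int8 quantization (one scale per row of activations rather than one scale for the whole tensor); this is illustrative PyTorch, not the fused kernels used in the PR.

```python
import torch

def per_token_int8_quant(x: torch.Tensor):
    """Symmetric per-token (per-row) int8 quantization: one scale per token,
    which tracks activation outliers better for layers like o_proj / down_proj."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(x / scale).clamp(-128, 127).to(torch.int8)
    return q, scale

x = torch.randn(4, 4096)              # (tokens, hidden) activations
q, scale = per_token_int8_quant(x)
x_rec = q.float() * scale             # dequantize to check the round-trip error
print((x - x_rec).abs().max())
```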
You can find more details, such as how to generate int8 weights, in the original PR #1112.
You can use this method together with int8 KV cache quant (#1507) for the best throughput.