[Doc]: AutoAWQ quantization example fails #7717
Comments
So the vLLM example probably needs to be updated to use one that actually works.
I have filed a PR to fix the |
Can you post a PR with the change? |
@stas00 AWQ is great, BTW. However, if you have high-QPS or offline workloads, I would suggest using activation quantization to get the best performance. With activation quantization, we can use the lower-bit tensor cores, which have 2x the FLOPs. This means we can accelerate the compute-bound regime (which becomes the bottleneck). 4-bit AWQ will still get the best possible latency in very low QPS regimes (e.g. QPS = 1), but outside of that, activation quantization will dominate. Some benchmarks analyzing this result are in this blog: Here are some examples of how to make activation-quantized models for vLLM:
I figured this might be useful for you. |
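The trade-off described above can be sketched with a simple roofline-style estimate, in which one decode step takes as long as the slower of its compute and its weight-loading. This is only a rough illustration: the hardware numbers and model size are hypothetical placeholders, `step_time` is a made-up helper, and the model ignores KV-cache traffic, dequantization overhead, and kernel efficiency. It assumes weight-only 4-bit quantization computes at FP16 rate with 0.5 bytes per weight, while 8-bit weight-and-activation quantization computes at twice that rate with 1 byte per weight.

```python
# Rough roofline sketch; every constant below is a hypothetical placeholder.
PARAMS = 7e9          # assumed parameter count
FP16_PEAK = 4e14      # assumed FP16 tensor-core FLOP/s
BANDWIDTH = 2e12      # assumed memory bandwidth in bytes/s


def step_time(params, batch_tokens, bytes_per_weight, peak_flops, mem_bandwidth):
    """Estimate one decode step as max(compute time, weight-load time)."""
    compute_time = 2.0 * params * batch_tokens / peak_flops
    memory_time = params * bytes_per_weight / mem_bandwidth
    return max(compute_time, memory_time)


for batch in (1, 1024):
    # Weight-only 4-bit: small weights, but FP16-rate compute.
    weight_only = step_time(PARAMS, batch, 0.5, FP16_PEAK, BANDWIDTH)
    # 8-bit weights and activations: larger weights, 2x compute rate.
    act_quant = step_time(PARAMS, batch, 1.0, 2 * FP16_PEAK, BANDWIDTH)
    print(f"batch={batch}: weight-only={weight_only:.5f}s act-quant={act_quant:.5f}s")
```

Under these assumptions the weight-only estimate is lower at a batch of 1 (memory-bound) and the activation-quantized estimate is lower at a large batch (compute-bound). Real crossover points depend on the hardware and kernels, so treat this as intuition rather than a benchmark.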
done: #7937 I'd love to experiment with your suggestions, Robert. Do I need to use your fork for that? But first I need to figure out how to measure performance reliably so that I can measure the impact; as I reported in #7935, it currently doesn't scale when using the OpenAI client. What benchmarks do you use to compare the performance of various quantization techniques? Thank you! |
Nope, you do not need the fork. These methods are all supported in vLLM. re: OpenAI performance. Nick and I are working on it |
📚 The doc issue
The quantization example at https://docs.vllm.ai/en/latest/quantization/auto_awq.html can't be run - it looks like AWQ is looking for safetensors files and https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main doesn't have them.
autoawq=0.2.6
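A quick way to check the suspicion above before starting a quantization run is to look at a repository's file listing for `.safetensors` files. The sketch below is a minimal illustration: `has_safetensors` is a hypothetical helper, and the two listings are invented examples rather than the real contents of any repository. A real listing could be obtained, for example, with `huggingface_hub.list_repo_files(repo_id)`.

```python
from typing import Iterable


def has_safetensors(filenames: Iterable[str]) -> bool:
    """Return True if any file in the listing is a .safetensors weight file."""
    return any(name.endswith(".safetensors") for name in filenames)


# Invented listings for illustration only.
bin_only = ["config.json", "pytorch_model-00001-of-00002.bin", "tokenizer.model"]
with_safetensors = ["config.json", "model-00001-of-00002.safetensors"]

print(has_safetensors(bin_only))
print(has_safetensors(with_safetensors))
```

This only tells you whether such files are present; it does not confirm how AutoAWQ decides which weight files to load.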
Suggest a potential alternative/fix
I tried another model that has .safetensors files but then it fails with:
I see that this example has been copied from https://github.com/casper-hansen/AutoAWQ?tab=readme-ov-file#examples and it's identical and broken at the source.
edit: I think the issue is the `datasets` version - I'm able to run this version https://github.com/casper-hansen/AutoAWQ/blob/6f14fc7436d9a3fb5fc69299e4eb37db4ee9c891/examples/quantize.py with `datasets==2.21.0`
the version from https://docs.vllm.ai/en/latest/quantization/auto_awq.html still fails as explained above.
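Since the outcome seems to depend on package versions, it can help to include the installed versions when reproducing or reporting this. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the package names are the ones mentioned in this issue.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from this issue; the output depends on the local environment.
print(installed_versions(["autoawq", "datasets"]))
```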