Loading quantized models #392
Hi @abhinavkulkarni, thanks for exploring vLLM and requesting the feature! For now, vLLM doesn't support quantized models. But as you mentioned, we believe it is not very difficult to add to vLLM. We will definitely look into it after finishing other urgent issues (e.g., Falcon support and bug fixes). Besides, we are looking forward to contributions from the community! Please make a PR if you have time to implement it.
Hello, I am considering adding support for 4/8-bit quantization to vLLM. What specifically would have to be done to implement this feature? Can I emulate the on-the-fly weight quantization done in the FastChat repo? https://github.com/lm-sys/FastChat/blob/main/fastchat/model/compression.py
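For reference, the FastChat approach is conceptually just per-channel weight quantization applied at load time, with dequantization (or a low-bit matmul) in the forward pass. Below is a minimal sketch of that idea in plain PyTorch with int8; it is not FastChat's actual code, and the function names are made up for illustration:

```python
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Quantize a 2-D weight tensor to int8 with one scale per output channel."""
    # Symmetric quantization: the per-row max magnitude maps to 127.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate floating-point weight for the forward pass."""
    return q.to(scale.dtype) * scale

# Quantize once at load time, keep (q, scale), and dequantize (or run an
# int8 matmul kernel) inside the linear layer's forward.
w = torch.randn(4096, 4096)
q, s = quantize_per_channel_int8(w)
w_hat = dequantize_int8(q, s)
print((w - w_hat).abs().max())  # small reconstruction error
```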
@johnnysmithy123467: It would be best if the models are loaded upfront before they are passed to vLLM. Usually, quantized models modify the structure of the base model by replacing its linear layers with quantized counterparts. You can see how GPTQ-quantized models are loaded here and AWQ-quantized models here.
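To illustrate the kind of model surgery this implies, here is a rough sketch in PyTorch. `QuantLinear` is a hypothetical stand-in for a library-specific module such as GPTQ's or AWQ's quantized linear, and the unpacked int8 storage is a simplification of the real packed low-bit formats:

```python
import torch
import torch.nn as nn

class QuantLinear(nn.Module):
    """Stand-in for a library-specific quantized linear (e.g. a GPTQ/AWQ module)."""
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        # Real implementations store packed low-bit weights plus scales/zeros;
        # buffers are used so they load cleanly from the quantized checkpoint.
        self.register_buffer("qweight", torch.empty(out_features, in_features, dtype=torch.int8))
        self.register_buffer("scales", torch.empty(out_features, 1))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.qweight.to(x.dtype) * self.scales.to(x.dtype)  # dequantize
        return torch.nn.functional.linear(x, w, self.bias)

def replace_linears(module: nn.Module) -> None:
    """Recursively swap nn.Linear layers for QuantLinear before loading weights."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name,
                    QuantLinear(child.in_features, child.out_features,
                                bias=child.bias is not None))
        else:
            replace_linears(child)
```

The attention and KV-cache code is untouched; only the linear layers change.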
@TheBloke What would it take to add bitsandbytes support? I'm considering implementing it myself and just want to know how I would go about doing it. It seems the weight-loading schemes are very customized in vLLM.
I like the theoretical speedup, but the problem is that bitsandbytes does not support serialization/deserialization of 4-bit models. Until that is resolved, it will not be great for production usage.
@zhuohan123 Support for bnb-nf4 would be amazing, especially for larger models. If you can point me in the right direction, I'd be happy to take a crack at implementing it.
-1 for BNB and +1 for AWQ. I will just let my test results speak for themselves. This is for the model:

System: 1 x RTX A6000, 4 vCPU, 61 GB RAM (RunPod Community Cloud)
Versions: torch 2.0.1, bitsandbytes 0.41.0, transformers 4.31.0, and AWQ compiled from main.

CC @TheBloke and @WoosukKwon
117 tokens/s now on a 4090 with the MPT-7B model.
Was thinking of maybe working on this myself. My understanding is that I only need to replace the linear layers and I don't need to touch the attention code. It's a lot simpler to implement if I don't need to get into the attention code with its custom C++ kernels and paged attention. Can someone confirm my understanding there? I was going to use GPTQ and use the text-generation-inference library as a template for how to implement this.
@ri938 That sounds right. Let us know how it goes; eager to hear from you. At least someone's taking a crack at it!
Update on this: I have a hacky proof-of-concept quantisation working which reduces memory use, and inference quality looks high. However, inference speed is much slower, even with a single batch size. I tried the default and exllama kernels from the TGI repo, but both ran at about 60% of the inference speed of the fp16 models. It's too slow for my particular use case, and I think it's important to get it to inference speeds comparable to fp16. One option is that I could open a PR for quantisation and then allow others with more experience in CUDA optimisations to contribute to it. Otherwise I will probably continue to work on this.
GPTQ is pretty slow in general unless you build hyper-optimized kernels on top. Did you try AWQ? It might give you better results.
Not tried AWQ. But for GPTQ I did use the kernels from text-generation-inference (including a port of exllama), so I thought this should give good performance.
AWQ is much faster than GPTQ. I feel it could be optimised a lot more and plan to give this a try. Performance declines with larger batch sizes faster than the fp16 version, which means I can only reach half the batch size I could with FP16 before hitting peak throughput. Think I'll maybe make a merge request and others with more experience with CUDA can continue to optimise it further.
Hey @ri938, I'd be happy to contribute to your repository if you'd like.
We need to add support for quantized models in the vLLM project. We need this to run a quantized LLaMA model via vLLM. This involves implementing quantization techniques to optimize memory usage and runtime performance. A reward of $500 will be granted to the contributor who successfully completes this task.
Here is my work on this so far. Some TODOs and cleanup are still required. I tried lots of different quantization methods and AWQ performed the best. WIP: if anyone wants to contribute more, they can send a MR. An initial review and some comments on what we still need to do to merge this would be appreciated. @WoosukKwon
I think there is lots of room for optimization because quantization scales poorly with batch size. But I reckon that's a separate issue.
My take on the issue is that quantization is a useful and important feature for LLMs, as it can enable faster and more efficient inference of such large language models. I think it would be beneficial to add support for different quantization techniques, such as bnb_nf4, GPTQ, and AWQ, and allow users to choose the best one for their use case. I also think it would be helpful to provide some documentation and examples on how to use quantized models with vLLM. I appreciate the efforts of the contributors who are working on this issue and I hope they can successfully complete it soon.
Have you guys looked at OmniQuant? I think it would be much easier to integrate OmniQuant into vLLM than GPTQ or AWQ.
OmniQuant paper for reference: https://arxiv.org/abs/2308.13137
So, just in case you missed the exciting news, vLLM now proudly supports AWQ 🎉.
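For anyone landing here, loading an AWQ checkpoint in vLLM looks roughly like this (the model name is only an example, and flags may differ across vLLM versions):

```python
from vllm import LLM, SamplingParams

# The checkpoint must already be AWQ-quantized; vLLM does not quantize on the fly.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```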
I conducted performance tests on codellama-13B-AWQ, and the results are as follows. Why is this? Looking forward to your reply very much!
The README of AutoAWQ explains this issue: AWQ does not do well in large-context or large-batch scenarios. If you try more test cases, you will also find that the prefill latency of the AWQ model is much larger than that of the FP16 model.
Thank you very much for your reply, I got it!
@shatealaboxiaowang Right now I have two directions:
We have just released the per-token w8a8 method in #1508 and now it is fully usable.
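For readers unfamiliar with the scheme: per-token w8a8 means activations get one int8 scale per token (row) and weights one scale per output channel. The following is a reference sketch of the math only, not the kernel from #1508:

```python
import torch

def per_token_quant(x: torch.Tensor):
    """Activations [tokens, hidden] -> int8 with one scale per token (row)."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x, qweight, w_scale):
    """Reference w8a8 matmul: int8 activations x int8 weights, rescaled to float.
    A real kernel runs the int8 GEMM on tensor cores; this only shows the math."""
    qx, x_scale = per_token_quant(x)
    acc = qx.to(torch.float32) @ qweight.to(torch.float32).t()  # int32 accum in practice
    return acc * x_scale * w_scale.t()                          # undo both scales

# Weights are quantized offline, one scale per output channel.
w = torch.randn(1024, 4096)
w_scale = w.abs().amax(dim=1, keepdim=True) / 127.0
qweight = torch.clamp(torch.round(w / w_scale), -128, 127).to(torch.int8)

x = torch.randn(8, 4096)  # 8 tokens
print((w8a8_linear(x, qweight, w_scale) - x @ w.t()).abs().max())
```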
Is there any information on when vLLM might support bnb quantization?
@shatealaboxiaowang Hi! Do I need to pass an already-quantized model, or will vLLM quantize it while loading?
Hi @KobanBanan, you need to quantize the model first, then load it with vLLM following your quantization configuration.
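A rough end-to-end sketch of that workflow, assuming AutoAWQ for the offline quantization step (API names as shown in the AutoAWQ examples at the time; double-check against the current docs):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # example base model
quant_path = "llama-2-7b-awq"            # where the quantized weights go
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# 1) Quantize offline with AutoAWQ.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# 2) Serve the quantized checkpoint with vLLM, matching the quantization method.
from vllm import LLM
llm = LLM(model=quant_path, quantization="awq")
```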
Hi,

Is there a way to load quantized models using vLLM? For e.g., I have been using AWQ quantization and have released a few models here. The model loading process looks like the following:
Please note that the matrix multiplication inside the linear layer is done by a Python extension that uses custom CUDA kernels, since AWQ uses 4-bit quantization. Everything else (the attention mechanism, etc.) is the same.
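For readers unfamiliar with what those kernels do, here is a pure-PyTorch reference of the dequantize-then-matmul that they fuse. The packing layout and names below are illustrative, not AWQ's actual format:

```python
import torch

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Unpack two 4-bit values per uint8 byte into an [out, in] integer tensor."""
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return torch.stack((low, high), dim=-1).flatten(-2).to(torch.int32)

def awq_style_linear(x, packed_w, scales, zeros):
    """Reference forward: dequantize 4-bit weights, then a regular matmul.
    Real AWQ kernels fuse these steps in CUDA instead of materializing w."""
    w_int = unpack_int4(packed_w)          # [out_features, in_features]
    w = (w_int - zeros) * scales           # per-channel (or per-group) scale/zero
    return x @ w.t().to(x.dtype)
```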
The model otherwise supports all the HuggingFace AutoModelForCausalLM APIs. Thanks!