8bit support #295
Hi, will vLLM support 8-bit quantization, like https://github.com/TimDettmers/bitsandbytes? In HF, we can run a 13B LLM on a 24 GB GPU with `load_in_8bit=True`.
Although PagedAttention can save about 25% of GPU memory, a 13B model in fp16 still needs roughly 26 GB for its weights alone (13B parameters × 2 bytes), so it has to be deployed on a GPU with at least 26 GB. In the cloud, a V100-32G is more expensive than an A5000-24G 😭
Is there any way to reduce GPU memory usage? 😭
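For reference, the HF loading path mentioned above looks roughly like this (a minimal sketch; the model id is illustrative, and newer transformers releases route `load_in_8bit` through `BitsAndBytesConfig`):

```python
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"  # illustrative 13B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)

# load_in_8bit quantizes the linear-layer weights to int8 at load time,
# roughly halving weight memory vs. fp16 (~13 GB instead of ~26 GB for 13B).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,
)
```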
Comments

Same as #214.
Would love to see bitsandbytes integration to load models in 8- and 4-bit quantized mode.

Quantization support is crucial; 8- and 4-bit support is a must.
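What these comments ask vLLM to mirror is roughly the bitsandbytes path that HF transformers already exposes; a hedged sketch of the 4-bit variant (checkpoint and parameter values are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weight quantization with fp16 compute, provided by bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.5",  # illustrative checkpoint
    device_map="auto",
    quantization_config=bnb_config,
)
```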
When I use fastchat-vllm to run inference on vicuna-13B, it takes 75 GB of GPU memory (A800, 80 GB). Why? @mymusise

@cabbagetalk you can add …
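The 75 GB figure on an 80 GB A800 is expected behavior rather than a leak: vLLM deliberately preallocates a large fraction of GPU memory for the paged KV cache, controlled by the `gpu_memory_utilization` argument (default 0.9). A minimal sketch of capping it (model id is illustrative):

```python
from vllm import LLM

# Cap vLLM's total GPU memory use (weights + preallocated KV cache)
# at ~50% of the card instead of the default ~90%.
llm = LLM(
    model="lmsys/vicuna-13b-v1.5",  # illustrative checkpoint
    gpu_memory_utilization=0.5,
)
```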
This would be especially useful for running the new …

Nice to see a familiar face here!
Does vLLM support it now?
Any fix for this issue? |
Hi guys, do you have a plan to support it?

Any fix for integrating bitsandbytes?
No fix yet, but the feature request is #4033.
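While the bitsandbytes integration remained an open request, vLLM did gain support for other weight-quantization schemes such as AWQ, which already lets a 13B model fit on a 24 GB GPU. A minimal sketch using a pre-quantized checkpoint (the model id is illustrative):

```python
from vllm import LLM, SamplingParams

# AWQ 4-bit weights shrink a 13B model enough to fit on a 24 GB GPU.
llm = LLM(
    model="TheBloke/vicuna-13B-v1.5-AWQ",  # illustrative pre-quantized checkpoint
    quantization="awq",
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is PagedAttention?"], params)
print(outputs[0].outputs[0].text)
```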