Closed as not planned
Hi, will vLLM support 8-bit quantization, like https://github.com/TimDettmers/bitsandbytes?
In HF, we can run a 13B LLM on a 24GB GPU with `load_in_8bit=True`.
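
For reference, this is roughly how that HF path looks; a minimal sketch, requiring `pip install transformers accelerate bitsandbytes`, where the model name is just a placeholder for any ~13B causal LM:

```python
# Minimal sketch of loading a 13B model in 8-bit via HF + bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-13b-model"  # placeholder, not a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # quantize weights to INT8 via bitsandbytes
    device_map="auto",   # let accelerate place layers on the available GPU
)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```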
Although PagedAttention can save about 25% of GPU memory, we still have to deploy a 13B LLM on a GPU with at least 26GB, since the FP16 weights alone take 13B parameters × 2 bytes ≈ 26GB. The math is sketched below.
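
Back-of-the-envelope math behind those numbers, assuming 2 bytes per parameter in FP16 and 1 byte in INT8 (weights only; KV cache and activations come on top):

```python
# Rough weight-only memory estimate for a 13B-parameter model.
params = 13e9
print(f"FP16: {params * 2 / 1e9:.0f} GB")  # ~26 GB -> needs a V100-32G
print(f"INT8: {params * 1 / 1e9:.0f} GB")  # ~13 GB -> fits an A5000-24G
```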
In the cloud, a V100-32G is more expensive than an A5000-24G 😭
Is there any way to reduce GPU memory usage? 😭