Hi,
Is there a way to load quantized models using vLLM? For example, I have been using AWQ quantization and have released a few models here.
The model loading process looks like the following:
- The model is first initialized with empty weights
- Linear layers are replaced with a custom linear layer that supports zero-point quantization
- Weights are loaded from the checkpoint onto the GPU
Please note that the matrix multiplication inside the linear layer is done by a Python extension that uses custom CUDA kernels, since AWQ uses 4-bit quantization. Everything else (the attention mechanism, etc.) is the same.
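For reference, here is a minimal sketch of that loading flow. It assumes PyTorch >= 2.1 (for `load_state_dict(..., assign=True)`); `WQLinear`, its tensor layout, the model ID, and the checkpoint path are illustrative placeholders rather than the actual AWQ package API, and the 4-bit GEMM kernel itself is omitted:

```python
import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModelForCausalLM


class WQLinear(nn.Module):
    """Illustrative stand-in for a 4-bit zero-point quantized linear layer."""

    def __init__(self, in_features, out_features, bias=True, group_size=128):
        super().__init__()
        # Packed int4 weights plus per-group scales and zero points
        # (the exact packing layout here is illustrative).
        self.register_buffer("qweight", torch.empty(in_features // 8, out_features, dtype=torch.int32))
        self.register_buffer("scales", torch.empty(in_features // group_size, out_features, dtype=torch.float16))
        self.register_buffer("qzeros", torch.empty(in_features // group_size, out_features // 8, dtype=torch.int32))
        self.bias = nn.Parameter(torch.empty(out_features, dtype=torch.float16)) if bias else None

    def forward(self, x):
        # The real implementation dispatches to a custom CUDA kernel that
        # dequantizes and multiplies in one pass; omitted in this sketch.
        raise NotImplementedError("4-bit GEMM kernel goes here")


def replace_linears(module):
    """Recursively swap every nn.Linear for the quantized variant."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, WQLinear(child.in_features, child.out_features, child.bias is not None))
        else:
            replace_linears(child)


# 1. Initialize the model skeleton with empty (meta) weights, then swap in the
#    quantized linear layers. (accelerate's init_empty_weights is the usual
#    alternative to the meta-device context used here.)
config = AutoConfig.from_pretrained("some-org/some-awq-model")  # placeholder id
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)
    replace_linears(model)

# 2. Load the quantized checkpoint (packed weights, scales, zeros) straight
#    onto the GPU, assigning the tensors in place of the meta ones.
state_dict = torch.load("awq-checkpoint.pt", map_location="cuda")  # placeholder path
model.load_state_dict(state_dict, strict=False, assign=True)
model.eval()
```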
The model otherwise supports all the HuggingFace AutoModelForCausalLM APIs.
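So, once the real kernels are plugged in, generation would look the same as with any other HF causal LM (the tokenizer and model IDs below are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-org/some-awq-model")  # placeholder id
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```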
Thanks!