Loading quantized models #392

Closed
@abhinavkulkarni

Description

Hi,

Is there a way to load quantized models using vLLM? For example, I have been using AWQ quantization and have released a few models here.

The model loading process looks like the following:

  1. Model is first initialized with empty weights
  2. Linear layers are replaced by a custom linear layer that supports zero-point quantization
  3. Weights are loaded from a checkpoint onto a GPU

Please note that the matrix multiplication inside the linear layer is done by a Python extension that uses custom CUDA kernels, since AWQ uses 4-bit quantization. Everything else (the attention mechanism, etc.) is the same.
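
For concreteness, here is a minimal sketch of that flow, assuming the `accelerate` loading utilities and a placeholder `WQLinear` class standing in for the real AWQ CUDA-backed layer (the paths and the class internals below are illustrative, not the released implementation):

```python
import torch
import torch.nn as nn
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM


class WQLinear(nn.Module):
    """Placeholder for the AWQ zero-point-quantized linear layer."""

    def __init__(self, w_bit, group_size, in_features, out_features, bias=True):
        super().__init__()
        pack = 32 // w_bit  # how many w_bit values fit in one int32
        # 4-bit weights packed into int32, plus per-group scales and zero points.
        self.register_buffer("qweight", torch.zeros(in_features, out_features // pack, dtype=torch.int32))
        self.register_buffer("scales", torch.zeros(in_features // group_size, out_features, dtype=torch.float16))
        self.register_buffer("qzeros", torch.zeros(in_features // group_size, out_features // pack, dtype=torch.int32))
        self.register_buffer("bias", torch.zeros(out_features, dtype=torch.float16) if bias else None)

    def forward(self, x):
        # The real layer dispatches to the custom 4-bit CUDA matmul kernel
        # from the compiled extension; this placeholder only models storage.
        raise NotImplementedError("replace with the AWQ CUDA kernel call")


def replace_linear(module, w_bit=4, group_size=128):
    """Step 2: recursively swap nn.Linear for the quantized linear layer."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, WQLinear(w_bit, group_size,
                                           child.in_features, child.out_features,
                                           bias=child.bias is not None))
        else:
            replace_linear(child, w_bit, group_size)


config = AutoConfig.from_pretrained("path/to/awq-model")  # hypothetical path

# Step 1: initialize the model with empty (meta-device) weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Step 2: replace the linear layers with their quantized counterparts.
replace_linear(model)

# Step 3: load the quantized checkpoint directly onto the GPU.
model = load_checkpoint_and_dispatch(
    model, "path/to/awq-model/pytorch_model.bin", device_map={"": "cuda:0"}
)
```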

The model otherwise supports all the HuggingFace AutoModelForCausalLM APIs.
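
Since the standard interface is preserved, generation works the usual way once the quantized weights are loaded; a minimal usage sketch, assuming a hypothetical tokenizer path and that `model` from the sketch above has its placeholder layers backed by the real CUDA kernels:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/awq-model")  # hypothetical path
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")

# `model` is the fully loaded quantized model from the sketch above.
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```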

Thanks!
