Using Quantized Models with vLLM CPU Backend #134

@miracatici

Description

Describe the bug
When I start the vLLM OpenAI-compatible server with Meta Llama 3.1 8B Instruct, it works:

python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct

However, when I run the same command with neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8, it raises an error. The only difference I see in the server logs is the model config reporting quantization=compressed-tensors:

python3 -m vllm.entrypoints.openai.api_server --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8
Exception in worker VllmWorkerProcess while processing method load_model: , Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 217, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 125, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 341, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 174, in _initialize_model
    quant_config=_get_quantization_config(model_config, load_config),
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 98, in _get_quantization_config
    capability = current_platform.get_device_capability()
  File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/interface.py", line 28, in get_device_capability
    raise NotImplementedError
NotImplementedError
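
For context, here is a minimal sketch of the failing path as I read it from the traceback (simplified stand-ins, not the actual vLLM source): _get_quantization_config only runs when the checkpoint declares quantization, and it then asks the current platform for a GPU compute capability, which the CPU platform never implements.

from dataclasses import dataclass
from typing import Optional

class CpuPlatform:
    # Stand-in for vllm/platforms/interface.py: the base class raises,
    # and the CPU platform does not override it.
    def get_device_capability(self):
        raise NotImplementedError

@dataclass
class ModelConfig:
    quantization: Optional[str] = None

def _get_quantization_config(model_config, platform):
    # Only reached when the checkpoint declares quantization, which is
    # why the unquantized model loads while the w8a8 one crashes on CPU.
    if model_config.quantization is None:
        return None
    return platform.get_device_capability()  # raises NotImplementedError on CPU

_get_quantization_config(ModelConfig(quantization="compressed-tensors"), CpuPlatform())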

Expected behavior
The vLLM OpenAI-compatible server should start and serve requests, the same as with the uncompressed original Llama 3.1 8B Instruct model.
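
For reference, with the unquantized model the server comes up and serves requests like this (assuming the default port 8000 and the openai Python client; the API key is a dummy placeholder). I would expect the same to work with the quantized checkpoint:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="Hello, my name is",
    max_tokens=16,
)
print(completion.choices[0].text)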

Environment
  1. OS: Ubuntu 22.04.1
  2. Python version: 3.10.12
  3. ML framework version(s): torch 2.4.0+cpu
  4. Other Python package versions: vLLM 0.5.5+cpu

To Reproduce

  1. Start an Ubuntu 22.04 Docker container from the official image
  2. Install the vLLM CPU backend from source, as described in the docs
  3. Run the server command above (or the minimal offline repro sketched after this list)
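
For a quicker repro without the server, loading the checkpoint through vllm.LLM should exercise the same load_model path (untested sketch):

from vllm import LLM

# Loading the quantized checkpoint goes through the same load_model path
# and should raise the NotImplementedError above on the CPU backend; the
# unquantized model id loads fine.
llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8")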
