Using Quantized Models with vLLM CPU Backend #134

@miracatici

Description

Describe the bug
When I start the vLLM OpenAI-compatible server with Meta Llama 3.1 8B Instruct, it works:

python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct

However, when I run the same command with neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8, it raises an error. The only difference I see in the server logs is the model config reporting quantization=compressed-tensors:

python3 -m vllm.entrypoints.openai.api_server --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8
Exception in worker VllmWorkerProcess while processing method load_model: , Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 217, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 125, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 341, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 174, in _initialize_model
    quant_config=_get_quantization_config(model_config, load_config),
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 98, in _get_quantization_config
    capability = current_platform.get_device_capability()
  File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/interface.py", line 28, in get_device_capability
    raise NotImplementedError
NotImplementedError
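
For context, here is a minimal sketch of the failing path as I read it from the traceback (simplified stand-ins, not the actual vLLM source): _get_quantization_config only runs when the checkpoint declares quantization, and it then asks the current platform for a GPU compute capability, which the CPU platform never implements.

from dataclasses import dataclass
from typing import Optional

class CpuPlatform:
    # Stand-in for vllm/platforms/interface.py: the base class raises,
    # and the CPU platform does not override it.
    def get_device_capability(self):
        raise NotImplementedError

@dataclass
class ModelConfig:
    quantization: Optional[str] = None

def _get_quantization_config(model_config, platform):
    # Only reached when the checkpoint declares quantization, which is
    # why the unquantized model loads while the w8a8 one crashes on CPU.
    if model_config.quantization is None:
        return None
    return platform.get_device_capability()  # raises NotImplementedError on CPU

_get_quantization_config(ModelConfig(quantization="compressed-tensors"), CpuPlatform())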

Expected behavior
The vLLM OpenAI-compatible server should start and serve requests, the same as with the uncompressed original Llama 3.1 8B Instruct model.
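
For reference, with the unquantized model the server comes up and serves requests like this (assuming the default port 8000 and the openai Python client; the API key is a dummy placeholder). I would expect the same to work with the quantized checkpoint:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="Hello, my name is",
    max_tokens=16,
)
print(completion.choices[0].text)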

Environment
  1. OS: Ubuntu 22.04.1
  2. Python version: 3.10.12
  3. ML framework version(s): torch 2.4.0+cpu
  4. Other Python package versions: vLLM 0.5.5+cpu

To Reproduce

  1. Start an Ubuntu 22.04 Docker container from the official image
  2. Install the vLLM CPU backend from source, as described in the docs
  3. Run the server command above (or the minimal offline repro sketched after this list)
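
For a quicker repro without the server, loading the checkpoint through vllm.LLM should exercise the same load_model path (untested sketch):

from vllm import LLM

# Loading the quantized checkpoint goes through the same load_model path
# and should raise the NotImplementedError above on the CPU backend; the
# unquantized model id loads fine.
llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8")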
