
[Bug]: The new version (v0.5.4) cannot load the GPTQ model, but the old version (vllm==0.5.3.post1) can. #7240

Closed
@ningwebbeginner

Description

Your current environment

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.85+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               2
On-line CPU(s) list:                  0,1
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU @ 2.00GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   1
Socket(s):                            1
Stepping:                             3
BogoMIPS:                             4000.29
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            32 KiB (1 instance)
L1i cache:                            32 KiB (1 instance)
L2 cache:                             1 MiB (1 instance)
L3 cache:                             38.5 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0,1
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable; SMT Host state unknown
Vulnerability Meltdown:               Vulnerable
Vulnerability Mmio stale data:        Vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Vulnerable
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:             Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Vulnerable (Syscall hardening enabled)
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Vulnerable

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] optree==0.12.1
[pip3] pyzmq==24.0.1
[pip3] torch==2.3.1
[pip3] torchaudio==2.3.1+cu121
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.18.0
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.3
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-1             N/A             N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

import torch
from vllm import LLM, SamplingParams

# Replace with the path to your GPTQ model, e.g. https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4
model_path = '/content/Qwen2-7B-Instruct-GPTQ-Int4'

# Initialize the LLM
llm = LLM(model=model_path, max_model_len=4096)
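
For context on the failure below: the GPU is a Tesla T4, i.e. compute capability 7.5, and vLLM compares capabilities as major * 10 + minor (the same encoding visible in _get_quantization_config in the traceback), so the "min_capability = 75" in the error appears to correspond to this device. A minimal sketch to print that value locally, assuming PyTorch with CUDA available:

import torch

# Minimal sketch: report the device's compute capability in vLLM's
# major * 10 + minor encoding (Tesla T4 -> 7.5 -> 75).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor} (vLLM encoding: {major * 10 + minor})")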

ValueError: Marlin does not support weight_bits = uint4b8. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).
INFO 08-07 03:14:03 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-07 03:14:03 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/content/Qwen2-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/content/Qwen2-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/content/Qwen2-7B-Instruct-GPTQ-Int4, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-07 03:14:04 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-07 03:14:04 selector.py:54] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-07 03:14:06 model_runner.py:720] Starting to load model /content/Qwen2-7B-Instruct-GPTQ-Int4...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-3af6a28987e2> in <cell line: 8>()
      6 
      7 # Initialize the LLM
----> 8 llm = LLM(model=model_path, max_model_len=4096)
      9 
     10 # Set sampling parameters

14 frames
/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py in __init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    156             **kwargs,
    157         )
--> 158         self.llm_engine = LLMEngine.from_engine_args(
    159             engine_args, usage_context=UsageContext.LLM_CLASS)
    160         self.request_counter = Counter()

/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py in from_engine_args(cls, engine_args, usage_context, stat_loggers)
    443         executor_class = cls._get_executor_cls(engine_config)
    444         # Create the LLM engine.
--> 445         engine = cls(
    446             **engine_config.to_dict(),
    447             executor_class=executor_class,

/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py in __init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers)
    247             self.model_config)
    248 
--> 249         self.model_executor = executor_class(
    250             model_config=model_config,
    251             cache_config=cache_config,

/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py in __init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, prompt_adapter_config)
     45         self.prompt_adapter_config = prompt_adapter_config
     46 
---> 47         self._init_executor()
     48 
     49     @abstractmethod

/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py in _init_executor(self)
     34         self.driver_worker = self._create_worker()
     35         self.driver_worker.init_device()
---> 36         self.driver_worker.load_model()
     37 
     38     def _get_worker_kwargs(

/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py in load_model(self)
    137 
    138     def load_model(self):
--> 139         self.model_runner.load_model()
    140 
    141     def save_sharded_state(

/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py in load_model(self)
    720         logger.info("Starting to load model %s...", self.model_config.model)
    721         with CudaMemoryProfiler() as m:
--> 722             self.model = get_model(model_config=self.model_config,
    723                                    device_config=self.device_config,
    724                                    load_config=self.load_config,

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, multimodal_config, cache_config)
     19               cache_config: CacheConfig) -> nn.Module:
     20     loader = get_model_loader(load_config)
---> 21     return loader.load_model(model_config=model_config,
     22                              device_config=device_config,
     23                              lora_config=lora_config,

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in load_model(self, model_config, device_config, lora_config, multimodal_config, parallel_config, scheduler_config, cache_config)
    322         with set_default_torch_dtype(model_config.dtype):
    323             with target_device:
--> 324                 model = _initialize_model(model_config, self.load_config,
    325                                           lora_config, multimodal_config,
    326                                           cache_config, scheduler_config)

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in _initialize_model(model_config, load_config, lora_config, multimodal_config, cache_config, scheduler_config)
    150     """Initialize a model with the given configurations."""
    151     model_class = get_model_architecture(model_config)[0]
--> 152     quant_config = _get_quantization_config(model_config, load_config)
    153 
    154     return model_class(config=model_config.hf_config,

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in _get_quantization_config(model_config, load_config)
     91     """Get the quantization config."""
     92     if model_config.quantization is not None:
---> 93         quant_config = get_quant_config(model_config, load_config)
     94         capability = current_platform.get_device_capability()
     95         capability = capability[0] * 10 + capability[1]

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/weight_utils.py in get_quant_config(model_config, load_config)
    130                                   None)
    131     if hf_quant_config is not None:
--> 132         return quant_cls.from_config(hf_quant_config)
    133     # In case of bitsandbytes/QLoRA, get quant config from the adapter model.
    134     if model_config.quantization == "bitsandbytes":

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py in from_config(cls, config)
     82         lm_head_quantized = cls.get_from_keys_or(config, ["lm_head"],
     83                                                  default=False)
---> 84         return cls(weight_bits, group_size, desc_act, is_sym,
     85                    lm_head_quantized)
     86 

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py in __init__(self, weight_bits, group_size, desc_act, is_sym, lm_head_quantized)
     49 
     50         # Verify supported on platform.
---> 51         verify_marlin_supported(quant_type=self.quant_type,
     52                                 group_size=self.group_size)
     53 

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py in verify_marlin_supported(quant_type, group_size, has_zp)
     86     if not cond:
     87         assert err_msg is not None
---> 88         raise ValueError(err_msg)
     89 
     90 

ValueError: Marlin does not support weight_bits = uint4b8. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).
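
A possible workaround to try (an assumption, not a confirmed fix): pass quantization="gptq" explicitly so vLLM keeps the plain GPTQ kernel instead of auto-converting to gptq_marlin, which is the path that fails above. The quantization argument is part of the LLM constructor signature shown in the traceback; whether this actually avoids the Marlin support check on a T4 is untested here.

from vllm import LLM

# Workaround sketch (assumption): force the plain GPTQ kernel so the
# gptq_marlin conversion and its Marlin support check are not used.
model_path = '/content/Qwen2-7B-Instruct-GPTQ-Int4'
llm = LLM(model=model_path, max_model_len=4096, quantization="gptq")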
