
[Bug]: The new version (v0.5.4) cannot load the gptq model, but the old version (vllm=0.5.3.post1) can do it. #7240

Closed
ningwebbeginner opened this issue Aug 7, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@ningwebbeginner

Your current environment

--2024-08-07 03:22:15--  https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25132 (25K) [text/plain]
Saving to: ‘collect_env.py’

collect_env.py      100%[===================>]  24.54K  --.-KB/s    in 0.002s  

2024-08-07 03:22:15 (13.9 MB/s) - ‘collect_env.py’ saved [25132/25132]

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.85+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               2
On-line CPU(s) list:                  0,1
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU @ 2.00GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   1
Socket(s):                            1
Stepping:                             3
BogoMIPS:                             4000.29
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            32 KiB (1 instance)
L1i cache:                            32 KiB (1 instance)
L2 cache:                             1 MiB (1 instance)
L3 cache:                             38.5 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0,1
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable; SMT Host state unknown
Vulnerability Meltdown:               Vulnerable
Vulnerability Mmio stale data:        Vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Vulnerable
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:             Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Vulnerable (Syscall hardening enabled)
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Vulnerable

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] optree==0.12.1
[pip3] pyzmq==24.0.1
[pip3] torch==2.3.1
[pip3] torchaudio==2.3.1+cu121
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.18.0
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.3
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-1		N/A		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

import torch
from vllm import LLM, SamplingParams

# Replace with the path to your GPTQ model (e.g. https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4)
model_path = '/content/Qwen2-7B-Instruct-GPTQ-Int4'

# Initialize the LLM
llm = LLM(model=model_path, max_model_len=4096)

ValueError: Marlin does not support weight_bits = uint4b8. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).
INFO 08-07 03:14:03 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-07 03:14:03 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/content/Qwen2-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/content/Qwen2-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/content/Qwen2-7B-Instruct-GPTQ-Int4, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-07 03:14:04 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-07 03:14:04 selector.py:54] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-07 03:14:06 model_runner.py:720] Starting to load model /content/Qwen2-7B-Instruct-GPTQ-Int4...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-3af6a28987e2> in <cell line: 8>()
      6 
      7 # Initialize the LLM
----> 8 llm = LLM(model=model_path, max_model_len=4096)
      9 
     10 # Set sampling parameters

14 frames
/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py in __init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
    156             **kwargs,
    157         )
--> 158         self.llm_engine = LLMEngine.from_engine_args(
    159             engine_args, usage_context=UsageContext.LLM_CLASS)
    160         self.request_counter = Counter()

/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py in from_engine_args(cls, engine_args, usage_context, stat_loggers)
    443         executor_class = cls._get_executor_cls(engine_config)
    444         # Create the LLM engine.
--> 445         engine = cls(
    446             **engine_config.to_dict(),
    447             executor_class=executor_class,

/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py in __init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers)
    247             self.model_config)
    248 
--> 249         self.model_executor = executor_class(
    250             model_config=model_config,
    251             cache_config=cache_config,

/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py in __init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, prompt_adapter_config)
     45         self.prompt_adapter_config = prompt_adapter_config
     46 
---> 47         self._init_executor()
     48 
     49     @abstractmethod

/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py in _init_executor(self)
     34         self.driver_worker = self._create_worker()
     35         self.driver_worker.init_device()
---> 36         self.driver_worker.load_model()
     37 
     38     def _get_worker_kwargs(

/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py in load_model(self)
    137 
    138     def load_model(self):
--> 139         self.model_runner.load_model()
    140 
    141     def save_sharded_state(

/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py in load_model(self)
    720         logger.info("Starting to load model %s...", self.model_config.model)
    721         with CudaMemoryProfiler() as m:
--> 722             self.model = get_model(model_config=self.model_config,
    723                                    device_config=self.device_config,
    724                                    load_config=self.load_config,

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, multimodal_config, cache_config)
     19               cache_config: CacheConfig) -> nn.Module:
     20     loader = get_model_loader(load_config)
---> 21     return loader.load_model(model_config=model_config,
     22                              device_config=device_config,
     23                              lora_config=lora_config,

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in load_model(self, model_config, device_config, lora_config, multimodal_config, parallel_config, scheduler_config, cache_config)
    322         with set_default_torch_dtype(model_config.dtype):
    323             with target_device:
--> 324                 model = _initialize_model(model_config, self.load_config,
    325                                           lora_config, multimodal_config,
    326                                           cache_config, scheduler_config)

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in _initialize_model(model_config, load_config, lora_config, multimodal_config, cache_config, scheduler_config)
    150     """Initialize a model with the given configurations."""
    151     model_class = get_model_architecture(model_config)[0]
--> 152     quant_config = _get_quantization_config(model_config, load_config)
    153 
    154     return model_class(config=model_config.hf_config,

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in _get_quantization_config(model_config, load_config)
     91     """Get the quantization config."""
     92     if model_config.quantization is not None:
---> 93         quant_config = get_quant_config(model_config, load_config)
     94         capability = current_platform.get_device_capability()
     95         capability = capability[0] * 10 + capability[1]

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/weight_utils.py in get_quant_config(model_config, load_config)
    130                                   None)
    131     if hf_quant_config is not None:
--> 132         return quant_cls.from_config(hf_quant_config)
    133     # In case of bitsandbytes/QLoRA, get quant config from the adapter model.
    134     if model_config.quantization == "bitsandbytes":

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py in from_config(cls, config)
     82         lm_head_quantized = cls.get_from_keys_or(config, ["lm_head"],
     83                                                  default=False)
---> 84         return cls(weight_bits, group_size, desc_act, is_sym,
     85                    lm_head_quantized)
     86 

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py in __init__(self, weight_bits, group_size, desc_act, is_sym, lm_head_quantized)
     49 
     50         # Verify supported on platform.
---> 51         verify_marlin_supported(quant_type=self.quant_type,
     52                                 group_size=self.group_size)
     53 

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py in verify_marlin_supported(quant_type, group_size, has_zp)
     86     if not cond:
     87         assert err_msg is not None
---> 88         raise ValueError(err_msg)
     89 
     90 

ValueError: Marlin does not support weight_bits = uint4b8. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).
ningwebbeginner added the bug label on Aug 7, 2024

mars-ch commented Aug 7, 2024

+1


linpan commented Aug 7, 2024

Marlin


Sk4467 commented Aug 7, 2024

+1
Facing the same issue.

Collaborator

robertgshaw2-neuralmagic commented Aug 7, 2024

@LucasWilkinson will take a look

Collaborator

robertgshaw2-neuralmagic commented Aug 7, 2024

Explicitly setting quantization="gptq" should unblock you for now on a T4

We will look into the issue
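
i.e., roughly (a minimal sketch; the model path is the one from the report above):

from vllm import LLM

# Passing quantization="gptq" keeps vLLM on the plain GPTQ kernel instead of
# auto-converting to gptq_marlin, which is what raises the Marlin ValueError on a T4.
llm = LLM(model='/content/Qwen2-7B-Instruct-GPTQ-Int4',
          max_model_len=4096,
          quantization="gptq")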


12sang3 commented Aug 8, 2024

+1. Does anyone have a good workaround for this?

@ningwebbeginner
Author

+1. Does anyone have a good workaround for this?

You can roll back to the old version with pip install vllm==0.5.3.post1, or, as someone replied above, set quantization="gptq" if you are using a T4.
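
In other words (versions and flags as mentioned in this thread):

pip install vllm==0.5.3.post1

or stay on v0.5.4 and pass quantization="gptq" to LLM() as shown in the comment above.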

@robertgshaw2-neuralmagic
Collaborator

Closing because this is fixed by #7264

@HelloCard

vllm [v0.5.4], shuyuej/Mistral-Nemo-Instruct-2407-GPTQ-INT8

(base) root@DESKTOP-O6DNFE1:/mnt/c/Windows/system32# CUDA_VISIBLE_DEVICES=1 python3 -m vllm.entrypoints.openai.api_server --model /mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8 --max-num-seqs=1 --max-model-len 8192 --gpu-memory-utilization 0.85
INFO 08-08 23:00:06 api_server.py:339] vLLM API server version 0.5.4
INFO 08-08 23:00:06 api_server.py:340] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.85, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=1, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-08 23:00:06 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-08 23:00:06 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-08 23:00:06 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8', speculative_config=None, tokenizer='/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-08 23:00:06 utils.py:578] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 08-08 23:00:07 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-08 23:00:07 selector.py:54] Using XFormers backend.
/root/miniconda3/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/root/miniconda3/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-08 23:00:08 model_runner.py:720] Starting to load model /mnt/e/Code/models/Mistral-Nemo-Instruct-2407-GPTQ-INT8...
Process Process-1:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/miniconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
             ^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
                          ^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
    self.driver_worker.load_model()
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 324, in load_model
    model = _initialize_model(model_config, self.load_config,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 152, in _initialize_model
    quant_config = _get_quantization_config(model_config, load_config)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 93, in _get_quantization_config
    quant_config = get_quant_config(model_config, load_config)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 132, in get_quant_config
    return quant_cls.from_config(hf_quant_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 84, in from_config
    return cls(weight_bits, group_size, desc_act, is_sym,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 51, in __init__
    verify_marlin_supported(quant_type=self.quant_type,
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 88, in verify_marlin_supported
    raise ValueError(err_msg)
ValueError: Marlin does not support weight_bits = uint8b128. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).

add "--quantization gptq" and then OK.


12sang3 commented Aug 9, 2024

p

Hello, what does this mean?


dev1ous commented Aug 30, 2024

Hello, still the same error on a T4 with 'neuralmagic/Mistral-Nemo-Instruct-2407-quantized.w4a16'
