Your current environment
--2024-08-07 03:22:15-- https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25132 (25K) [text/plain]
Saving to: ‘collect_env.py’
collect_env.py 100%[===================>] 24.54K --.-KB/s in 0.002s
2024-08-07 03:22:15 (13.9 MB/s) - ‘collect_env.py’ saved [25132/25132]
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.85+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 2.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 3
BogoMIPS: 4000.29
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32 KiB (1 instance)
L1i cache: 32 KiB (1 instance)
L2 cache: 1 MiB (1 instance)
L3 cache: 38.5 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0,1
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable; SMT Host state unknown
Vulnerability Meltdown: Vulnerable
Vulnerability Mmio stale data: Vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Vulnerable (Syscall hardening enabled)
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] optree==0.12.1
[pip3] pyzmq==24.0.1
[pip3] torch==2.3.1
[pip3] torchaudio==2.3.1+cu121
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.18.0
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.3
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-1 N/A N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
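The GPU reported above is a Tesla T4, i.e. CUDA compute capability 7.5, which is the min_capability = 75 that appears in the Marlin error below. To confirm the capability on your own machine, a minimal sketch using the standard PyTorch API (nothing vLLM-specific):
import torch
# Print the compute capability of each visible GPU.
# A Tesla T4 reports (7, 5); the empty supported-type list in the error below
# suggests the Marlin path expects a newer capability than this.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> capability {major}.{minor}")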
🐛 Describe the bug
import torch
from vllm import LLM, SamplingParams
# Replace with the path to your GPTQ model (https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int4)
model_path = '/content/Qwen2-7B-Instruct-GPTQ-Int4'
# Initialize the LLM
llm = LLM(model=model_path, max_model_len=4096)
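The pasted cell is cut off after the LLM(...) call (the traceback below shows a "# Set sampling parameters" comment at line 10). For context, a minimal sketch of how such a cell typically continues with vLLM; the prompt text and sampling values here are illustrative, not from the original report, and none of this code is reached because the crash happens inside LLM(...) above:
# Illustrative continuation only.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
outputs = llm.generate(["Give me a short introduction to large language models."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)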
ValueError: Marlin does not support weight_bits = uint4b8. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).
INFO 08-07 03:14:03 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-07 03:14:03 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/content/Qwen2-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/content/Qwen2-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/content/Qwen2-7B-Instruct-GPTQ-Int4, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-07 03:14:04 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-07 03:14:04 selector.py:54] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-07 03:14:06 model_runner.py:720] Starting to load model /content/Qwen2-7B-Instruct-GPTQ-Int4...
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-3af6a28987e2> in <cell line: 8>()
6
7 # Initialize the LLM
----> 8 llm = LLM(model=model_path, max_model_len=4096)
9
10 # Set sampling parameters
14 frames
/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py in __init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
156 **kwargs,
157 )
--> 158 self.llm_engine = LLMEngine.from_engine_args(
159 engine_args, usage_context=UsageContext.LLM_CLASS)
160 self.request_counter = Counter()
/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py in from_engine_args(cls, engine_args, usage_context, stat_loggers)
443 executor_class = cls._get_executor_cls(engine_config)
444 # Create the LLM engine.
--> 445 engine = cls(
446 **engine_config.to_dict(),
447 executor_class=executor_class,
/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py in __init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers)
247 self.model_config)
248
--> 249 self.model_executor = executor_class(
250 model_config=model_config,
251 cache_config=cache_config,
/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py in __init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, prompt_adapter_config)
45 self.prompt_adapter_config = prompt_adapter_config
46
---> 47 self._init_executor()
48
49 @abstractmethod
/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py in _init_executor(self)
34 self.driver_worker = self._create_worker()
35 self.driver_worker.init_device()
---> 36 self.driver_worker.load_model()
37
38 def _get_worker_kwargs(
/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py in load_model(self)
137
138 def load_model(self):
--> 139 self.model_runner.load_model()
140
141 def save_sharded_state(
/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py in load_model(self)
720 logger.info("Starting to load model %s...", self.model_config.model)
721 with CudaMemoryProfiler() as m:
--> 722 self.model = get_model(model_config=self.model_config,
723 device_config=self.device_config,
724 load_config=self.load_config,
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, multimodal_config, cache_config)
19 cache_config: CacheConfig) -> nn.Module:
20 loader = get_model_loader(load_config)
---> 21 return loader.load_model(model_config=model_config,
22 device_config=device_config,
23 lora_config=lora_config,
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in load_model(self, model_config, device_config, lora_config, multimodal_config, parallel_config, scheduler_config, cache_config)
322 with set_default_torch_dtype(model_config.dtype):
323 with target_device:
--> 324 model = _initialize_model(model_config, self.load_config,
325 lora_config, multimodal_config,
326 cache_config, scheduler_config)
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in _initialize_model(model_config, load_config, lora_config, multimodal_config, cache_config, scheduler_config)
150 """Initialize a model with the given configurations."""
151 model_class = get_model_architecture(model_config)[0]
--> 152 quant_config = _get_quantization_config(model_config, load_config)
153
154 return model_class(config=model_config.hf_config,
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py in _get_quantization_config(model_config, load_config)
91 """Get the quantization config."""
92 if model_config.quantization is not None:
---> 93 quant_config = get_quant_config(model_config, load_config)
94 capability = current_platform.get_device_capability()
95 capability = capability[0] * 10 + capability[1]
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/weight_utils.py in get_quant_config(model_config, load_config)
130 None)
131 if hf_quant_config is not None:
--> 132 return quant_cls.from_config(hf_quant_config)
133 # In case of bitsandbytes/QLoRA, get quant config from the adapter model.
134 if model_config.quantization == "bitsandbytes":
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py in from_config(cls, config)
82 lm_head_quantized = cls.get_from_keys_or(config, ["lm_head"],
83 default=False)
---> 84 return cls(weight_bits, group_size, desc_act, is_sym,
85 lm_head_quantized)
86
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py in __init__(self, weight_bits, group_size, desc_act, is_sym, lm_head_quantized)
49
50 # Verify supported on platform.
---> 51 verify_marlin_supported(quant_type=self.quant_type,
52 group_size=self.group_size)
53
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py in verify_marlin_supported(quant_type, group_size, has_zp)
86 if not cond:
87 assert err_msg is not None
---> 88 raise ValueError(err_msg)
89
90
ValueError: Marlin does not support weight_bits = uint4b8. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).
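The "convertible to gptq_marlin" INFO line above shows vLLM auto-switching the GPTQ checkpoint to the Marlin kernel, which then rejects the uint4b8 type on this compute-capability-7.5 T4. A minimal workaround sketch, assuming the plain GPTQ kernel is acceptable for this model (quantization is a real LLM parameter, but whether forcing it fully avoids the capability check on this exact version is an assumption to verify):
# Force the non-Marlin GPTQ kernel instead of letting vLLM upgrade to gptq_marlin.
# Assumption: the plain "gptq" kernel handles this checkpoint on a T4 (capability 7.5).
llm = LLM(model=model_path, max_model_len=4096, quantization="gptq")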