Description
Your current environment
The output of python collect_env.py
Collecting environment information...
==============================
System Info
==============================
OS : Rocky Linux 9.4 (Blue Onyx) (x86_64)
GCC version : (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
Clang version : Could not collect
CMake version : Could not collect
Libc version : glibc-2.34
==============================
PyTorch Info
==============================
PyTorch version : 2.7.0+cu126
Is debug build : False
CUDA used to build PyTorch : 12.6
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.10.16 | packaged by conda-forge | (main, Dec 5 2024, 14:16:10) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.34
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version : 545.23.08
cuDNN version : Could not collect
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7543 32-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
Stepping: 1
BogoMIPS: 5589.44
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 invpcid_single hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq overflow_recov succor smca
Virtualization: AMD-V
L1d cache: 2 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 32 MiB (64 instances)
L3 cache: 512 MiB (16 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
NUMA node4 CPU(s): 32-39
NUMA node5 CPU(s): 40-47
NUMA node6 CPU(s): 48-55
NUMA node7 CPU(s): 56-63
==============================
Versions of relevant libraries
==============================
[pip3] gptqmodel==2.2.0+cu121torch2.5
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] onnx==1.17.0
[pip3] onnx-simplifier==0.4.36
[pip3] onnxruntime==1.21.0
[pip3] onnxscript==0.2.2
[pip3] pyzmq==26.3.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.52.3
[pip3] triton==3.3.0
[conda] gptqmodel 2.2.0+cu121torch2.5 pypi_0 pypi
[conda] numpy 1.26.4 py310hb13e2d6_0 conda-forge
[conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi
[conda] nvidia-cufile-cu12 1.11.1.6 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi
[conda] pyzmq 26.3.0 py310h71f11fc_0 conda-forge
[conda] torch 2.7.0 pypi_0 pypi
[conda] torchaudio 2.7.0 pypi_0 pypi
[conda] torchvision 0.22.0 pypi_0 pypi
[conda] transformers 4.52.3 pypi_0 pypi
[conda] triton 3.3.0 pypi_0 pypi
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS 24-31 3 N/A
NIC0 SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
==============================
Environment Variables
==============================
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_madhusudhanan.a
CUDA_DEVICE_ORDER=PCI_BUS_ID
CUDA_VERSION=12.1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
CUDA_VISIBLE_DEVICES=0
CUDA_PATH=/shared/centos7/cuda/12.1
CUDNN_VERSION=8.9.3
LD_LIBRARY_PATH=/.singularity.d/libs
CUDA_HOME=/shared/centos7/cuda/12.1
CUDA_MODULE_LOADING=LAZY
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
I'm running into an issue where a GPTQ-quantized version of Qwen2-VL-2B-Instruct (quantized with the GPTQModel library) produces coherent results with Hugging Face transformers but yields poor output when served with vLLM. The inference code for both Hugging Face and vLLM is attached below to reproduce the outputs.
Interestingly, the quantized model published by the Qwen developers (Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4) works as expected with both vLLM and HF using the same code.
Output generated by vLLM: "user\nassistant\nuser\nassistant\nI'm sorry, but I can't assist with that."
Output generated by HF: "['The image shows a woman sitting on a sandy beach at sunset. She is wearing a plaid shirt and is smiling as she high-fives a large dog. The dog is wearing a colorful harness and is sitting on the sand. The background features the ocean with gentle waves, and the sky is clear with a warm, golden hue from the setting sun. The overall atmosphere is serene and joyful.']"
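For context, the custom checkpoint was quantized with GPTQModel 2.2.0. The exact quantization script and calibration data were not included in this report, so the sketch below is only an assumption of a typical GPTQModel 4-bit recipe (the model ID, bit width, group size, and calibration set shown are placeholders, not the reporter's actual settings):

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Hypothetical settings; the actual bits/group_size/calibration data are unknown.
model_id = "Qwen/Qwen2-VL-2B-Instruct"
quant_path = "Qwen2-VL-2B-Instruct-4bit-GPTQ"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(512))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=2)
model.save(quant_path)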
vLLM code:
import torch
from dataclasses import asdict
from typing import NamedTuple, Optional

from huggingface_hub import snapshot_download
from PIL import Image

from vllm import LLM, EngineArgs, LLMEngine, RequestOutput, SamplingParams
from vllm.lora.request import LoRARequest


class ModelRequestData(NamedTuple):
    engine_args: EngineArgs
    prompts: list[str]
    stop_token_ids: Optional[list[int]] = None
    lora_requests: Optional[list[LoRARequest]] = None


def run_qwen2_vl(questions: list[str], modality: str) -> ModelRequestData:
    model_name = "arunmadhusudh/Qwen2-VL-2B-Instruct-4bit-GPTQ_A100"
    engine_args = EngineArgs(
        model=model_name,
        max_model_len=4096,
        max_num_seqs=5,
        enable_lora=False,
        mm_processor_kwargs={
            "min_pixels": 28 * 28,
            "max_pixels": 1280 * 28 * 28,
        },
        quantization="gptq_marlin",
        limit_mm_per_prompt={modality: 1},
        trust_remote_code=True,
        max_seq_len_to_capture=48000,
    )

    if modality == "image":
        placeholder = "<|image_pad|>"
    elif modality == "video":
        placeholder = "<|video_pad|>"

    # Build prompts in the Qwen2-VL chat format with a single vision placeholder.
    prompts = [
        ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
         f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
         f"{question}<|im_end|>\n"
         "<|im_start|>assistant\n") for question in questions
    ]

    return ModelRequestData(
        engine_args=engine_args,
        prompts=prompts,
    )


modality = "image"
data = Image.open("/home/madhusudhanan.a/vlms/demo.jpeg")
questions = ["What is happening in this image."]

req_data = run_qwen2_vl(questions, modality)

# Merge the per-prompt multimodal limits with the defaults.
default_limits = {"image": 0, "video": 0, "audio": 0}
req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
    req_data.engine_args.limit_mm_per_prompt or {})

engine_args = asdict(req_data.engine_args)
llm = LLM(**engine_args)

prompts = req_data.prompts
sampling_params = SamplingParams(temperature=0.2,
                                 max_tokens=128,
                                 stop_token_ids=req_data.stop_token_ids)

inputs = {
    "prompt": prompts[0],
    "multi_modal_data": {
        modality: data
    },
}

outputs = llm.generate(
    inputs,
    sampling_params=sampling_params,
    # lora_request=lora_request,
)
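The script above stops after llm.generate and never prints the completions; a minimal addition like the following (using vLLM's standard RequestOutput fields) reproduces the vLLM output quoted above:

# Print the generated text of each completion returned by llm.generate.
for output in outputs:
    print(output.outputs[0].text)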
HF code:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "arunmadhusudh/Qwen2-VL-2B-Instruct-4bit-GPTQ_A100",
    torch_dtype=torch.float16,
    device_map="auto",
)

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token count
# range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "arunmadhusudh/Qwen2-VL-2B-Instruct-4bit-GPTQ_A100",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.