Description
Your current environment
The output of python collect_env.py
Collecting environment information...
==============================
System Info
==============================
OS : Rocky Linux 9.4 (Blue Onyx) (x86_64)
GCC version : (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
Clang version : Could not collect
CMake version : Could not collect
Libc version : glibc-2.34
==============================
PyTorch Info
==============================
PyTorch version : 2.7.0+cu126
Is debug build : False
CUDA used to build PyTorch : 12.6
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.10.16 | packaged by conda-forge | (main, Dec 5 2024, 14:16:10) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.34
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version : 545.23.08
cuDNN version : Could not collect
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7543 32-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
Stepping: 1
BogoMIPS: 5589.44
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 invpcid_single hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq overflow_recov succor smca
Virtualization: AMD-V
L1d cache: 2 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 32 MiB (64 instances)
L3 cache: 512 MiB (16 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
NUMA node4 CPU(s): 32-39
NUMA node5 CPU(s): 40-47
NUMA node6 CPU(s): 48-55
NUMA node7 CPU(s): 56-63
==============================
Versions of relevant libraries
==============================
[pip3] gptqmodel==2.2.0+cu121torch2.5
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] onnx==1.17.0
[pip3] onnx-simplifier==0.4.36
[pip3] onnxruntime==1.21.0
[pip3] onnxscript==0.2.2
[pip3] pyzmq==26.3.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.52.3
[pip3] triton==3.3.0
[conda] gptqmodel 2.2.0+cu121torch2.5 pypi_0 pypi
[conda] numpy 1.26.4 py310hb13e2d6_0 conda-forge
[conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi
[conda] nvidia-cufile-cu12 1.11.1.6 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi
[conda] pyzmq 26.3.0 py310h71f11fc_0 conda-forge
[conda] torch 2.7.0 pypi_0 pypi
[conda] torchaudio 2.7.0 pypi_0 pypi
[conda] torchvision 0.22.0 pypi_0 pypi
[conda] transformers 4.52.3 pypi_0 pypi
[conda] triton 3.3.0 pypi_0 pypi
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS 24-31 3 N/A
NIC0 SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
==============================
Environment Variables
==============================
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_madhusudhanan.a
CUDA_DEVICE_ORDER=PCI_BUS_ID
CUDA_VERSION=12.1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
CUDA_VISIBLE_DEVICES=0
CUDA_PATH=/shared/centos7/cuda/12.1
CUDNN_VERSION=8.9.3
LD_LIBRARY_PATH=/.singularity.d/libs
CUDA_HOME=/shared/centos7/cuda/12.1
CUDA_MODULE_LOADING=LAZY
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
I'm running into an issue where a GPTQ-quantized version of Qwen2-VL-2B-Instruct (quantized with the GPTQModel library) produces coherent results with Hugging Face transformers but yields poor output when served with vLLM. The inference code for both Hugging Face and vLLM is attached below to reproduce the outputs.
Interestingly, the quantized model published by the Qwen developers (Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4) works as expected with both vLLM and HF using the same code.
Output generated by vLLM: "user\nassistant\nuser\nassistant\nI'm sorry, but I can't assist with that."
Output generated by HF: "['The image shows a woman sitting on a sandy beach at sunset. She is wearing a plaid shirt and is smiling as she high-fives a large dog. The dog is wearing a colorful harness and is sitting on the sand. The background features the ocean with gentle waves, and the sky is clear with a warm, golden hue from the setting sun. The overall atmosphere is serene and joyful.']"
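For context, the custom checkpoint was quantized with GPTQModel 2.2.0. The exact quantization script and calibration data were not included in this report, so the sketch below is only an assumption of a typical GPTQModel 4-bit recipe (the model ID, bit width, group size, and calibration set shown are placeholders, not the reporter's actual settings):

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Hypothetical settings; the actual bits/group_size/calibration data are unknown.
model_id = "Qwen/Qwen2-VL-2B-Instruct"
quant_path = "Qwen2-VL-2B-Instruct-4bit-GPTQ"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(512))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=2)
model.save(quant_path)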
vLLM code:
import torch
from dataclasses import asdict
from typing import NamedTuple, Optional

from huggingface_hub import snapshot_download
from PIL import Image

from vllm import LLM, EngineArgs, LLMEngine, RequestOutput, SamplingParams
from vllm.lora.request import LoRARequest


class ModelRequestData(NamedTuple):
    engine_args: EngineArgs
    prompts: list[str]
    stop_token_ids: Optional[list[int]] = None
    lora_requests: Optional[list[LoRARequest]] = None


def run_qwen2_vl(questions: list[str], modality: str) -> ModelRequestData:
    model_name = "arunmadhusudh/Qwen2-VL-2B-Instruct-4bit-GPTQ_A100"
    engine_args = EngineArgs(
        model=model_name,
        max_model_len=4096,
        max_num_seqs=5,
        enable_lora=False,
        mm_processor_kwargs={
            "min_pixels": 28 * 28,
            "max_pixels": 1280 * 28 * 28,
        },
        quantization="gptq_marlin",
        limit_mm_per_prompt={modality: 1},
        trust_remote_code=True,
        max_seq_len_to_capture=48000,
    )

    if modality == "image":
        placeholder = "<|image_pad|>"
    elif modality == "video":
        placeholder = "<|video_pad|>"

    # Build prompts in the Qwen2-VL chat format with a single vision placeholder.
    prompts = [
        ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
         f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
         f"{question}<|im_end|>\n"
         "<|im_start|>assistant\n") for question in questions
    ]

    return ModelRequestData(
        engine_args=engine_args,
        prompts=prompts,
    )


modality = "image"
data = Image.open("/home/madhusudhanan.a/vlms/demo.jpeg")
questions = ["What is happening in this image."]

req_data = run_qwen2_vl(questions, modality)

# Merge the per-prompt multimodal limits with the defaults.
default_limits = {"image": 0, "video": 0, "audio": 0}
req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
    req_data.engine_args.limit_mm_per_prompt or {})

engine_args = asdict(req_data.engine_args)
llm = LLM(**engine_args)

prompts = req_data.prompts
sampling_params = SamplingParams(temperature=0.2,
                                 max_tokens=128,
                                 stop_token_ids=req_data.stop_token_ids)

inputs = {
    "prompt": prompts[0],
    "multi_modal_data": {
        modality: data
    },
}

outputs = llm.generate(
    inputs,
    sampling_params=sampling_params,
    # lora_request=lora_request,
)
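The script above stops after llm.generate and never prints the completions; a minimal addition like the following (using vLLM's standard RequestOutput fields) reproduces the vLLM output quoted above:

# Print the generated text of each completion returned by llm.generate.
for output in outputs:
    print(output.outputs[0].text)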
HF code:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "arunmadhusudh/Qwen2-VL-2B-Instruct-4bit-GPTQ_A100",
    torch_dtype=torch.float16,
    device_map="auto",
)

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token count
# range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "arunmadhusudh/Qwen2-VL-2B-Instruct-4bit-GPTQ_A100",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.