Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I'm using InternVL3-8B-AWQ for inference with vLLM, and it is much slower than InternVL3-8B.
The device I'm using:
RTX 4090D (24 GB)
vLLM==0.9.0
Time to first token:
- InternVL3-8B: 0.81 s
- InternVL3-8B-AWQ: 1.40 s
Server launch command:
python3 -m vllm.entrypoints.openai.api_server \
    --model models--OpenGVLab--InternVL3-8B-AWQ \
    --gpu-memory-utilization 0.9 \
    --max_num_seqs 1 \
    --max-model-len 16384 \
    --served-model-name "vlm_test" \
    --limit-mm-per-prompt image=5 \
    --quantization awq \
    --trust-remote-code
Are any special settings needed for the AWQ model?
Reproduction
python3 -m vllm.entrypoints.openai.api_server \
    --model models--OpenGVLab--InternVL3-8B-AWQ \
    --gpu-memory-utilization 0.9 \
    --max_num_seqs 1 \
    --max-model-len 16384 \
    --served-model-name "vlm_test" \
    --limit-mm-per-prompt image=5 \
    --quantization awq \
    --trust-remote-code
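
Requests are sent through the OpenAI-compatible endpoint. Below is a minimal sketch of a streaming client that measures time to first token; the `openai` Python package, default port 8000, and the prompt/image URL are illustrative assumptions, not the exact request used for the numbers above.

```python
# Sketch of a streaming request to measure time to first token (TTFT).
# Assumptions: the `openai` Python package, vLLM's default port 8000,
# and a placeholder prompt/image URL -- not the exact request used above.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="vlm_test",  # matches --served-model-name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},
        ],
    }],
    stream=True,
    max_tokens=128,
)

ttft = None
for chunk in stream:
    delta = chunk.choices[0].delta if chunk.choices else None
    if ttft is None and delta is not None and delta.content:
        ttft = time.perf_counter() - start
        print(f"Time to first token: {ttft:.2f}s")
    # keep iterating so the full response is consumed
```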
Environment
RTX 4090D (24 GB)
vLLM==0.9.0