
[Bug] InternVL3-8B-AWQ is much slower than InternVL3-8B #1057

Open
@GooseHuang

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I'm running inference with InternVL3-8B-AWQ on vLLM, and it is much slower than InternVL3-8B.

The device I'm using:
RTX 4090D (24 GB)
vLLM==0.9.0

Time to first token:
InternVL3-8B 0.81s
InternVL3-8B-AWQ 1.40s
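
For reference, a minimal sketch of how these timings could be measured against the OpenAI-compatible endpoint. It assumes the server started with the command below is listening on the default port 8000 under the served name "vlm_test"; the image URL is a placeholder, not from the original report.

import time
from openai import OpenAI

# Assumes the vLLM OpenAI-compatible server from the command below,
# listening on localhost:8000 with --served-model-name "vlm_test".
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="vlm_test",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder image URL (assumption); replace with a real image.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/test.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    stream=True,
)
for chunk in stream:
    # The first chunk that carries content marks the first generated token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"Time to first token: {time.perf_counter() - start:.2f}s")
        break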

Command Settings:
python3 -m vllm.entrypoints.openai.api_server \
    --model models--OpenGVLab--InternVL3-8B-AWQ \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 1 \
    --max-model-len 16384 \
    --served-model-name "vlm_test" \
    --limit-mm-per-prompt image=5 \
    --quantization awq \
    --trust-remote-code

Are any special settings needed to get the expected speedup from the AWQ model?

Reproduction

python3 -m vllm.entrypoints.openai.api_server \
    --model models--OpenGVLab--InternVL3-8B-AWQ \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 1 \
    --max-model-len 16384 \
    --served-model-name "vlm_test" \
    --limit-mm-per-prompt image=5 \
    --quantization awq \
    --trust-remote-code

Environment

RTX 4090D (24 GB)
vLLM==0.9.0

Error traceback
