Questions about the inference performance of the GPTQ model #9240
@Rssevenyu can you run …
Of course, here is my environment information:
OS: Ubuntu 20.04.5 LTS (x86_64)
Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] (64-bit runtime)
Nvidia driver version: 535.104.12
NIC0: mlx5_0
I'm also encountering an issue where vLLM warns that 'GPTQ is not fully optimized' when running GPTQ models. Additionally, on my machine the GPTQ model does not seem to be faster than the non-quantized model.
@Rssevenyu Same question here; have you figured out the cause yet?
Why is it that, when running inference with a quantized model, the TTFT improvement is negligible while the overall inference throughput improves substantially? Also, why is inference with GPTQ Marlin slower than with plain GPTQ? What could be the reason?
Version Information:
vLLM Version: 0.6.2
Start-up Commands:
Non-quantized model:
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 7807 --model /mnt/home/Qwen1.5_32B_Chat --trust-remote-code --served-model-name Qwen --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --enforce-eager --max-model-len 8192 --enable-prefix-caching
Quantized model using GPTQ (without GPTQ Marlin kernel):
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 7807 --model /mnt/home/Qwen1.5-32B-Chat-GPTQ-Int4 --trust-remote-code --served-model-name Qwen --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --enforce-eager --max-model-len 8192 --enable-prefix-caching --quantization gptq
Quantized model using GPTQ Marlin kernel (automatic mode without specifying --quantization gptq):
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 7807 --model /mnt/home/Qwen1.5-32B-Chat-GPTQ-Int4 --trust-remote-code --served-model-name Qwen --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --enforce-eager --max-model-len 8192 --enable-prefix-caching
Test Setup:
The test script uses 4 concurrent requests with the same prompt for evaluation.
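For reference, here is a minimal sketch of such a test client (an assumption of what the script might look like, not the original; the prompt text and max_tokens are placeholders, and the endpoint matches the server commands above):

```python
# Sketch of a concurrent benchmark client: send 4 identical requests at once
# to the OpenAI-compatible completions endpoint started above.
# The prompt text and max_tokens are illustrative placeholders.
import asyncio
import aiohttp

URL = "http://localhost:7807/v1/completions"
PAYLOAD = {
    "model": "Qwen",
    "prompt": "Explain the difference between GPTQ and GPTQ Marlin.",
    "max_tokens": 256,
}

async def one_request(session: aiohttp.ClientSession) -> None:
    async with session.post(URL, json=PAYLOAD) as resp:
        resp.raise_for_status()
        await resp.json()

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Fire all 4 requests concurrently and wait for them to finish.
        await asyncio.gather(*(one_request(session) for _ in range(4)))

if __name__ == "__main__":
    asyncio.run(main())
```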
Metric Outputs:
Non-quantized Model:
Time to First Token (TTFT):
vllm:time_to_first_token_seconds_sum{model_name="Qwen"} 2.931025266647339
Time Per Output Token:
vllm:time_per_output_token_seconds_sum{model_name="Qwen"} 6.13854455947876
Quantized Model using GPTQ:
Time to First Token (TTFT):
vllm:time_to_first_token_seconds_sum{model_name="Qwen"} 2.7571163177490234
Time Per Output Token:
vllm:time_per_output_token_seconds_sum{model_name="Qwen"} 3.8764026165008545
Quantized Model using GPTQ Marlin:
Time to First Token (TTFT):
vllm:time_to_first_token_seconds_sum{model_name="Qwen"} 2.9693307876586914
Time Per Output Token:
vllm:time_per_output_token_seconds_sum{model_name="Qwen"} 4.670741319656372
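Note that the values above are Prometheus histogram `_sum` series; to compare runs on a per-request basis they should be divided by the matching `_count` series from the same /metrics endpoint. A rough sketch of that calculation (assuming the server is still running on localhost:7807; the parsing is deliberately simplistic):

```python
# Sketch: scrape /metrics and report mean TTFT and mean time-per-output-token
# by dividing each histogram _sum by its matching _count.
import urllib.request

METRICS_URL = "http://localhost:7807/metrics"

def first_value(prefix: str, text: str) -> float:
    # Return the value of the first sample line whose name starts with `prefix`.
    for line in text.splitlines():
        if line.startswith(prefix):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(prefix)

text = urllib.request.urlopen(METRICS_URL).read().decode()
for metric in ("vllm:time_to_first_token_seconds",
               "vllm:time_per_output_token_seconds"):
    total = first_value(metric + "_sum", text)
    count = first_value(metric + "_count", text)
    print(f"{metric}: mean = {total / count:.4f} s over {count:.0f} samples")
```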