Proposal to improve performance
No response
Report of performance regression
Hardware: 4x RTX 3070 = 32GB VRAM
Issue: I was able to run Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4 with a 12K context length on 0.6.x. Now, with 0.7.0 + VLLM_USE_V1=1, I cannot push the context length past 3K without hitting a CUDA OOM error.
Of course, I can reconfigure it to avoid the OOM; my question is: is V1 expected to consume more memory?
Relevant library versions:
flashinfer==0.1.6+cu124torch2.4
torch==2.5.1
transformers==4.48.1
vllm==0.7.0
vLLM command:
- vllm
- serve
- Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4
- --gpu-memory-utilization=1
- --tensor-parallel-size=4
- --load-format=auto
- --enforce-eager
- --swap-space=0
- --max-model-len=12K
- --max-num-batched-tokens=12K
- --disable-fastapi-docs
- --trust-remote-code
- --enable-auto-tool-choice
- --tool-call-parser=hermes
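For convenience, here is the same launch as a single shell command, with the environment variable made explicit. This is only a sketch of how I run it; my assumption (not verified) is that setting VLLM_USE_V1=0 on 0.7.0 falls back to the previous engine and would match the 0.6.x behaviour at 12K.

```bash
# Same flags as the list above. With VLLM_USE_V1=1 this OOMs at startup once
# --max-model-len goes past ~3K on 4x RTX 3070 (8 GB each).
# Assumption: VLLM_USE_V1=0 would select the previous (V0) engine on 0.7.0.
VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4 \
  --gpu-memory-utilization=1 \
  --tensor-parallel-size=4 \
  --load-format=auto \
  --enforce-eager \
  --swap-space=0 \
  --max-model-len=12K \
  --max-num-batched-tokens=12K \
  --disable-fastapi-docs \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser=hermes
```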
Thanks
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.