[Performance]: Maximizing the performance of batch inference of big models on vllm 0.6.3 #9383
Labels: performance
Hi all, I'm having trouble maximizing the performance of batch inference of big models (Llama 3.1 70B, 405B, Mistral Large) on vLLM 0.6.3.
My command to run the server is: "python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Large-Instruct-2407 --tensor-parallel-size 4 --guided-decoding-backend lm-format-enforcer --enable-chunked-prefill --enable-prefix-caching"
Specifically, I'm running on 4x A100 80GB hardware.
I launch requests with large min_tokens and max_tokens values (30,000).
I set n = 8 to get 8 responses and run them in parallel.
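In case it helps, the requests look roughly like this (a sketch against the OpenAI-compatible endpoint; the host/port and prompt are placeholders, and min_tokens is passed alongside the standard OpenAI fields as a vLLM-specific sampling parameter):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-Large-Instruct-2407",
        "messages": [{"role": "user", "content": "<long prompt here>"}],
        "n": 8,
        "min_tokens": 30000,
        "max_tokens": 30000
      }'
```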
It appears that, despite identical min and max token values, my average generation throughput starts very high (~100+ tokens/s) and slowly degrades over time to a crawl (I was seeing 4 tokens/s before I stopped generation with Mistral Large). This makes it take a prohibitively long time to get outputs.
I used to have max_tokens set to a very high value but min_tokens set low; the model usually gave short outputs but was able to consistently keep a high tok/s.
I need to get outputs in a reasonable time. Setting n lower cripples my t/s, this doesn't appear to be a GPU memory issue, and lowering min/max tokens isn't an option (the outputs need to be very long). What configuration/settings changes can I make to optimize my inference environment for this kind of workload?
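For reference, these are the kinds of scheduler/memory knobs I'm aware of on top of my current command; the values below are placeholders I haven't validated, not a known-good configuration:

```bash
# Same launch command, plus the flags I could tune:
#   --gpu-memory-utilization : fraction of GPU memory vLLM may use (weights + KV cache)
#   --max-num-seqs           : cap on sequences scheduled concurrently per step
#   --max-num-batched-tokens : per-step token budget (interacts with chunked prefill)
python3 -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Large-Instruct-2407 \
  --tensor-parallel-size 4 \
  --guided-decoding-backend lm-format-enforcer \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192
```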
Your current environment (if you think it is necessary)
Collecting environment information...
Traceback (most recent call last):
  File "/home/lain/collect_env.py", line 743, in <module>
    main()
  File "/home/lain/collect_env.py", line 722, in main
    output = get_pretty_env_info()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lain/collect_env.py", line 717, in get_pretty_env_info
    return pretty_str(get_env_info())
                      ^^^^^^^^^^^^^^
  File "/home/lain/collect_env.py", line 549, in get_env_info
    vllm_version = get_vllm_version()
                   ^^^^^^^^^^^^^^^^^^
  File "/home/lain/collect_env.py", line 270, in get_vllm_version
    from vllm import __version__, __version_tuple__
ImportError: cannot import name '__version_tuple__' from 'vllm' (/home/lain/micromamba/envs/vllm/lib/python3.11/site-packages/vllm/__init__.py)
(it seems the collect_env.py script is broken)
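As a quick manual check instead (assuming the installed vllm package exports __version__, which the 0.6.x releases do):

```bash
python3 -c "import vllm; print(vllm.__version__)"
# reports 0.6.3 in my environment
```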
vLLM 0.6.3, started with:
"python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Large-Instruct-2407 --tensor-parallel-size 4 --guided-decoding-backend lm-format-enforcer --enable-chunked-prefill --enable-prefix-caching"