
[Performance]: Maximizing the performance of batch inference of big models on vllm 0.6.3 #9383

Open
Hellisotherpeople opened this issue Oct 15, 2024 · 1 comment
Labels
performance Performance-related issues

Comments


Hellisotherpeople commented Oct 15, 2024

Misc discussion on performance

Hi all, I'm having trouble with maximizing the performance of batch inference of big models on vllm 0.6.3

(Llama 3.1 70b, 405b, Mistral large)

My command to run the server is: `python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Large-Instruct-2407 --tensor-parallel-size 4 --guided-decoding-backend lm-format-enforcer --enable-chunked-prefill --enable-prefix-caching`

Specifically, I'm running on 4xA100 80GB hardware

I launch requests with large min_tokens and max_tokens values (30,000 each), and I set n = 8 so that 8 responses are generated in parallel.
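Roughly, each request looks like the following (a minimal sketch against the OpenAI-compatible endpoint started above; the prompt is a placeholder, and `min_tokens` is passed through `extra_body` because it is a vLLM-specific sampling parameter):

```python
# Sketch of the request pattern described above (placeholder prompt;
# assumes the server from the command above is listening on localhost:8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-Instruct-2407",
    messages=[{"role": "user", "content": "..."}],  # placeholder prompt
    n=8,                                  # 8 completions generated in parallel
    max_tokens=30_000,
    extra_body={"min_tokens": 30_000},    # vLLM-specific: force very long outputs
)

for choice in response.choices:
    print(len(choice.message.content))
```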

It appears that, despite identical min and max token values, my average generation throughput starts very high (~100+ tok/s) and slowly degrades over time to a crawl (I was seeing 4 tokens/s before I stopped generation with Mistral Large). This makes it take a prohibitively long time to get outputs.

I used to set max_tokens very high but leave min_tokens low; the model usually gave short outputs but consistently maintained high tok/s.

I need to get outputs in a reasonable time. Setting n lower cripples my throughput, this does not appear to be a GPU memory issue, and lowering min/max tokens isn't an option (the outputs need to be very long). What configuration/settings changes can I make to optimize my inference environment for this workload?

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Collecting environment information...
Traceback (most recent call last):
  File "/home/lain/collect_env.py", line 743, in <module>
    main()
  File "/home/lain/collect_env.py", line 722, in main
    output = get_pretty_env_info()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lain/collect_env.py", line 717, in get_pretty_env_info
    return pretty_str(get_env_info())
                      ^^^^^^^^^^^^^^
  File "/home/lain/collect_env.py", line 549, in get_env_info
    vllm_version = get_vllm_version()
                   ^^^^^^^^^^^^^^^^^^
  File "/home/lain/collect_env.py", line 270, in get_vllm_version
    from vllm import version, version_tuple
ImportError: cannot import name 'version_tuple' from 'vllm' (/home/lain/micromamba/envs/vllm/lib/python3.11/site-packages/vllm/__init__.py)

(seems your script is broken)
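A quick way to read the installed version despite that import failure (a minimal sketch, assuming the installed package still exposes `__version__`):

```python
# Sidestep the collect_env.py failure: read the version attribute directly
# (assumption: this 0.6.3 install exposes vllm.__version__ even though
# version_tuple cannot be imported).
import vllm

print(vllm.__version__)
```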

vLLM 0.6.3

"python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Large-Instruct-2407 --tensor-parallel-size 4 --guided-decoding-backend lm-format-enforcer --enable-chunked-prefill --enable-prefix-caching "

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Hellisotherpeople added the performance label on Oct 15, 2024

sir3mat commented Oct 25, 2024

Have you tested the output with vllm 0.6.3 on longer inputs, ranging from 8k up to 100k tokens? I tested both versions 0.6.3 and 0.6.3.post1 using models like LLaMA 3 70B with a 128k context and LLaMA 3.2 128k, but both versions produce random tokens as output.

Interestingly, when I run the same tests with vllm 0.6.2, it works as expected.
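A rough sketch of that kind of long-context check with the offline vLLM API (the model name, context length, and prompt padding below are illustrative assumptions, not the commenter's exact script):

```python
# Illustrative long-context repro sketch: feed a prompt of tens of thousands
# of tokens and check whether the completion degenerates into random tokens.
# Model name and settings are assumptions, not the commenter's exact setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed 128k-context checkpoint
    tensor_parallel_size=4,
    max_model_len=131072,
)

# Crude padding to reach tens of thousands of prompt tokens.
long_prompt = "The quick brown fox jumps over the lazy dog. " * 4000

outputs = llm.generate([long_prompt], SamplingParams(temperature=0.0, max_tokens=256))
print(outputs[0].outputs[0].text)  # reportedly garbled on 0.6.3/0.6.3.post1, fine on 0.6.2
```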
