Proposal to improve performance
In stress testing, version 0.6.3 performs worse than version 0.6.2.

Scenario: Agent
Stress-test workload: 50 input tokens, 20 output tokens.
Version 0.6.3: 22 QPS, with a sharp drop once the maximum is reached.
Version 0.6.2: 24 QPS, with performance still above 90% at 36 QPS; the decline after the maximum is gradual.
Under identical conditions, the prefill time in 0.6.3 is about 13 ms longer per instance than in 0.6.2. The main cause is a pause of 200-300 microseconds between two blocks; see the Nsight Systems traces linked below.
Data collection conditions: batch_size=20, offline use of LLMEngine, 50 iterations.
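A minimal sketch of how such an offline LLMEngine run could look is below. The model name, prompt text, and engine arguments are assumptions (the report only specifies batch_size=20, roughly 50 input / 20 output tokens, and 50 iterations); with default scheduling, the first engine step after enqueuing a batch is normally the prefill of the whole batch, so its duration can be compared between 0.6.2 and 0.6.3.

```python
# Hypothetical reproduction sketch; model, prompt, and engine args are
# placeholders, not taken from the report.
import time

from vllm import EngineArgs, LLMEngine, SamplingParams

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: the report does not name the model
BATCH_SIZE = 20   # stated in the report
ITERATIONS = 50   # stated in the report

engine = LLMEngine.from_engine_args(EngineArgs(model=MODEL))

# Roughly 50 input tokens; the exact prompts used for the measurements are unknown.
prompt = "The quick brown fox jumps over the lazy dog. " * 5
params = SamplingParams(max_tokens=20, ignore_eos=True)

for it in range(ITERATIONS):
    # Enqueue one batch of requests, then step the engine until all finish.
    for i in range(BATCH_SIZE):
        engine.add_request(f"{it}-{i}", prompt, params)

    step_times = []
    while engine.has_unfinished_requests():
        t0 = time.perf_counter()
        engine.step()
        step_times.append(time.perf_counter() - t0)

    # With default scheduling the first step is typically the prefill of the
    # whole batch; the remaining steps are decode.
    decode = step_times[1:] or [0.0]
    print(f"iter {it}: prefill {step_times[0] * 1e3:.2f} ms, "
          f"mean decode step {sum(decode) / len(decode) * 1e3:.2f} ms")
```

Running the same script under `nsys profile` against both versions should produce traces comparable to the ones linked below.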
Nsight Systems traces for the two versions:
https://github.com/skylee-01/experimental_data/blob/main/nsys_vllm_062.nsys-rep
https://github.com/skylee-01/experimental_data/blob/main/nsys_vllm_063.nsys-rep
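To quantify the reported 200-300 microsecond pauses, the inter-kernel gaps can be extracted from the traces above after exporting them to SQLite (e.g. `nsys export --type sqlite nsys_vllm_063.nsys-rep`). The sketch below is only a rough example under assumptions: the table and column names (`CUPTI_ACTIVITY_KIND_KERNEL`, `start`, `end`, `streamId`, timestamps in nanoseconds) match what recent nsys versions emit but can differ, so adjust to the actual schema of the export.

```python
# Hypothetical gap analysis over an Nsight Systems trace exported to SQLite.
# Table/column names are assumptions and may differ between nsys versions.
import sqlite3
import sys

GAP_LOW_US, GAP_HIGH_US = 200, 300  # the pause range reported above

con = sqlite3.connect(sys.argv[1])  # path to the exported .sqlite file
rows = con.execute(
    'SELECT streamId, start, "end" FROM CUPTI_ACTIVITY_KIND_KERNEL '
    "ORDER BY streamId, start"
).fetchall()

# Gap between consecutive kernels on the same CUDA stream, in microseconds.
gaps = [
    (start2 - end1) / 1e3
    for (stream1, _, end1), (stream2, start2, _) in zip(rows, rows[1:])
    if stream1 == stream2
]

suspect = [g for g in gaps if GAP_LOW_US <= g <= GAP_HIGH_US]
print(f"{len(gaps)} inter-kernel gaps, {len(suspect)} in the "
      f"{GAP_LOW_US}-{GAP_HIGH_US} us range, "
      f"max gap {max(gaps, default=0.0):.1f} us")
```

Running this over both exported traces would show whether the 200-300 us gaps appear only in the 0.6.3 trace.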
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.