
[CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark #3889

Merged
merged 3 commits into vllm-project:main from robust_benchmark
Apr 6, 2024

Conversation

youkaichao
Member

Prior to this PR, we used the average latency across 3 iterations, which seems to be very unstable. Here is what I get across 5 runs of the benchmark:

0.4430512798329194
0.4450072580948472
0.45556532715757686
0.451823682213823
0.4521045722067356

The first step toward optimizing latency is to have a reliable benchmark. In this PR, I add more warmup and profile runs, and use the median, which is robust to outliers.

After this PR, here is what I get across 5 runs:

0.43797357752919197
0.4324369733221829
0.43426618119701743
0.4360716128721833
0.43663214799016714

The latency benchmark becomes more reliable and more stable.
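Roughly, the aggregation could look like the following sketch, assuming a `run_to_completion()` helper that executes one batch and returns its latency in seconds (names here are illustrative, not the exact script code):

```python
import numpy as np

def benchmark(run_to_completion, num_iters_warmup: int = 10, num_iters: int = 30):
    # Warmup iterations are not timed; they let CUDA graphs, caches, etc. settle.
    for _ in range(num_iters_warmup):
        run_to_completion()

    # Timed iterations: collect many samples so a single outlier cannot skew the result.
    latencies = np.array([run_to_completion() for _ in range(num_iters)])
    print(f"Median latency: {np.median(latencies)} seconds")
    return latencies
```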

@ywang96 ywang96 left a comment
Member

LGTM! Definitely agree that median is a more robust estimator of performance than average.

cc @WoosukKwon since I believe you worked on benchmark_latency.py

@cadedaniel
Collaborator

We might want to use p90, as it is stable and covers more of the user experience.

@ywang96
Member

ywang96 commented Apr 6, 2024

We might want to use p90, as it is stable and covers more of the user experience.

Good point, and I guess why not include all three? (like what we did in benchmark_serving.py)

Also, I thought the purpose of benchmark_latency.py is more to quickly check the model executor's performance after changes (since we only send a single batch of requests), so we can see whether there's an improvement or not.

@youkaichao
Member Author

@ywang96 @cadedaniel added more percentiles, and kept the old mean as well:

Avg latency: 0.4262307888207336 seconds
10% percentile latency: 0.4236012687906623 seconds
25% percentile latency: 0.4247457890305668 seconds
50% percentile latency: 0.42613378586247563 seconds
75% percentile latency: 0.42765198298729956 seconds
90% percentile latency: 0.42863788856193424 seconds
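Such a report can be produced from the per-iteration latencies with `np.percentile`; the following is only a sketch of the aggregation, not the exact benchmark_latency.py code:

```python
import numpy as np

def report_latencies(latencies):
    latencies = np.array(latencies)  # per-iteration latencies in seconds
    print(f"Avg latency: {np.mean(latencies)} seconds")
    for p in (10, 25, 50, 75, 90):
        print(f"{p}% percentile latency: {np.percentile(latencies, p)} seconds")
```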

@ywang96 ywang96 enabled auto-merge (squash) April 6, 2024 21:03
@youkaichao youkaichao changed the title from "[CI/Benchmark] add more iteration and use median for robust latency benchmark" to "[CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark" on Apr 6, 2024
@ywang96 ywang96 merged commit e4be7d7 into vllm-project:main Apr 6, 2024
35 checks passed
@youkaichao youkaichao deleted the robust_benchmark branch April 6, 2024 21:32
@simon-mo
Collaborator

simon-mo commented Apr 6, 2024

Can you do a follow-up using buildkite-agent artifact upload? Similar to the annotate step, but we should upload a JSON file with these results so we can visualize them over time.
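A minimal sketch of such a follow-up, assuming a hypothetical save_results() helper and an illustrative JSON layout: the script dumps the summary to a file, and the CI step then uploads it with `buildkite-agent artifact upload`.

```python
import json

import numpy as np

def save_results(latencies, path: str = "benchmark_latency_results.json"):
    # Hypothetical helper: persist the benchmark summary so CI can archive it,
    # e.g. via `buildkite-agent artifact upload benchmark_latency_results.json`.
    latencies = np.array(latencies)
    results = {
        "avg_latency": float(np.mean(latencies)),
        "percentiles": {
            str(p): float(np.percentile(latencies, p)) for p in (10, 25, 50, 75, 90)
        },
    }
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
```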

@youkaichao
Member Author

Can you do a follow-up using buildkite-agent artifact upload?

I'm not familiar with buildkite-agent. Can you give me some pointers?

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024