[CI/Benchmark] add more iterations and use multiple percentiles for robust latency benchmark #3889
Conversation
LGTM! Definitely agree that median is a more robust estimator of performance than average.
cc @WoosukKwon since I believe you worked on benchmark_latency.py
We might want to use p90 as it is stable and covers more of the user experience.
Good point, and I guess why not include all three? (like what we did in …) Also, I thought the purpose for …
@ywang96 @cadedaniel added more percentiles, and kept the old mean as well:
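(For illustration only: a minimal sketch of what reporting the mean alongside several percentiles can look like, assuming per-run latencies are collected into an array. The values and variable names below are hypothetical, not the PR's actual code or output.)

```python
import numpy as np

# Hypothetical per-run latencies in seconds (placeholder values).
latencies = np.array([3.03, 2.99, 3.01, 3.00, 3.12])

# Keep the old mean, and report several percentiles alongside it:
# p50 (the median) is robust to outliers, while p90/p99 capture tail latency.
print(f"Avg latency: {np.mean(latencies):.4f} s")
for p in (10, 25, 50, 75, 90, 99):
    print(f"P{p} latency: {np.percentile(latencies, p):.4f} s")
```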
Can you do a follow-up using `buildkite-agent artifact upload`, similar to the annotate step? We should upload a JSON file with these results so we can visualize them over time.
I'm not familiar with buildkite-agent. Can you give me some pointers?
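(An illustrative sketch of one way this could work, not a verified pipeline config: dump the numbers to JSON from Python, then upload the file as a build artifact with the buildkite-agent CLI. The file name and result schema here are made up.)

```python
import json

# Hypothetical results dict; the schema is illustrative, not the PR's.
results = {
    "avg_latency_s": 3.01,
    "percentiles_s": {"p50": 3.00, "p90": 3.10, "p99": 3.12},
}

# Write the results next to the benchmark output.
with open("benchmark_results.json", "w") as f:
    json.dump(results, f, indent=2)

# Then, in the pipeline step that ran the benchmark:
#   buildkite-agent artifact upload benchmark_results.json
# Artifacts are stored per build, so a later job (or an external script)
# can fetch them with `buildkite-agent artifact download` and plot the
# trend across builds.
```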
Prior to this PR, we used the average latency across 3 runs, which seems to be very unstable. Here is what I get across 5 runs:
The first step to optimizing latency is to have a reliable benchmark. In this PR, I add more warmup runs and profile runs, and use the median, which is robust to outliers.
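(A minimal sketch of the warmup-then-measure pattern described above; `run_fn`, the iteration counts, and the timing code are illustrative, not the actual benchmark_latency.py changes.)

```python
import time

import numpy as np


def benchmark(run_fn, num_warmup: int = 3, num_iters: int = 10) -> float:
    # Warmup/profile runs: absorb one-time costs (compilation, memory
    # allocation, cache warming) so they don't skew the measurements.
    for _ in range(num_warmup):
        run_fn()

    # Measured runs: one latency sample per iteration.
    latencies = []
    for _ in range(num_iters):
        start = time.perf_counter()
        run_fn()
        latencies.append(time.perf_counter() - start)

    # Median instead of mean: a single straggler run barely moves it.
    return float(np.median(latencies))
```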
After this PR, here is what I get across 5 runs:
The latency benchmark becomes more reliable and more stable.