bench: revert custom bench run length on stable benchmarks #2665
Conversation
mozilla#2655 increased the benchmark runtime across all benchmarks, which made them more stable. Looking at the results of a recent run, the longer runtime does not seem necessary for the smaller benchmarks, so this commit reverts the custom setting for some of them in order to save CI runtime.

https://github.com/mozilla/neqo/actions/runs/15251928213/job/42891415533
For the record, currently our https://github.com/mozilla/neqo/actions/runs/15251928213/job/42891415533
Failed Interop Tests (QUIC Interop Runner, client vs. server; differences relative to 070ff3a):
- neqo-latest as client
- neqo-latest as server

Succeeded Interop Tests (QUIC Interop Runner, client vs. server):
- neqo-latest as client
- neqo-latest as server

Unsupported Interop Tests (QUIC Interop Runner, client vs. server):
- neqo-latest as client
- neqo-latest as server
Benchmark results:

- 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: time [722.96 ms 743.85 ms 766.88 ms], thrpt [130.40 MiB/s 134.44 MiB/s 138.32 MiB/s]; 13 outliers among 100 measurements (13 high severe)
- 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: time [313.44 ms 314.67 ms 315.88 ms], thrpt [31.658 Kelem/s 31.780 Kelem/s 31.904 Kelem/s]
- 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: time [34.382 ms 35.594 ms 36.802 ms], thrpt [27.173 elem/s 28.095 elem/s 29.085 elem/s]
- 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: time [7.3236 s 7.3372 s 7.3509 s], thrpt [13.604 MiB/s 13.629 MiB/s 13.654 MiB/s]
- decode 4096 bytes, mask ff: time [11.791 µs 11.819 µs 11.855 µs]; 12 outliers among 100 measurements (2 low severe, 3 low mild, 7 high severe)
- decode 1048576 bytes, mask ff: time [3.0213 ms 3.0305 ms 3.0415 ms]; 9 outliers among 100 measurements (1 high mild, 8 high severe)
- decode 4096 bytes, mask 7f: time [19.937 µs 19.986 µs 20.042 µs]; 14 outliers among 100 measurements (1 low severe, 1 low mild, 12 high severe)
- decode 1048576 bytes, mask 7f: time [5.0463 ms 5.0575 ms 5.0705 ms]; 14 outliers among 100 measurements (1 low mild, 13 high severe)
- decode 4096 bytes, mask 3f: time [8.2710 µs 8.3036 µs 8.3424 µs]; 13 outliers among 100 measurements (3 low mild, 3 high mild, 7 high severe)
- decode 1048576 bytes, mask 3f: time [1.5900 ms 1.5969 ms 1.6052 ms]; 8 outliers among 100 measurements (8 high severe)
- 1000 streams of 1 bytes/multistream: time [14.562 ns 14.606 ns 14.650 ns]; 3 outliers among 500 measurements (2 high mild, 1 high severe)
- 1000 streams of 1000 bytes/multistream: time [14.559 ns 14.611 ns 14.664 ns]; 16 outliers among 500 measurements (3 low mild, 10 high mild, 3 high severe)
- coalesce_acked_from_zero 1+1 entries: time [88.398 ns 88.732 ns 89.062 ns]; 11 outliers among 100 measurements (11 high mild)
- coalesce_acked_from_zero 3+1 entries: time [105.98 ns 106.33 ns 106.70 ns]; 11 outliers among 100 measurements (11 high severe)
- coalesce_acked_from_zero 10+1 entries: time [105.49 ns 105.99 ns 106.55 ns]; 19 outliers among 100 measurements (3 low severe, 5 low mild, 2 high mild, 9 high severe)
- coalesce_acked_from_zero 1000+1 entries: time [89.183 ns 89.326 ns 89.489 ns]; 12 outliers among 100 measurements (8 high mild, 4 high severe)
- RxStreamOrderer::inbound_frame(): time [110.31 ms 110.36 ms 110.42 ms]; 21 outliers among 100 measurements (1 low severe, 8 low mild, 8 high mild, 4 high severe)
- SentPackets::take_ranges: time [9.7520 µs 9.7921 µs 9.8301 µs]; 4 outliers among 100 measurements (3 low severe, 1 low mild)
- transfer/pacing-false/varying-seeds: time [34.001 ms 34.036 ms 34.071 ms]
- transfer/pacing-true/varying-seeds: time [35.071 ms 35.117 ms 35.163 ms]; 1 outlier among 100 measurements (1 high mild)
- transfer/pacing-false/same-seed: time [33.981 ms 34.008 ms 34.036 ms]; 2 outliers among 100 measurements (2 high mild)
- transfer/pacing-true/same-seed: time [35.870 ms 35.910 ms 35.949 ms]; 1 outlier among 100 measurements (1 low mild)

Client/server transfer results: transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.
None of these need 60 s of runtime, apart from the last one, which might need more than the default. That transfer test is highly variable by nature, though some of that might depend on how fast your machine is.
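For context, Criterion lets a single group opt into a longer measurement window while everything else keeps the 5-second default. This is a minimal sketch of that pattern, not the actual neqo benchmark code; the group and bench names and the 60 s figure are illustrative, taken from the numbers discussed above:

```rust
use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical benchmark: only this highly variable group gets an
// extended measurement window; other groups keep Criterion's default.
fn transfer_bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("transfer");
    group.measurement_time(Duration::from_secs(60)); // longer window for noisy benches
    group.bench_function("pacing-true/varying-seeds", |b| {
        b.iter(|| {
            // actual transfer simulation elided
        });
    });
    group.finish();
}

criterion_group!(benches, transfer_bench);
criterion_main!(benches);
```

Reverting the override for a group is then just deleting the `measurement_time` call, which is effectively what this PR does.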
neqo-transport/benches/transfer.rs (outdated):

    }
    criterion_group!(
        transfer,
        benchmark_transfer_variable,
I think that the transfer ones need the extra time. Maybe less so for the one with a fixed seed, but that would mean splitting the group.
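Splitting the group as suggested could look roughly like this with Criterion's `name`/`config`/`targets` form of `criterion_group!`. `benchmark_transfer_variable` appears in the diff above; `benchmark_transfer_fixed` is a hypothetical name for the fixed-seed benches, and the 60 s window is an assumption based on the discussion:

```rust
use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion};

fn benchmark_transfer_variable(c: &mut Criterion) {
    let _ = c; // varying-seed transfer benches elided
}

fn benchmark_transfer_fixed(c: &mut Criterion) {
    let _ = c; // fixed-seed transfer benches elided
}

// Varying-seed runs keep the long measurement window; fixed-seed runs
// fall back to Criterion's default, saving CI time.
criterion_group!(
    name = transfer_variable;
    config = Criterion::default().measurement_time(Duration::from_secs(60));
    targets = benchmark_transfer_variable
);
criterion_group!(
    name = transfer_fixed;
    config = Criterion::default();
    targets = benchmark_transfer_fixed
);
criterion_main!(transfer_variable, transfer_fixed);
```

The trade-off is exactly the one named here: per-config granularity in exchange for maintaining two groups instead of one.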
👍 reverted the revert in 700d6e9.
So with this change, some of the benches appear to have become more flaky again. I'm hoping we'll have additional runners soon, which means the long runtime is less of a blocker for other PRs in flight.
Looking at this one as an example: yes, it shows a regression. But the runtime is consistent across runs, so running the benchmarks for longer will not change the outcome. Does that make sense @larseggert?

Currently we compare the benchmark results with a cached version of `main`. I wonder whether we should run the benchmarks for both the current pull request and `main`.
The issue is that GitHub won't ever update a cache entry; it will only create a new one when none is present. We would need to figure out how to use the web API to delete a cache entry, or we would need to switch to storing bench results as artifacts somehow. I also have https://bugzilla.mozilla.org/show_bug.cgi?id=1966839 open to investigate using https://bencher.dev/, which would replace much of our custom scripting around this.
I am suggesting not to cache benchmark results at all, but to do something along the lines of:
I think this is worth exploring, thanks. Though it does not resolve the underlying issue, namely that we see a lot of unrelated noise in the benchmark runtimes, right?
Ah, got it now. Yes, let's try that approach. Bencher probably wouldn't help here; I was mostly interested in its ability to visualize the performance trajectory over time.
Closing here since #2682 merged first.
Minor optimization. Don't feel strongly about it.