
bench: revert custom bench run length on stable benchmarks #2665


Closed
wants to merge 2 commits

Conversation

mxinden
Member

@mxinden mxinden commented May 26, 2025

#2655 increased the runtime across all benchmarks, which made them more stable.

Looking at the results of a recent run, the longer runtime does not seem to be necessary for the smaller benchmarks. Thus this commit reverts the custom setting for some of them in order to save CI runtime.

https://github.com/mozilla/neqo/actions/runs/15251928213/job/42891415533


Minor optimization. Don't feel strongly about it.
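
For context, the run length in question is Criterion's per-group measurement time. A minimal sketch, assuming a group shaped like the decode benches (names illustrative), of the kind of override #2655 added and this PR removes for the stable groups:

```rust
use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion};

// Illustrative benchmark group; the bench body is a stand-in.
fn bench_decode(c: &mut Criterion) {
    let mut group = c.benchmark_group("decode");
    // The custom run length in question: removing this line reverts the
    // group to Criterion's default measurement time of 5 seconds.
    group.measurement_time(Duration::from_secs(60));
    group.bench_function("4096 bytes, mask ff", |b| b.iter(|| ()));
    group.finish();
}

criterion_group!(benches, bench_decode);
criterion_main!(benches);
```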

@mxinden mxinden changed the title bench: revert custom benc run length on stable benchmarks bench: revert custom bench run length on stable benchmarks May 26, 2025
@mxinden
Member Author

mxinden commented May 26, 2025

For the record, currently our cargo bench step takes > 1h:


https://github.com/mozilla/neqo/actions/runs/15251928213/job/42891415533


github-actions bot commented May 26, 2025

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to 070ff3a.

neqo-latest as client

neqo-latest as server

  • aioquic vs. neqo-latest: run cancelled after 20 min
  • go-x-net vs. neqo-latest: H 🚀DC ⚠️B L2 6 ⚠️CM
  • kwik vs. neqo-latest: run cancelled after 20 min
  • linuxquic vs. neqo-latest: run cancelled after 20 min
  • lsquic vs. neqo-latest: run cancelled after 20 min
  • msquic vs. neqo-latest: ⚠️L2 C2 6 V2 CM
  • mvfst vs. neqo-latest: Z 🚀B ⚠️3 A L1 C1 ⚠️6 CM
  • neqo vs. neqo-latest: ⚠️C20 M B A BA CM
  • ngtcp2 vs. neqo-latest: run cancelled after 20 min
  • openssl vs. neqo-latest: ⚠️H LR ⚠️C20 M 🚀3 B A 🚀L2 C2 BA CM
  • picoquic vs. neqo-latest: run cancelled after 20 min
  • quic-go vs. neqo-latest: ⚠️M S B L2 BP CM
  • quiche vs. neqo-latest: run cancelled after 20 min
  • quinn vs. neqo-latest: run cancelled after 20 min
  • s2n-quic vs. neqo-latest: run cancelled after 20 min
  • tquic vs. neqo-latest: run cancelled after 20 min
  • xquic vs. neqo-latest: run cancelled after 20 min
All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server


github-actions bot commented May 26, 2025

Benchmark results

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client:
       time:   [722.96 ms 743.85 ms 766.88 ms]
       thrpt:  [130.40 MiB/s 134.44 MiB/s 138.32 MiB/s]
Found 13 outliers among 100 measurements (13.00%)
  13 (13.00%) high severe
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client:
       time:   [313.44 ms 314.67 ms 315.88 ms]
       thrpt:  [31.658 Kelem/s 31.780 Kelem/s 31.904 Kelem/s]
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client:
       time:   [34.382 ms 35.594 ms 36.802 ms]
       thrpt:  [27.173  elem/s 28.095  elem/s 29.085  elem/s]
1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client:
       time:   [7.3236 s 7.3372 s 7.3509 s]
       thrpt:  [13.604 MiB/s 13.629 MiB/s 13.654 MiB/s]
decode 4096 bytes, mask ff:
       time:   [11.791 µs 11.819 µs 11.855 µs]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  7 (7.00%) high severe
decode 1048576 bytes, mask ff:
       time:   [3.0213 ms 3.0305 ms 3.0415 ms]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
decode 4096 bytes, mask 7f:
       time:   [19.937 µs 19.986 µs 20.042 µs]
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  12 (12.00%) high severe
decode 1048576 bytes, mask 7f:
       time:   [5.0463 ms 5.0575 ms 5.0705 ms]
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  13 (13.00%) high severe
decode 4096 bytes, mask 3f:
       time:   [8.2710 µs 8.3036 µs 8.3424 µs]
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe
decode 1048576 bytes, mask 3f:
       time:   [1.5900 ms 1.5969 ms 1.6052 ms]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe
1000 streams of 1 bytes/multistream:
       time:   [14.562 ns 14.606 ns 14.650 ns]
Found 3 outliers among 500 measurements (0.60%)
  2 (0.40%) high mild
  1 (0.20%) high severe
1000 streams of 1000 bytes/multistream:
       time:   [14.559 ns 14.611 ns 14.664 ns]
Found 16 outliers among 500 measurements (3.20%)
  3 (0.60%) low mild
  10 (2.00%) high mild
  3 (0.60%) high severe
coalesce_acked_from_zero 1+1 entries:
       time:   [88.398 ns 88.732 ns 89.062 ns]
Found 11 outliers among 100 measurements (11.00%)
  11 (11.00%) high mild
coalesce_acked_from_zero 3+1 entries:
       time:   [105.98 ns 106.33 ns 106.70 ns]
Found 11 outliers among 100 measurements (11.00%)
  11 (11.00%) high severe
coalesce_acked_from_zero 10+1 entries:
       time:   [105.49 ns 105.99 ns 106.55 ns]
Found 19 outliers among 100 measurements (19.00%)
  3 (3.00%) low severe
  5 (5.00%) low mild
  2 (2.00%) high mild
  9 (9.00%) high severe
coalesce_acked_from_zero 1000+1 entries:
       time:   [89.183 ns 89.326 ns 89.489 ns]
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe
RxStreamOrderer::inbound_frame():
       time:   [110.31 ms 110.36 ms 110.42 ms]
Found 21 outliers among 100 measurements (21.00%)
  1 (1.00%) low severe
  8 (8.00%) low mild
  8 (8.00%) high mild
  4 (4.00%) high severe
SentPackets::take_ranges:
       time:   [9.7520 µs 9.7921 µs 9.8301 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
transfer/pacing-false/varying-seeds:
       time:   [34.001 ms 34.036 ms 34.071 ms]
transfer/pacing-true/varying-seeds:
       time:   [35.071 ms 35.117 ms 35.163 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
transfer/pacing-false/same-seed:
       time:   [33.981 ms 34.008 ms 34.036 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
transfer/pacing-true/same-seed:
       time:   [35.870 ms 35.910 ms 35.949 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

Client/server transfer results

Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.

| Client vs. server (params) | Mean ± σ | Min | Max | MiB/s ± σ | Δ main (ms) | Δ main (%) |
| --- | --- | --- | --- | --- | --- | --- |
| google vs. google | 518.0 ± 27.0 | 498.0 | 694.7 | 61.8 ± 1.2 | -5.6 | -1.1% |
| google vs. neqo (cubic, paced) | 381.0 ± 22.8 | 360.8 | 467.6 | 84.0 ± 1.4 | 3.4 | 0.9% |
| msquic vs. msquic | 158.4 ± 23.9 | 132.8 | 243.1 | 202.0 ± 1.3 | -4.7 | -2.9% |
| msquic vs. neqo (cubic, paced) | 278.9 ± 18.1 | 260.6 | 393.0 | 114.8 ± 1.8 | -7.3 | -2.6% |
| neqo vs. google (cubic, paced) | 816.6 ± 29.0 | 795.8 | 1063.8 | 39.2 ± 1.1 | 5.0 | 0.6% |
| neqo vs. msquic (cubic, paced) | 204.3 ± 12.1 | 193.7 | 285.4 | 156.7 ± 2.6 | -4.6 | -2.2% |
| neqo vs. neqo (cubic) | 257.5 ± 44.7 | 231.3 | 457.7 | 124.3 ± 0.7 | 6.9 | 2.8% |
| neqo vs. neqo (cubic, paced) | 252.6 ± 36.9 | 227.3 | 496.1 | 126.7 ± 0.9 | 💔 7.9 | 3.2% |
| neqo vs. neqo (reno) | 250.7 ± 39.5 | 229.4 | 459.9 | 127.6 ± 0.8 | 4.7 | 1.9% |
| neqo vs. neqo (reno, paced) | 249.6 ± 35.1 | 231.7 | 490.0 | 128.2 ± 0.9 | 3.7 | 1.5% |
| neqo vs. quiche (cubic, paced) | 250.3 ± 32.0 | 226.2 | 453.1 | 127.8 ± 1.0 | -3.7 | -1.4% |
| neqo vs. s2n (cubic, paced) | 262.6 ± 36.0 | 242.2 | 485.0 | 121.9 ± 0.9 | -3.1 | -1.2% |
| quiche vs. neqo (cubic, paced) | 427.8 ± 52.3 | 378.9 | 731.7 | 74.8 ± 0.6 | 💔 14.9 | 3.6% |
| quiche vs. quiche | 210.1 ± 58.4 | 178.8 | 508.5 | 152.3 ± 0.5 | 5.7 | 2.8% |
| s2n vs. neqo (cubic, paced) | 308.2 ± 39.8 | 289.4 | 612.9 | 103.8 ± 0.8 | 6.6 | 2.2% |
| s2n vs. s2n | 258.8 ± 43.8 | 235.2 | 462.8 | 123.6 ± 0.7 | 0.2 | 0.1% |

⬇️ Download logs

Member

@martinthomson martinthomson left a comment


None of these need 60s of runtime, apart from the last one, which might need more than the default. That transfer test is highly variable by nature. Some of that might depend on how fast your machine is though.

```rust
}
criterion_group!(
    transfer,
    benchmark_transfer_variable,
```
Member


I think that the transfer ones need the extra time. Maybe less so for the one with a fixed seed, but that would mean splitting the group.
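
For illustration only, a sketch of what that split could look like using Criterion's per-group config. `benchmark_transfer_variable` is taken from the snippet above; `benchmark_transfer_fixed` is a hypothetical name for the fixed-seed benches:

```rust
use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion};

// Stub targets standing in for the real transfer benchmarks.
fn benchmark_transfer_variable(_c: &mut Criterion) { /* ... */ }
fn benchmark_transfer_fixed(_c: &mut Criterion) { /* ... */ }

// Variable-seed transfers keep the longer measurement time.
criterion_group! {
    name = transfer_variable;
    config = Criterion::default().measurement_time(Duration::from_secs(60));
    targets = benchmark_transfer_variable
}

// Fixed-seed transfers fall back to Criterion's 5 s default.
criterion_group! {
    name = transfer_fixed;
    config = Criterion::default();
    targets = benchmark_transfer_fixed
}

criterion_main!(transfer_variable, transfer_fixed);
```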

Member Author


👍 reverted the revert in 700d6e9.

Collaborator

@larseggert larseggert left a comment


So with this change, some of the benches appear to have become more flaky again. I'm hoping we'll have additional runners soon, which would make the long runtime less of a blocker for other PRs in flight.

@mxinden
Member Author

mxinden commented May 27, 2025

So with this change, some of the benches appear to have become more flaky again.

1000 streams of 1000 bytes/multistream: 💔 Performance has regressed.

   time:   [16.657 ns 16.708 ns 16.759 ns]
   change: [+3.3666% +3.8424% +4.2947%] (p = 0.00 < 0.05)

Found 6 outliers among 500 measurements (1.20%)
1 (0.20%) low mild
5 (1.00%) high mild

Looking at this one as an example: yes, it shows a regression. But the runtime across runs ([16.657 ns 16.708 ns 16.759 ns]) on this pull request is stable. The lowest value in the confidence interval (16.657 ns) and the highest (16.759 ns) are not far apart; the former is only about 0.6% below the latter, i.e. (16.759 − 16.657) / 16.759 ≈ 0.6%.

Thus running them for longer will not change the outcome. Does that make sense @larseggert?


Currently we compare the benchmark results with a cached version of main. That cached version might be from some time ago, e.g. yesterday.

I wonder whether we should run the benchmarks for both the current pull request and main on each Action execution. Thoughts @larseggert?

@larseggert
Collaborator

I wonder whether we should run the benchmarks for both the current pull request and main on each Action execution. Thoughts @larseggert?

The issue is that GitHub won't ever update a cache entry; it will only create a new one when none is present. We would need to figure out how to use the web API to delete a cache entry, or we would need to switch to storing bench results as artifacts somehow.
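
For reference, GitHub's REST API does allow deleting Actions cache entries by key, so one possible workaround is to evict the stale entry before saving a fresh baseline. A sketch using the gh CLI; the cache key name here is hypothetical:

```sh
# Evict the stale cache entry so the next run can save a fresh baseline.
# Requires a token with access to the repository's Actions caches.
gh api --method DELETE \
  "/repos/mozilla/neqo/actions/caches?key=bench-results-main"
```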

I also have https://bugzilla.mozilla.org/show_bug.cgi?id=1966839 open to investigate using https://bencher.dev/, which would replace much of our custom scripting around this.

@mxinden
Member Author

mxinden commented May 27, 2025

I wonder whether we should run the benchmarks for both the current pull request and main on each Action execution. Thoughts @larseggert?

The issue is that GitHub won't ever update a cache entry; it will only create a new one when none is present. We would need to figure out how to use the web API to delete a cache entry, or we would need to switch to storing bench results as artifacts somehow.

I am suggesting not to cache benchmark results at all, but to do something along the lines of the following (see the sketch after the list):

  1. git checkout $CURRENT_PR
  2. cargo bench -- --save-baseline current
  3. git checkout main
  4. cargo bench -- --save-baseline main
  5. compare current and main
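
As a concrete sketch of those steps (the branch variable and the comparison tool are assumptions; critcmp is one existing tool that diffs saved Criterion baselines):

```sh
# Benchmark the PR branch and main back to back, then diff the baselines.
git checkout "$CURRENT_PR"
cargo bench -- --save-baseline current
git checkout main
cargo bench -- --save-baseline main
# critcmp (https://github.com/BurntSushi/critcmp) compares two saved
# Criterion baselines by name; any equivalent comparison step would do.
critcmp main current
```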

@mxinden
Member Author

mxinden commented May 27, 2025

I also have https://bugzilla.mozilla.org/show_bug.cgi?id=1966839 open to investigate using https://bencher.dev/, which would replace much of our custom scripting around this.

I think it is worth exploring, thanks. Though this does not resolve the underlying issue, namely that we see a lot of unrelated noise in the benchmark runtimes, right?

@larseggert
Collaborator

Ah, got it now. Yes, let's try that approach.

Yes, bencher probably wouldn't help here. I was mostly interested in its ability to visualize the performance trajectory over time.

@mxinden
Member Author

mxinden commented Jun 2, 2025

Closing here since #2682 merged first.

@mxinden mxinden closed this Jun 2, 2025