
bench: revert custom bench run length on stable benchmarks #2665


Closed
wants to merge 2 commits

Conversation

mxinden
Member

@mxinden mxinden commented May 26, 2025

#2655 increased the runtime across all benchmarks, which made them more stable.

Looking at the results of a recent run, the longer runtime does not seem to be necessary for the smaller benchmarks. Thus this commit reverts the custom setting for some of them in order to save CI runtime.

https://github.com/mozilla/neqo/actions/runs/15251928213/job/42891415533


Minor optimization. Don't feel strongly about it.
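
For context, the run length in question is Criterion's per-group measurement time. A minimal sketch, assuming a group shaped like the decode benches (names illustrative), of the kind of override #2655 added and this PR removes for the stable groups:

```rust
use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion};

// Illustrative benchmark group; the bench body is a stand-in.
fn bench_decode(c: &mut Criterion) {
    let mut group = c.benchmark_group("decode");
    // The custom run length in question: removing this line reverts the
    // group to Criterion's default measurement time of 5 seconds.
    group.measurement_time(Duration::from_secs(60));
    group.bench_function("4096 bytes, mask ff", |b| b.iter(|| ()));
    group.finish();
}

criterion_group!(benches, bench_decode);
criterion_main!(benches);
```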

@mxinden mxinden changed the title bench: revert custom benc run length on stable benchmarks bench: revert custom bench run length on stable benchmarks May 26, 2025
@mxinden
Member Author

mxinden commented May 26, 2025

For the record, currently our cargo bench step takes > 1h:


https://github.com/mozilla/neqo/actions/runs/15251928213/job/42891415533


github-actions bot commented May 26, 2025

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to 070ff3a.

neqo-latest as client

neqo-latest as server

  • aioquic vs. neqo-latest: run cancelled after 20 min
  • go-x-net vs. neqo-latest: H 🚀DC ⚠️B L2 6 ⚠️CM
  • kwik vs. neqo-latest: run cancelled after 20 min
  • linuxquic vs. neqo-latest: run cancelled after 20 min
  • lsquic vs. neqo-latest: run cancelled after 20 min
  • msquic vs. neqo-latest: ⚠️L2 C2 6 V2 CM
  • mvfst vs. neqo-latest: Z 🚀B ⚠️3 A L1 C1 ⚠️6 CM
  • neqo vs. neqo-latest: ⚠️C20 M B A BA CM
  • ngtcp2 vs. neqo-latest: run cancelled after 20 min
  • openssl vs. neqo-latest: ⚠️H LR ⚠️C20 M 🚀3 B A 🚀L2 C2 BA CM
  • picoquic vs. neqo-latest: run cancelled after 20 min
  • quic-go vs. neqo-latest: ⚠️M S B L2 BP CM
  • quiche vs. neqo-latest: run cancelled after 20 min
  • quinn vs. neqo-latest: run cancelled after 20 min
  • s2n-quic vs. neqo-latest: run cancelled after 20 min
  • tquic vs. neqo-latest: run cancelled after 20 min
  • xquic vs. neqo-latest: run cancelled after 20 min
All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server


github-actions bot commented May 26, 2025

Benchmark results

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client:
       time:   [722.96 ms 743.85 ms 766.88 ms]
       thrpt:  [130.40 MiB/s 134.44 MiB/s 138.32 MiB/s]
Found 13 outliers among 100 measurements (13.00%)
  13 (13.00%) high severe
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client:
       time:   [313.44 ms 314.67 ms 315.88 ms]
       thrpt:  [31.658 Kelem/s 31.780 Kelem/s 31.904 Kelem/s]
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client:
       time:   [34.382 ms 35.594 ms 36.802 ms]
       thrpt:  [27.173  elem/s 28.095  elem/s 29.085  elem/s]
1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client:
       time:   [7.3236 s 7.3372 s 7.3509 s]
       thrpt:  [13.604 MiB/s 13.629 MiB/s 13.654 MiB/s]
decode 4096 bytes, mask ff:
       time:   [11.791 µs 11.819 µs 11.855 µs]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  7 (7.00%) high severe
decode 1048576 bytes, mask ff:
       time:   [3.0213 ms 3.0305 ms 3.0415 ms]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
decode 4096 bytes, mask 7f:
       time:   [19.937 µs 19.986 µs 20.042 µs]
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  12 (12.00%) high severe
decode 1048576 bytes, mask 7f:
       time:   [5.0463 ms 5.0575 ms 5.0705 ms]
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  13 (13.00%) high severe
decode 4096 bytes, mask 3f:
       time:   [8.2710 µs 8.3036 µs 8.3424 µs]
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe
decode 1048576 bytes, mask 3f:
       time:   [1.5900 ms 1.5969 ms 1.6052 ms]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe
1000 streams of 1 bytes/multistream:
       time:   [14.562 ns 14.606 ns 14.650 ns]
Found 3 outliers among 500 measurements (0.60%)
  2 (0.40%) high mild
  1 (0.20%) high severe
1000 streams of 1000 bytes/multistream:
       time:   [14.559 ns 14.611 ns 14.664 ns]
Found 16 outliers among 500 measurements (3.20%)
  3 (0.60%) low mild
  10 (2.00%) high mild
  3 (0.60%) high severe
coalesce_acked_from_zero 1+1 entries:
       time:   [88.398 ns 88.732 ns 89.062 ns]
Found 11 outliers among 100 measurements (11.00%)
  11 (11.00%) high mild
coalesce_acked_from_zero 3+1 entries:
       time:   [105.98 ns 106.33 ns 106.70 ns]
Found 11 outliers among 100 measurements (11.00%)
  11 (11.00%) high severe
coalesce_acked_from_zero 10+1 entries:
       time:   [105.49 ns 105.99 ns 106.55 ns]
Found 19 outliers among 100 measurements (19.00%)
  3 (3.00%) low severe
  5 (5.00%) low mild
  2 (2.00%) high mild
  9 (9.00%) high severe
coalesce_acked_from_zero 1000+1 entries:
       time:   [89.183 ns 89.326 ns 89.489 ns]
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe
RxStreamOrderer::inbound_frame():
       time:   [110.31 ms 110.36 ms 110.42 ms]
Found 21 outliers among 100 measurements (21.00%)
  1 (1.00%) low severe
  8 (8.00%) low mild
  8 (8.00%) high mild
  4 (4.00%) high severe
SentPackets::take_ranges:
       time:   [9.7520 µs 9.7921 µs 9.8301 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
transfer/pacing-false/varying-seeds:
       time:   [34.001 ms 34.036 ms 34.071 ms]
transfer/pacing-true/varying-seeds:
       time:   [35.071 ms 35.117 ms 35.163 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
transfer/pacing-false/same-seed:
       time:   [33.981 ms 34.008 ms 34.036 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
transfer/pacing-true/same-seed:
       time:   [35.870 ms 35.910 ms 35.949 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

Client/server transfer results

Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.

| Client vs. server (params) | Mean ± σ | Min | Max | MiB/s ± σ | Δ main (ms) | Δ main (%) |
| --- | --- | --- | --- | --- | --- | --- |
| google vs. google | 518.0 ± 27.0 | 498.0 | 694.7 | 61.8 ± 1.2 | -5.6 | -1.1% |
| google vs. neqo (cubic, paced) | 381.0 ± 22.8 | 360.8 | 467.6 | 84.0 ± 1.4 | 3.4 | 0.9% |
| msquic vs. msquic | 158.4 ± 23.9 | 132.8 | 243.1 | 202.0 ± 1.3 | -4.7 | -2.9% |
| msquic vs. neqo (cubic, paced) | 278.9 ± 18.1 | 260.6 | 393.0 | 114.8 ± 1.8 | -7.3 | -2.6% |
| neqo vs. google (cubic, paced) | 816.6 ± 29.0 | 795.8 | 1063.8 | 39.2 ± 1.1 | 5.0 | 0.6% |
| neqo vs. msquic (cubic, paced) | 204.3 ± 12.1 | 193.7 | 285.4 | 156.7 ± 2.6 | -4.6 | -2.2% |
| neqo vs. neqo (cubic) | 257.5 ± 44.7 | 231.3 | 457.7 | 124.3 ± 0.7 | 6.9 | 2.8% |
| neqo vs. neqo (cubic, paced) | 252.6 ± 36.9 | 227.3 | 496.1 | 126.7 ± 0.9 | 💔 7.9 | 3.2% |
| neqo vs. neqo (reno) | 250.7 ± 39.5 | 229.4 | 459.9 | 127.6 ± 0.8 | 4.7 | 1.9% |
| neqo vs. neqo (reno, paced) | 249.6 ± 35.1 | 231.7 | 490.0 | 128.2 ± 0.9 | 3.7 | 1.5% |
| neqo vs. quiche (cubic, paced) | 250.3 ± 32.0 | 226.2 | 453.1 | 127.8 ± 1.0 | -3.7 | -1.4% |
| neqo vs. s2n (cubic, paced) | 262.6 ± 36.0 | 242.2 | 485.0 | 121.9 ± 0.9 | -3.1 | -1.2% |
| quiche vs. neqo (cubic, paced) | 427.8 ± 52.3 | 378.9 | 731.7 | 74.8 ± 0.6 | 💔 14.9 | 3.6% |
| quiche vs. quiche | 210.1 ± 58.4 | 178.8 | 508.5 | 152.3 ± 0.5 | 5.7 | 2.8% |
| s2n vs. neqo (cubic, paced) | 308.2 ± 39.8 | 289.4 | 612.9 | 103.8 ± 0.8 | 6.6 | 2.2% |
| s2n vs. s2n | 258.8 ± 43.8 | 235.2 | 462.8 | 123.6 ± 0.7 | 0.2 | 0.1% |

⬇️ Download logs

Member

@martinthomson martinthomson left a comment


None of these need 60s of runtime, apart from the last one, which might need more than the default. That transfer test is highly variable by nature. Some of that might depend on how fast your machine is though.

```rust
}
criterion_group!(
    transfer,
    benchmark_transfer_variable,
```
Member


I think that the transfer ones need the extra time. Maybe less so for the one with a fixed seed, but that would mean splitting the group.
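
For illustration only, a sketch of what that split could look like using Criterion's per-group config. `benchmark_transfer_variable` is taken from the snippet above; `benchmark_transfer_fixed` is a hypothetical name for the fixed-seed benches:

```rust
use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion};

// Stub targets standing in for the real transfer benchmarks.
fn benchmark_transfer_variable(_c: &mut Criterion) { /* ... */ }
fn benchmark_transfer_fixed(_c: &mut Criterion) { /* ... */ }

// Variable-seed transfers keep the longer measurement time.
criterion_group! {
    name = transfer_variable;
    config = Criterion::default().measurement_time(Duration::from_secs(60));
    targets = benchmark_transfer_variable
}

// Fixed-seed transfers fall back to Criterion's 5 s default.
criterion_group! {
    name = transfer_fixed;
    config = Criterion::default();
    targets = benchmark_transfer_fixed
}

criterion_main!(transfer_variable, transfer_fixed);
```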

Member Author


👍 reverted the revert in 700d6e9.

Collaborator

@larseggert larseggert left a comment


So with this change, some of the benches appear to have become more flaky again. I'm hoping we'll have additional runners soon, which would make the long runtime less of a blocker for other PRs in flight.

@mxinden
Member Author

mxinden commented May 27, 2025

So with this change, some of the benches appear to have become more flaky again.

1000 streams of 1000 bytes/multistream: 💔 Performance has regressed.

   time:   [16.657 ns 16.708 ns 16.759 ns]
   change: [+3.3666% +3.8424% +4.2947%] (p = 0.00 < 0.05)

Found 6 outliers among 500 measurements (1.20%)
1 (0.20%) low mild
5 (1.00%) high mild

Looking at this one as an example: yes, it shows a regression. But the runtime across runs ([16.657 ns 16.708 ns 16.759 ns]) on this pull request is stable. The lowest value in the confidence interval (16.657 ns) and the highest (16.759 ns) are not far apart; the former is only about 0.6% below the latter, i.e. (16.759 − 16.657) / 16.759 ≈ 0.6%.

Thus running them for longer will not change the outcome. Does that make sense @larseggert?


Currently we compare the benchmark results with a cached version of main. That cached version might be from some time ago, e.g. yesterday.

I wonder whether we should run the benchmarks for both the current pull request and main on each Action execution. Thoughts @larseggert?

@larseggert
Collaborator

I wonder whether we should run the benchmarks for both the current pull request and main on each Action execution. Thoughts @larseggert?

The issue is that GitHub won't ever update a cache entry; it will only create a new one when none is present. We would need to figure out how to use the web API to delete a cache entry, or we would need to switch to storing bench results as artifacts somehow.
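
For reference, GitHub's REST API does allow deleting Actions cache entries by key, so one possible workaround is to evict the stale entry before saving a fresh baseline. A sketch using the gh CLI; the cache key name here is hypothetical:

```sh
# Evict the stale cache entry so the next run can save a fresh baseline.
# Requires a token with access to the repository's Actions caches.
gh api --method DELETE \
  "/repos/mozilla/neqo/actions/caches?key=bench-results-main"
```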

I also have https://bugzilla.mozilla.org/show_bug.cgi?id=1966839 open to investigate using https://bencher.dev/, which would replace much of our custom scripting around this.

@mxinden
Member Author

mxinden commented May 27, 2025

I wonder whether we should run the benchmarks for both the current pull request and main on each Action execution. Thoughts @larseggert?

The issue is that GitHub won't ever update a cache entry; it will only create a new one when none is present. We would need to figure out how to use the web API to delete a cache entry, or we would need to switch to storing bench results as artifacts somehow.

I am suggesting not to cache benchmark results at all, but to do something along the lines of the following (see the sketch after the list):

  1. git checkout $CURRENT_PR
  2. cargo bench -- --save-baseline current
  3. git checkout main
  4. cargo bench -- --save-baseline main
  5. compare current and main
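
As a concrete sketch of those steps (the branch variable and the comparison tool are assumptions; critcmp is one existing tool that diffs saved Criterion baselines):

```sh
# Benchmark the PR branch and main back to back, then diff the baselines.
git checkout "$CURRENT_PR"
cargo bench -- --save-baseline current
git checkout main
cargo bench -- --save-baseline main
# critcmp (https://github.com/BurntSushi/critcmp) compares two saved
# Criterion baselines by name; any equivalent comparison step would do.
critcmp main current
```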

@mxinden
Member Author

mxinden commented May 27, 2025

I also have https://bugzilla.mozilla.org/show_bug.cgi?id=1966839 open to investigate using https://bencher.dev/, which would replace much of our custom scripting around this.

I think it is worth exploring, thanks. Though this does not resolve the underlying issue, namely that we see a lot of unrelated noise in the benchmark runtimes, right?

@larseggert
Collaborator

Ah, got it now. Yes, let's try that approach.

Yes, bencher probably wouldn't help here. I was mostly interested in its ability to visualize the performance trajectory over time.

@mxinden
Member Author

mxinden commented Jun 2, 2025

Closing here since #2682 merged first.

@mxinden mxinden closed this Jun 2, 2025