[CSIT-1886] 3n: Wireguard tests with 100 and more tunnels are failing PDR criteria

### Description
error: Minimal rate loss ratio 1.0 does not reach target 0.005. Zero packets forwarded!
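For readers unfamiliar with the message, here is a minimal sketch of the kind of check behind it. This is not the CSIT implementation; the function names and the packet counts are hypothetical, and only the values 1.0 and 0.005 come from the error text.

```python
def loss_ratio(sent: int, received: int) -> float:
    """Return the fraction of sent packets that were not received back."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return (sent - received) / sent


def meets_pdr(sent: int, received: int, target: float = 0.005) -> bool:
    """Hypothetical PDR check: the loss ratio must not exceed the target."""
    return loss_ratio(sent, received) <= target


# "Zero packets forwarded" means a loss ratio of 1.0, far above 0.005.
print(loss_ratio(1000, 0), meets_pdr(1000, 0))
```

A loss ratio of 1.0 at the minimal rate means the DUT forwarded nothing during that trial, so the search cannot find any rate that satisfies the target.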

rca:

test: wireguard 100 tunnels and more

frequency: high

testbed: 3n-icx, 3n-snr

examples:

<a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-3n-icx/23/log.html.gz#s1-s1-s1-s1-s1-t2" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-3n-icx/23/log.html.gz#s1-s1-s1-s1-s1-t2</a>


<a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2210-3n-icx/51/log.html.gz#s1-s1-s1-s1-s3-t6" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2210-3n-icx/51/log.html.gz#s1-s1-s1-s1-s3-t6</a>


<a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2210-3n-icx/52/log.html.gz#s1-s1-s1-s1-s3-t6" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2210-3n-icx/52/log.html.gz#s1-s1-s1-s1-s3-t6</a>


 


<a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-snr/2/log.html.gz#s1-s1-s1-s3-s2-t1" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-snr/2/log.html.gz#s1-s1-s1-s3-s2-t1</a>


<a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-icx/2/log.html.gz#s1-s1-s1-s3-s8-t1" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-icx/2/log.html.gz#s1-s1-s1-s3-s8-t1</a>


<a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-icx/1/log.html.gz#s1-s1-s1-s3-s8-t3" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-icx/1/log.html.gz#s1-s1-s1-s3-s8-t3</a>


 


<a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/95/log.html.gz#s1-s1-s1-s3-s2-t6" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/95/log.html.gz#s1-s1-s1-s3-s2-t6</a>


 


 


NOTE: The failures are sporadic, and most of the failed tests are IMIX (64B fails very rarely, 1518B more often, IMIX most often).

### Assignee
Unassigned

### Reporter
Viliam Luc

### Comments
- **vrpolak (Fri, 15 Nov 2024 09:40:38 +0000)**: Still occasionally present [9] in rls2410. As before, the DUT recovers before the teardown trial, so specific verify runs are needed to see losses in stats.


[9] <a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2410-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t4-k2-k10-k14" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2410-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t4-k2-k10-k14</a>
- **vrpolak (Mon, 29 Jul 2024 11:33:03 +0000)**: On 3nb-spr, I notice this is more frequent in hw tests than in sw tests. For example, in the 1tnl testcase, the symptom [8] (Peer error from wg4-input, after a few good MRR trials) points to this bug; maybe some hardware-related cause is increasing the frequency.

If confirmed by more runs, I may need to open a new ticket specifically for wireguard1tnlhwasync.

[8] <a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3nb-spr/57/log.html.gz#s1-s1-s1-s3-s9-t2-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3nb-spr/57/log.html.gz#s1-s1-s1-s3-s9-t2-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1</a>
- **vrpolak (Thu, 25 Jul 2024 13:26:22 +0000)**: Still present [7] in rls2406.


[7] <a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3n-alt/46/log.html.gz#s1-s1-s1-s3-s3-t1-k2-k10-k14" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3n-alt/46/log.html.gz#s1-s1-s1-s3-s3-t1-k2-k10-k14</a>
- **vrpolak (Mon, 17 Jun 2024 12:15:49 +0000)**: Maybe less frequent than before, but failures still happen: [6].


[6] <a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3n-icx/66/log.html.gz#s1-s1-s1-s3-s8-t2-k2-k10-k14" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3n-icx/66/log.html.gz#s1-s1-s1-s3-s8-t2-k2-k10-k14</a>
- **vrpolak (Fri, 24 Nov 2023 13:00:01 +0000)**: After some unrelated fixes, it is clear that with 9000B tests the VPP process crashes, or at least its PID vanishes, although no core file is generated. This is similar to other large-scale test failures; maybe VPP is running out of memory.
- **vrpolak (Tue, 8 Aug 2023 13:15:49 +0000)**: I took a deeper look, here are my thoughts.


First, evidence from logs.

Typically, the one-second trial in teardown shows [0] no packet loss.

But the accompanying stats [1] (very long, as this is a scale test) are interesting in several ways.

Hardware counters for interfaces on dut-dut link show more packets than on tg-dut links.

And there is some unexpected activity on vpp_main (wg4-input) and worker (wg4-handshake-handoff).

When I set up a long, extremely low-load trial in a verify job, the packet trace [2] confirmed that both observations are caused by additional handshakes.

On the other hand, when I reduced persistent keepalive to 1 second, the test passed once [5].

Note that in either case "show node counters verbose" does not report any wireguard-specific counters; it only shows ip4-udp-lookup "No error" related to the handshake messages.

Second, why are there rekeys?

Simply put, the corresponding time constants are hardcoded in VPP [3], causing rekeys every few minutes.

At the largest scale, it may take more than 4 minutes just to configure [4] all peers, and even if the corresponding CSIT code is sped up, the ndrpdr search may also take several minutes to finish.
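The timing argument can be made concrete with a rough sketch. It assumes a rekey interval of 120 seconds, which is the REKEY_AFTER_TIME value in the WireGuard protocol description; the actual value used by VPP should be checked in the header linked as [3]. The 5-minute search duration is a hypothetical figure, and the model (one rekey per elapsed interval per peer while traffic or keepalives flow) is a simplification.

```python
# Assumption: rekey interval taken from the WireGuard protocol description,
# not verified against the VPP header referenced as [3].
REKEY_AFTER_TIME_S = 120


def rekeys_per_peer(elapsed_s: float, rekey_after_s: float = REKEY_AFTER_TIME_S) -> int:
    """Estimate how many rekeys one peer goes through in elapsed_s seconds."""
    return int(elapsed_s // rekey_after_s)


setup_s = 4 * 60   # "more than 4 minutes" to configure all peers
search_s = 5 * 60  # hypothetical duration of the ndrpdr search

# Peers configured first have been up for setup_s before the search starts.
print(rekeys_per_peer(setup_s))             # rekeys before the search begins
print(rekeys_per_peer(setup_s + search_s))  # rekeys by the end of the search
```

Under these assumptions, the earliest peers have already rekeyed before the first trial starts, and every peer rekeys at least once during the search, so some trials inevitably overlap with a rekey wave.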

Third, why are there packets lost during search (even at low load)?

I had several hypotheses about that, but all of them have turned out to be false so far, so I currently do not know.

Fourth, should anything be changed in CSIT code?

From a design perspective, CSIT wants to test VPP configurations that are expected to yield stable performance, so that performance anomaly detection remains sensitive to small regressions. Having waves of rekeys during the ndrpdr search goes against that, especially when the performance impact of such a wave in one trial depends on packet loss in previous trials.

Ideally, VPP would have APIs to override the default time-related parameters, so CSIT can use values large enough to prevent any rekeys during search.

The current tests are still useful in showing that rekey waves can cause packet loss in subsequent (low-load) trials, which is perhaps not intended (a VPP bug). Maybe CSIT can have two kinds of wireguard tests, one for traffic without rekeys and one with rekeys.

[0] <a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t1-k3-k7-k1-k1-k1-k8-k13-k1-k2" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t1-k3-k7-k1-k1-k1-k8-k13-k1-k2</a>

[1] <a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t1-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t1-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1</a>

[2] <a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-alt/75/log.html.gz#s1-s1-s1-s1-s1-t1-k3-k7-k1-k1-k1-k8-k14-k2-k1-k1-k1-k1" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-alt/75/log.html.gz#s1-s1-s1-s1-s1-t1-k3-k7-k1-k1-k1-k8-k14-k2-k1-k1-k1-k1</a>

[3] <a href="https://github.com/FDio/vpp/blob/f441b5d0ed8ff9d87412c1640dfec93e9cba03bd/src/plugins/wireguard/wireguard_noise.h#L45" class="external-link" target="_blank" rel="nofollow noopener">https://github.com/FDio/vpp/blob/f441b5d0ed8ff9d87412c1640dfec93e9cba03bd/src/plugins/wireguard/wireguard_noise.h#L45</a>

[4] <a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t1-k2-k9" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t1-k2-k9</a>

[5] <a href="https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-alt/68/log.html.gz#s1-s1-s1-s1-s1-t1" class="external-link" target="_blank" rel="nofollow noopener">https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-alt/68/log.html.gz#s1-s1-s1-s1-s1-t1</a>

Original issue: https://jira.fd.io/browse/CSIT-1886
