[CSIT-1886] 3n: Wireguard tests with 100 and more tunnels are failing PDR criteria #3968

@vvalderrv

Description

error: Minimal rate loss ratio 1.0 does not reach target 0.005. Zero packets forwarded!

rca:

test: wireguard 100 tunnels and more

frequency: high

testbed: 3n-icx, 3n-snr

examples:

https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-3n-icx/23/log.html.gz#s1-s1-s1-s1-s1-t2

https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2210-3n-icx/51/log.html.gz#s1-s1-s1-s1-s3-t6

https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2210-3n-icx/52/log.html.gz#s1-s1-s1-s1-s3-t6

 

https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-snr/2/log.html.gz#s1-s1-s1-s3-s2-t1

https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-icx/2/log.html.gz#s1-s1-s1-s3-s8-t1

https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-icx/1/log.html.gz#s1-s1-s1-s3-s8-t3

 

https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/95/log.html.gz#s1-s1-s1-s3-s2-t6

NOTE: The failures are sporadic, and most of the failed tests are IMIX (64B fails very rarely, 1518B more often, IMIX most often).
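For readers unfamiliar with the error message, the check it reports can be sketched roughly as below. This is only an illustration, not the CSIT implementation: the function names are hypothetical, and only the target value 0.005 and the observed loss ratio 1.0 come from the error text above.

```python
# Hypothetical helper names; only the 0.005 target comes from the error message.
PDR_TARGET_LOSS_RATIO = 0.005


def loss_ratio(sent: int, received: int) -> float:
    """Return the fraction of sent packets that did not come back."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return (sent - received) / sent


def meets_pdr(sent: int, received: int,
              target: float = PDR_TARGET_LOSS_RATIO) -> bool:
    """True if the trial's loss ratio does not exceed the PDR target."""
    return loss_ratio(sent, received) <= target


# "Zero packets forwarded" means received == 0, so the loss ratio is 1.0
# and the trial fails even at the minimal rate.
print(loss_ratio(1000, 0), meets_pdr(1000, 0))
```

A loss ratio of 1.0 at the minimal rate means the search has no lower rate left to try, which is why the whole test fails instead of reporting a low PDR value.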

Assignee

Unassigned

Reporter

Viliam Luc

Comments

  • vrpolak (Fri, 15 Nov 2024 09:40:38 +0000): Still occasionally present [9] in rls2410. As before, the DUT recovers before the teardown trial, so specific verify runs are needed to see the losses in stats.

[9] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2410-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t4-k2-k10-k14

  • vrpolak (Mon, 29 Jul 2024 11:33:03 +0000):

    On 3nb-spr, I notice this is more frequent in hw tests than in sw tests. For example, in the 1tnl testcase, the symptom [8] (Peer error from wg4-input, after a few good MRR trials) points to this bug; maybe there is some hardware-related cause increasing the frequency.

If confirmed by more runs, I may need to open a new ticket specifically for wireguard1tnlhwasync.

[8] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3nb-spr/57/log.html.gz#s1-s1-s1-s3-s9-t2-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1

  • vrpolak (Thu, 25 Jul 2024 13:26:22 +0000): Still present [7] in rls2406.

[7] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3n-alt/46/log.html.gz#s1-s1-s1-s3-s3-t1-k2-k10-k14

  • vrpolak (Mon, 17 Jun 2024 12:15:49 +0000): Maybe less frequent than before, but failures still happen: [6].

[6] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3n-icx/66/log.html.gz#s1-s1-s1-s3-s8-t2-k2-k10-k14

  • vrpolak (Fri, 24 Nov 2023 13:00:01 +0000): After some unrelated fixes, it is clear that in 9000B tests the VPP process crashes (at least its PID vanishes), although no core file is generated. This is similar to other large-scale test failures; maybe VPP is running out of memory.
  • vrpolak (Tue, 8 Aug 2023 13:15:49 +0000): I took a deeper look, here are my thoughts.

First, evidence from logs.

Typically, the one-second trial in teardown shows [0] no packet loss.

But the accompanying stats output [1] (very long, as this is a scale test) is interesting in several ways.

Hardware counters for interfaces on the dut-dut link show more packets than those on the tg-dut links.

And there is some unexpected activity on vpp_main (wg4-input) and on a worker (wg4-handshake-handoff).

When I set up a long, extremely low-load trial in a verify job, the packet trace [2] confirmed that both observations are caused by additional handshakes.

On the other hand, when I reduced the persistent keepalive to 1 second, the test passed once [5].

Note that in either case "show node counters verbose" does not report any wireguard-specific counters; it only shows ip4-udp-lookup "No error" entries related to the handshake messages.

Second, why are there rekeys?

Simply put, the corresponding time constants are hardcoded in VPP [3], causing rekeys every few minutes.

At the largest scale, it may take more than 4 minutes just to configure [4] all peers, and even if the corresponding CSIT code is sped up, the ndrpdr search may also take several minutes to finish.
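To make the timing argument concrete, here is a rough back-of-the-envelope sketch. It assumes the WireGuard protocol default of 120 seconds for REKEY_AFTER_TIME (check the linked header [3] for the value VPP actually uses) and a simplified model in which each peer rekeys once per interval after it is configured. The 300-second search duration and the helper name are hypothetical, used only for illustration.

```python
# Assumption: WireGuard protocol default; the value in [3] may differ.
REKEY_AFTER_TIME_S = 120


def rekeys_before_end(elapsed_s: float,
                      rekey_after_s: float = REKEY_AFTER_TIME_S) -> int:
    """Estimate how many rekeys one peer goes through in elapsed_s seconds.

    Simplified model: one rekey per full rekey_after_s interval
    since the peer's first handshake.
    """
    return int(elapsed_s // rekey_after_s)


config_s = 240  # "more than 4 minutes" to configure all peers
search_s = 300  # hypothetical ndrpdr search duration

# The first configured peer has been up for the whole configuration phase
# plus the search; the last configured peer only for the search itself.
first_peer = rekeys_before_end(config_s + search_s)
last_peer = rekeys_before_end(search_s)
print(first_peer, last_peer)
```

Even under this crude model, every peer rekeys at least once during the search, so some trials inevitably overlap with a wave of handshakes.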

Third, why are there packets lost during search (even at low load)?

I had several hypotheses about that, but all of them have turned out to be false so far, so I currently do not know.

Fourth, should anything be changed in CSIT code?

From a design perspective, CSIT wants to test VPP configurations that are expected to yield stable performance, so that performance anomaly detection remains sensitive to small regressions. Having waves of rekeys during the ndrpdr search goes against that, especially when the performance impact of such a wave in one trial depends on packet loss in previous trials.

Ideally, VPP would have APIs to override the default time-related parameters, so CSIT can use values large enough to prevent any rekeys during search.

The current tests are still useful in pointing out that rekey waves can cause packet loss in subsequent (low-load) trials, which is perhaps not intended (a VPP bug). Maybe CSIT can have two kinds of wireguard tests: one for traffic without rekeys and one with rekeys.

[0] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t1-k3-k7-k1-k1-k1-k8-k13-k1-k2

[1] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t1-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1

[2] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-alt/75/log.html.gz#s1-s1-s1-s1-s1-t1-k3-k7-k1-k1-k1-k8-k14-k2-k1-k1-k1-k1

[3] https://github.com/FDio/vpp/blob/f441b5d0ed8ff9d87412c1640dfec93e9cba03bd/src/plugins/wireguard/wireguard_noise.h#L45

[4] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-alt/23/log.html.gz#s1-s1-s1-s3-s3-t1-k2-k9

[5] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-alt/68/log.html.gz#s1-s1-s1-s1-s1-t1

Original issue: https://jira.fd.io/browse/CSIT-1886
