
Reduce node k8st test runtime by fixing unnecessary waiting #12024

Open
caseydavenport wants to merge 17 commits into projectcalico:master from caseydavenport:caseydavenport/node-test-perf

Conversation

@caseydavenport
Member

@caseydavenport caseydavenport commented Mar 7, 2026

The k8st tests have accumulated a bunch of places where they wait way longer than necessary, do redundant work on every test method, or just have overly generous timeouts for a kind cluster. The cumulative effect is pretty rough — the full suite was taking ~28 minutes when it really doesn't need to. This gets it down to ~11 minutes.

Description

Changes roughly grouped by category:

Move expensive setup to setUpClass

Several test classes were creating and destroying Docker containers with external BIRD instances on every single test method. Moving that to setUpClass/tearDownClass does it once per class instead:

  • test_bgp_filter.py: 9 tests were each creating two external Docker containers (v4 + v6) with docker run + apk install + BIRD config — 18 container lifecycles total, now 2.
  • test_bgp_advert.py / test_bgp_advert_v6.py: Same pattern — external node + BGPPeer + Secret creation moved to class-level setup. Also removed a redundant wait_for_deployment call in test_node_exclusion that was waiting on a deployment that hadn't been scaled.
  • test_ipip_spoofing.py: Namespace + pod creation shared across both spoof tests, eliminating one redundant namespace lifecycle and calico-node restart cycle.

Fix DiagsCollector to only collect on failure

DiagsCollector was unconditionally collecting full cluster diagnostics after every test — kubectl get across all namespaces, calicoctl get for 5+ resource types, docker exec on all nodes, kubectl logs for calico-node, and 6 confd template files per node. Now it only fires on test failure.
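
A sketch of the failure-only behavior, assuming DiagsCollector is used as a context manager around the test body (the real helper shells out to kubectl and calicoctl; collect_diags here is a placeholder):

```python
class DiagsCollector:
    """Sketch: collect cluster diagnostics only when the wrapped test fails."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # __exit__ receives the exception info, so the expensive diags dump
        # can fire only when the test body actually raised.
        if exc_type is not None:
            collect_diags()
        return False  # never swallow the test failure


COLLECTED = []


def collect_diags():
    # Placeholder for the real kubectl get / calicoctl get / docker exec dump.
    COLLECTED.append("diags")
```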

Remove redundant retry loops

  • BGP advert tests had for i in range(10) loops wrapping retry_until_success(curl, ...) — but retry_until_success already retries up to 90 times. One call is enough.
  • The v6 curl path was using retries=200, wait_time=5, which could block for 16+ minutes on a single retry_until_success call.
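
To see why the outer loop was redundant, here is a sketch of the retry_until_success semantics (signature and defaults approximated from the description above, not copied from the k8st utils):

```python
import time


def retry_until_success(fun, retries=90, wait_time=1):
    # Keep calling fun() until it stops raising, up to `retries` retries
    # spaced `wait_time` seconds apart; re-raise the last error on exhaustion.
    for attempt in range(retries + 1):
        try:
            return fun()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(wait_time)


# Before: an outer loop multiplied the budget for no extra assurance —
#   for i in range(10):
#       retry_until_success(curl, retries=200, wait_time=5)
# After: one call with the defaults is already robust —
#   retry_until_success(curl)
```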

Reduce excessive timeouts and polling intervals

  • wait_ready/wait_not_ready: 300s → 60s (pods on kind don't need 5 minutes)
  • delete_and_confirm: 10s polling → 2s polling
  • update_ds_env: poll every 3s with 120s timeout instead of 10s polling with no timeout
  • External node container startup: 20s sleep per attempt → 2s, capped at 30 retries (60s total)
  • BGP filter route checks: 270s max → 30s max
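
The update_ds_env change can be sketched as a check-first, bounded polling loop (wait_for_rollout and rollout_complete are hypothetical names standing in for the real kubectl rollout check):

```python
import time


def wait_for_rollout(rollout_complete, interval=3, timeout=120):
    # Check immediately, then poll every `interval` seconds, and fail hard
    # after `timeout` seconds instead of hanging indefinitely.
    deadline = time.monotonic() + timeout
    while True:
        if rollout_complete():  # first check happens with no initial sleep
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "DaemonSet rollout did not complete within %ss" % timeout
            )
        time.sleep(interval)
```

Compared to the old shape (sleep 10s, then check, repeat forever), this detects fast completions within seconds and converts a wedged rollout into a clear failure.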

Reduce test scope where overkill

  • test_graceful_restart: 8 iterations of delete-and-wait-for-calico-node → 3. Each iteration takes ~36s (calico-node startup time), and 3 is plenty to verify no route churn during graceful restart.
  • test_many_services: 300 services → 50. Still exercises the bulk-create-and-verify path without spending a minute creating services.
  • test_methodology: removed an unconditional time.sleep(5) after a retry_until_success that already confirmed BIRD was running.

Remove unused namespace from BGP filter tests

The 9 test_bgp_filter tests were each creating and deleting a namespace that nothing was ever deployed into. Pure overhead — ~4s per test for no reason.

Fix pre-existing TestLocalBGPPeerRR regex bug

The regex in assertRegex was hardcoding Node_172_18_0_ but kind clusters can use different subnets (e.g., 172.19.x.x). Changed to match Node_ followed by anything.
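
A small illustration of the regex change (the BIRD protocol line below is fabricated for the example):

```python
import re

# Illustrative BIRD protocol status line for a node on a 172.19.x.x network.
line = "Node_172_19_0_3 BGP      master   up     10:01:02    Established"

# Before: the docker subnet was baked into the pattern, so kind clusters on
# any other subnet never matched.
assert re.search(r"Node_172_18_0_\d+", line) is None

# After: match any Node_-prefixed protocol name; the actual node IP is
# asserted separately (interpolated via %s in the real test).
assert re.search(r"Node_\S+", line)
```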

@caseydavenport caseydavenport added the docs-not-required (Docs not required for this change) and release-note-not-required (Change has no user-facing impact) labels Mar 7, 2026
@marvin-tigera marvin-tigera added this to the Calico v3.32.0 milestone Mar 7, 2026
@caseydavenport caseydavenport marked this pull request as ready for review March 7, 2026 01:47
@caseydavenport caseydavenport requested a review from a team as a code owner March 7, 2026 01:47
Copilot AI review requested due to automatic review settings March 7, 2026 01:47
Contributor

Copilot AI left a comment


Pull request overview

This PR reduces runtime of the node k8st test suite (primarily on kind clusters) by eliminating redundant waits/retries and by moving expensive per-test setup work to per-class setup, while also reducing unnecessary diagnostics collection.

Changes:

  • Make DiagsCollector collect diagnostics only on test failure (skip on success).
  • Rework several BGP-related tests to reuse external BGP node/container setup across test methods (setUpClass/tearDownClass) and remove redundant curl retry loops.
  • Reduce excessive polling intervals/timeouts in common k8st helpers (pod readiness waits, deletion confirmation polling, DaemonSet rollout polling).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Summary per file:

  • node/tests/k8st/utils/utils.py — Skip diags collection on success; cap and speed up external node readiness polling.
  • node/tests/k8st/tests/test_simple.py — Remove an extra fixed sleep during BIRD restart testing.
  • node/tests/k8st/tests/test_local_bgp_peer.py — Relax route regex matching to avoid over-specific peer naming assumptions.
  • node/tests/k8st/tests/test_bgp_filter.py — Move external BGP node setup/peering to class-level setup; shorten route-check retry windows.
  • node/tests/k8st/tests/test_bgp_advert_v6.py — Share external node setup across tests and remove redundant curl retry loops.
  • node/tests/k8st/tests/test_bgp_advert.py — Share external node setup across tests and remove redundant curl retry loops / redundant waits.
  • node/tests/k8st/test_base.py — Reduce waits/polling for deletion confirmation, DS rollout, and pod readiness.

Commits

The retry loop in start_external_node_with_bgp() slept 20 seconds between attempts to check if a container was ready, with no upper bound on retries. Reduce to 2s between attempts with a cap of 30 retries (60s total). Also adds a proper failure if the container never starts, instead of hanging indefinitely.

delete_and_confirm used retries=10, wait_time=10 (100s max), but kubectl delete typically completes quickly. Switch to retries=30, wait_time=2 (60s max) so we detect completion sooner while keeping a comparable overall timeout budget. Called 19 times across the test suite, so the reduced polling interval saves significant time in the common case where deletion finishes within a few seconds.

The DaemonSet rollout polling loop slept 10s before its first check and between subsequent checks, with no timeout. Switch to checking immediately, polling every 3s, and failing after 120s instead of hanging indefinitely.

5 minutes is far too generous for pods on a local kind cluster. 60 seconds is plenty, and means tests fail faster when a pod will never become ready rather than blocking for 5 minutes per pod.

The IPv6 curl retry in test_cluster_ip_advertisement used retries=200 with wait_time=5, meaning a single retry_until_success call could block for up to 1000 seconds (16+ minutes). Inside a loop of 10 attempts, the theoretical max was 10,000 seconds. This was likely bumped as a workaround for v6 flakiness. Switch to the default retry parameters (retries=90, wait_time=1) to match the IPv4 equivalent.

The route check helpers in test_bgp_filter.py used wait_time=3 with the default retries=90, giving a max timeout of 270s (4.5 minutes) per route check. Routes on a local kind cluster should appear within 30s. Set retries=10 (30s max), which is still generous while cutting the worst-case timeout by 9x. These helpers are called many times per test, so the cumulative savings are significant.

DiagsCollector was unconditionally collecting diagnostics on every test exit, including on success. This involves kubectl get across all namespaces, calicoctl get for 5+ resource types, docker exec ip r on all nodes, kubectl logs for calico-node, and reading 6 confd template files per node. With ~30 test methods, this burns 15-30 minutes on successful runs for no benefit, since the diags are only useful on failure.

The 9 test_bgp_filter tests all share the same external BGP node infrastructure. Previously, setUp/tearDown created and destroyed two external Docker containers (v4 + v6) for every single test method — that's 18 container lifecycles involving docker run, apk install, and BIRD config. Moving to setUpClass/tearDownClass creates them once and tears them down once, saving ~16 unnecessary container setups.

The tests were curling services 10 times in a loop, with each iteration using retry_until_success (up to 90 retries × 1s). A single retry_until_success call already provides robust validation that the service is reachable — repeating it 10 times just burns time.

The external node (kube-node-extra) and BGPPeer are created and destroyed for every test method. TestBGPAdvert has 7 test methods and TestBGPAdvertRR has 2 — that's 9 container lifecycles involving docker run, apk install, and BIRD config, plus 9 teardowns. Same for the v6 equivalents. Moving to setUpClass/tearDownClass creates the external node once per class.

Each test still applies its own BGPConfiguration and cleans it up in tearDown, so the per-test isolation is preserved.

Also removes a redundant wait_for_deployment(cluster_svc) call in test_node_exclusion — only local_svc was scaled, cluster_svc hadn't changed.

retry_until_success already confirmed BIRD is running. The extra 5s sleep ran 3 times (once per iteration), adding 15s for no reason.

The test hardcoded Node_172_18_0_ in the BIRD protocol name regex, but kind clusters can use different subnets (e.g., 172.19.x.x). The actual node IP is already matched via %s — the protocol name just needs to start with Node_.

8 iterations of delete-and-wait-for-calico-node is excessive for verifying no route churn during graceful restart. 3 iterations is plenty to prove the behavior, and saves ~180s (each iteration takes ~36s for calico-node to restart and become ready).

The IPIP/VXLAN spoof tests were creating a namespace, two pods, and tearing everything down (including a calico-node restart) for each test method. Moving to setUpClass/tearDownClass shares the infrastructure across both tests, eliminating one redundant namespace lifecycle and calico-node restart cycle (~85s saved).

300 services is overkill for verifying BGP advertisement at scale on a kind cluster. 50 still exercises the bulk-create-and-verify code path. Saves ~90s across v4 and v6 tests.

The BGP filter tests created and deleted a namespace per test but never deployed anything into it. Removing the unnecessary namespace lifecycle saves ~4s per test (9 tests, ~34s total).

The rollout polling was hardcoded to calico-node/calico-system instead of using the ds/ns parameters passed to the function.
@caseydavenport caseydavenport force-pushed the caseydavenport/node-test-perf branch from 4f2b190 to 0bae1cc Compare March 12, 2026 00:12