Reduce node k8st test runtime by fixing unnecessary waiting #12024
Open
caseydavenport wants to merge 17 commits into projectcalico:master from
Conversation
Contributor
Pull request overview
This PR reduces runtime of the node k8st test suite (primarily on kind clusters) by eliminating redundant waits/retries and by moving expensive per-test setup work to per-class setup, while also reducing unnecessary diagnostics collection.
Changes:
- Make `DiagsCollector` collect diagnostics only on test failure (skip on success).
- Rework several BGP-related tests to reuse external BGP node/container setup across test methods (`setUpClass`/`tearDownClass`) and remove redundant curl retry loops.
- Reduce excessive polling intervals/timeouts in common k8st helpers (pod readiness waits, deletion confirmation polling, DaemonSet rollout polling).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `node/tests/k8st/utils/utils.py` | Skip diags collection on success; cap and speed up external node readiness polling. |
| `node/tests/k8st/tests/test_simple.py` | Remove an extra fixed sleep during BIRD restart testing. |
| `node/tests/k8st/tests/test_local_bgp_peer.py` | Relax route regex matching to avoid over-specific peer naming assumptions. |
| `node/tests/k8st/tests/test_bgp_filter.py` | Move external BGP node setup/peering to class-level setup; shorten route-check retry windows. |
| `node/tests/k8st/tests/test_bgp_advert_v6.py` | Share external node setup across tests and remove redundant curl retry loops. |
| `node/tests/k8st/tests/test_bgp_advert.py` | Share external node setup across tests and remove redundant curl retry loops / redundant waits. |
| `node/tests/k8st/test_base.py` | Reduce waits/polling for deletion confirmation, DS rollout, and pod readiness. |
The retry loop in start_external_node_with_bgp() slept 20 seconds between attempts to check if a container was ready, with no upper bound on retries. Reduce to 2s between attempts with a cap of 30 retries (60s total), and add a proper failure if the container never starts, instead of hanging indefinitely.
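The bounded-polling pattern described above can be sketched as follows; `wait_for_container_ready` and its `is_ready` callback are hypothetical stand-ins for the readiness check inside `start_external_node_with_bgp()`, not the actual helper:

```python
import time

def wait_for_container_ready(is_ready, retries=30, wait_time=2):
    """Poll is_ready() up to `retries` times, sleeping `wait_time` seconds
    between attempts. Fails loudly instead of hanging forever."""
    for _ in range(retries):
        if is_ready():
            return
        time.sleep(wait_time)
    raise RuntimeError(
        "container never became ready after %ds" % (retries * wait_time))
```

With the new parameters the worst case is 60s and a clear error, versus an unbounded hang at 20s per attempt before.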
delete_and_confirm used retries=10, wait_time=10 (100s max), but kubectl delete typically completes quickly. Switch to retries=30, wait_time=2 (60s max) so we detect completion sooner while still keeping a generous overall timeout. Called 19 times across the test suite, so the reduced polling interval saves significant time in the common case where deletion finishes within a few seconds.
The DaemonSet rollout polling loop slept 10s before its first check and between subsequent checks, with no timeout. Switch to checking immediately, polling every 3s, and failing after 120s instead of hanging indefinitely.
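The check-immediately-then-poll-with-deadline loop can be sketched like this; `wait_for_rollout` and `check_rolled_out` are illustrative names, not the real function:

```python
import time

def wait_for_rollout(check_rolled_out, timeout=120, interval=3):
    """Check immediately, then poll every `interval` seconds until
    check_rolled_out() returns True or `timeout` seconds elapse."""
    deadline = time.time() + timeout
    while True:
        if check_rolled_out():
            return
        if time.time() >= deadline:
            raise AssertionError(
                "rollout did not complete within %ds" % timeout)
        time.sleep(interval)
```

Checking before the first sleep means a rollout that is already complete costs ~0s instead of the old unconditional 10s.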
5 minutes is far too generous for pods on a local kind cluster. 60 seconds is plenty, and means tests fail faster when a pod will never become ready rather than blocking for 5 minutes per pod.
The IPv6 curl retry in test_cluster_ip_advertisement used retries=200 with wait_time=5, meaning a single curl attempt could block for up to 1000 seconds (16+ minutes). Inside a loop of 10 attempts, the theoretical max was 10,000 seconds. This was likely bumped as a workaround for v6 flakiness. Switch to the default retry parameters (retries=90, wait_time=1) to match the IPv4 equivalent.
The route check helpers in test_bgp_filter.py used wait_time=3 with the default retries=90, giving a max timeout of 270s (4.5 minutes) per route check. Routes on a local kind cluster should appear within 30s. Set retries=10 (30s max) which is still generous while cutting the worst-case timeout by 9x. These helpers are called many times per test so the cumulative savings are significant.
DiagsCollector was unconditionally collecting diagnostics on every test exit, including on success. This involves kubectl get across all namespaces, calicoctl get for 5+ resource types, docker exec ip r on all nodes, kubectl logs for calico-node, and reading 6 confd template files per node. With ~30 test methods, this burns 15-30 minutes on successful runs for no benefit since the diags are only useful on failure.
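One way to express collect-only-on-failure is a context manager whose `__exit__` only acts when an exception is propagating. This is a hedged sketch, not the real `DiagsCollector` implementation; the `collect` callable stands in for the expensive kubectl/calicoctl/docker-exec dump:

```python
class DiagsCollector(object):
    """Gathers cluster diagnostics only when the wrapped block raises,
    i.e. only on test failure."""

    def __init__(self, collect=lambda: None):
        self.collect = collect  # placeholder for the diags-gathering routine

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        if exc_type is not None:
            # Only pay for the expensive dump when the test failed.
            self.collect()
        return False  # never swallow the failure
```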
The 9 test_bgp_filter tests all share the same external BGP node infrastructure. Previously, setUp/tearDown created and destroyed two external Docker containers (v4 + v6) for every single test method — that's 18 container lifecycles involving Docker run, apk install, and BIRD config. Moving to setUpClass/tearDownClass creates them once and tears them down once, saving ~16 unnecessary container setups.
The tests were curling services 10 times in a loop, with each iteration using retry_until_success (up to 90 retries × 1s). A single retry_until_success call already provides robust validation that the service is reachable — repeating it 10 times just burns time.
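For reference, a simplified approximation of what `retry_until_success` does (the real helper lives in the k8st utils; the names and defaults here are assumptions). Because a single call already retries internally, wrapping it in `for i in range(10)` only multiplies the worst case tenfold:

```python
import time

def retry_until_success(fn, retries=90, wait_time=1):
    """Call fn until it stops raising, retrying up to `retries` times
    with `wait_time` seconds between attempts."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last failure
            time.sleep(wait_time)
```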
The external node (kube-node-extra) and BGPPeer are created and destroyed for every test method. TestBGPAdvert has 7 test methods and TestBGPAdvertRR has 2 — that's 9 container lifecycles involving docker run, apk install, and BIRD config, plus 9 teardowns. Same for the v6 equivalents. Moving to setUpClass/tearDownClass creates the external node once per class. Each test still applies its own BGPConfiguration and cleans it up in tearDown, so the per-test isolation is preserved. Also removes a redundant wait_for_deployment(cluster_svc) call in test_node_exclusion — only local_svc was scaled, cluster_svc hadn't changed.
retry_until_success already confirmed BIRD is running. The extra 5s sleep ran 3 times (once per iteration), adding 15s for no reason.
The test hardcoded Node_172_18_0_ in the BIRD protocol name regex, but kind clusters can use different subnets (e.g., 172.19.x.x). The actual node IP is already matched via %s — the protocol name just needs to start with Node_.
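The relaxation can be illustrated with two patterns; these are simplified, not the exact regexes from the test:

```python
import re

# Over-specific: hardcodes the 172.18.0.0/16 subnet kind usually uses.
OLD_PATTERN = r"Node_172_18_0_"
# Relaxed: only require the BIRD protocol name to start with Node_;
# the actual node IP is interpolated separately in the real assertion.
NEW_PATTERN = r"Node_"

def matches(pattern, protocol_name):
    """True if the BIRD protocol name matches the given prefix pattern."""
    return re.match(pattern, protocol_name) is not None
```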
8 iterations of delete-and-wait-for-calico-node is excessive for verifying no route churn during graceful restart. 3 iterations is plenty to prove the behavior, and saves ~180s (each iteration takes ~36s for calico-node to restart and become ready).
The IPIP/VXLAN spoof tests were creating a namespace, two pods, and tearing everything down (including a calico-node restart) for each test method. Moving to setUpClass/tearDownClass shares the infrastructure across both tests, eliminating one redundant namespace lifecycle and calico-node restart cycle (~85s saved).
300 services is overkill for verifying BGP advertisement at scale on a kind cluster. 50 still exercises the bulk-create-and-verify code path. Saves ~90s across v4 and v6 tests.
The BGP filter tests created and deleted a namespace per test but never deployed anything into it. Removing the unnecessary namespace lifecycle saves ~4s per test (9 tests, ~34s total).
The rollout polling was hardcoded to calico-node/calico-system instead of using the ds/ns parameters passed to the function.
The k8st tests have accumulated a bunch of places where they wait way longer than necessary, do redundant work on every test method, or just have overly generous timeouts for a kind cluster. The cumulative effect is pretty rough — the full suite was taking ~28 minutes when it really doesn't need to. This gets it down to ~11 minutes.
Description
Changes roughly grouped by category:
### Move expensive setup to `setUpClass`

Several test classes were creating and destroying Docker containers with external BIRD instances on every single test method. Moving that to `setUpClass`/`tearDownClass` does it once per class instead:

- `test_bgp_filter.py`: 9 tests were each creating two external Docker containers (v4 + v6) with `docker run` + `apk install` + BIRD config — 18 container lifecycles total, now 2.
- `test_bgp_advert.py`/`test_bgp_advert_v6.py`: Same pattern — external node + BGPPeer + Secret creation moved to class-level setup. Also removed a redundant `wait_for_deployment` call in `test_node_exclusion` that was waiting on a deployment that hadn't been scaled.
- `test_ipip_spoofing.py`: Namespace + pod creation shared across both spoof tests, eliminating one redundant namespace lifecycle and calico-node restart cycle.

### Fix `DiagsCollector` to only collect on failure

`DiagsCollector` was unconditionally collecting full cluster diagnostics after every test — `kubectl get` across all namespaces, `calicoctl get` for 5+ resource types, `docker exec` on all nodes, `kubectl logs` for calico-node, and 6 confd template files per node. Now it only fires on test failure.

### Remove redundant retry loops

- Several tests used `for i in range(10)` loops wrapping `retry_until_success(curl, ...)` — but `retry_until_success` already retries up to 90 times. One call is enough.
- One IPv6 curl used `retries=200, wait_time=5`, which could block for 16+ minutes on a single attempt.

### Reduce excessive timeouts and polling intervals

- `wait_ready`/`wait_not_ready`: 300s → 60s (pods on kind don't need 5 minutes)
- `delete_and_confirm`: 10s polling → 2s polling
- `update_ds_env`: poll every 3s with a 120s timeout instead of 10s polling with no timeout

### Reduce test scope where it's overkill

- `test_graceful_restart`: 8 iterations of delete-and-wait-for-calico-node → 3. Each iteration takes ~36s (calico-node startup time), and 3 is plenty to verify no route churn during graceful restart.
- `test_many_services`: 300 services → 50. Still exercises the bulk-create-and-verify path without spending a minute creating services.
- `test_methodology`: removed an unconditional `time.sleep(5)` after a `retry_until_success` that already confirmed BIRD was running.

### Remove unused namespace from BGP filter tests

The 9 `test_bgp_filter` tests were each creating and deleting a namespace that nothing was ever deployed into. Pure overhead — ~4s per test for no reason.

### Fix pre-existing `TestLocalBGPPeerRR` regex bug

The regex in `assertRegex` was hardcoding `Node_172_18_0_`, but kind clusters can use different subnets (e.g., 172.19.x.x). Changed to match `Node_` followed by anything.