ci: bounded-retry M-series interop jobs #185
Merged
Adds a composite action that wraps the destroy → deploy → run →
destroy-on-exit lifecycle with 2-attempt retry on transient failure.
The composite always destroys the topology before each attempt so
stale state from a prior failed run cannot leak in, and always
destroys on exit regardless of outcome.
Successful retry emits a workflow ::warning:: annotation so CI flake
stays visible in the UI rather than silently passing.
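As a rough illustration of the lifecycle described above, the following bash sketch shows one way a bounded retry could be structured. The names `destroy_topology`, `deploy_topology`, `run_test`, and `run_with_retry` are hypothetical stand-ins rather than the composite's actual contents; the real steps presumably call the clab tooling and the job's test script.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins: the real composite would run the actual
# destroy/deploy commands and the job's test script here.
destroy_topology() { echo "destroy $1"; }
deploy_topology()  { echo "deploy $1"; }
run_test()         { "$@"; }

# run_with_retry JOB TOPOLOGY TEST_COMMAND [ARGS...]
run_with_retry() {
  local job="$1" topology="$2"; shift 2
  local max_attempts=2 attempt rc=1

  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    # Destroy before every attempt so stale state cannot leak in.
    destroy_topology "$topology" || true
    if deploy_topology "$topology" && run_test "$@"; then
      rc=0
      break
    fi
    echo "attempt ${attempt}/${max_attempts} of ${job} failed" >&2
  done

  # Destroy on exit regardless of outcome.
  destroy_topology "$topology" || true

  # Surface an absorbed flake instead of passing silently.
  if (( rc == 0 && attempt > 1 )); then
    echo "::warning::${job} passed after ${attempt} attempts (retry absorbed transient failure)"
  fi
  return "$rc"
}
```

This sketch destroys explicitly after the loop; a `trap ... EXIT` handler would be an alternative that also covers interruption by a signal.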
Migrated jobs:
- kernel-dataplane.yml: M36, M37, M37+IP, M38, M39, M40, M42, M43
- interop.yml: M1, M10, M13, M14, M15, M17, M22, M24, M25, M29,
  M30, M34, M35, M35b, M35c, M41
Each M-series job shrinks from ~12 lines of deploy/test/destroy YAML
to one `uses:` block.
Also widens M38's PE2-promotes-to-DF gate from 60s to 120s. DF
re-election depends on PE1's session hold-timer expiry, which can
slip past 60s when the runner is sharing CPU with release-time
concurrent workflows. The 120s gate has no cost in the success case
and absorbs the observed timing jitter that produced today's two
M38 PR flakes.
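To illustrate why a wider gate can be free in the success case, here is a minimal bash sketch of a polling gate, assuming the M38 script waits by re-running a check until a deadline. The helper `wait_until` and the check name `pe2_is_df` in the usage note are hypothetical and not taken from the actual test script.

```bash
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TIMEOUT_SECONDS have elapsed.
# Returns 0 as soon as COMMAND succeeds, 1 once the deadline has passed.
wait_until() {
  local timeout="$1" interval="$2"; shift 2
  # SECONDS is bash's built-in counter of seconds since shell start.
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

A call such as `wait_until 120 2 pe2_is_df` would return the moment the check passes, so the larger timeout only adds wall-clock time on runs that were going to fail the 60s gate anyway.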
Not retried (real regression signal): build-image, msrv, clippy,
doc, security-audit. Adding retry there would hide compile failures
behind transient noise.
Verification:
- python3 -c "import yaml; yaml.safe_load(...)" on all 3 yaml files
- bash -n on the modified M38 script
- Composite is invoked the same way in both workflows; topology
argument matches the existing destroy/deploy file paths exactly.
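The two static checks listed above could be wrapped as small helpers along these lines. This is only a sketch: `check_script` and `check_yaml` are hypothetical names, and `check_yaml` assumes `python3` with PyYAML is available on the machine running it.

```bash
#!/usr/bin/env bash
# Syntax-check a shell script without executing it.
check_script() {
  bash -n "$1"
}

# Parse a YAML file and fail if it is not well-formed.
# Assumes python3 with the PyYAML package is installed.
check_yaml() {
  python3 -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1]))' "$1"
}
```

Both checks are purely syntactic: `bash -n` does not run the script, and `yaml.safe_load` confirms the file parses as YAML without validating it against the GitHub Actions workflow schema.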
Follow-ups (separate issues to file):
- Periodic stale-clab sweep on the self-hosted runner via systemd
timer (prevents the multi-week accumulation we cleaned up today).
- Per-job retry-rate telemetry to identify creeping fragility.
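As one possible starting point for the retry-rate follow-up, the bash sketch below tallies absorbed retries per job from log text on stdin. It assumes the logs have already been collected by some other means and that each retry leaves a line containing "Mxx passed after N attempts", the message wording given in the PR description; `retry_counts` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Reads log text on stdin and prints "<count> <job>" for every job
# that logged a "passed after N attempts" message.
# Matching starts at the job name, so any prefix on the line is ignored.
retry_counts() {
  grep -oE '[A-Za-z0-9+]+ passed after [0-9]+ attempts' \
    | sed -e 's/ passed after .*$//' \
    | sort | uniq -c | awk '{ print $1, $2 }'
}
```

For example, `cat collected-logs/*.txt | retry_counts` (with `collected-logs/` being a made-up directory) would list how often each job needed its retry across the collected runs.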
This was referenced May 19, 2026
Summary
Adds a composite action that wraps the `destroy → deploy → run → destroy-on-exit` lifecycle for every M-series interop job with a 2-attempt retry on transient failure. A successful retry emits a workflow `::warning::` annotation so CI flake stays visible in the UI rather than silently hiding.
Today alone we had 4 M-series transient failures (M14, M38, M42, M36 ×2) that all passed on rerun: all symptoms of self-hosted runner cohabitation, tight 60s timing gates, and stale-clab accumulation rather than real regressions. This converts those single-shot transients from "PR blocker until I notice and `gh run rerun`" to "30s delay nobody sees, with a warning chip in the UI."

What's migrated
- `kernel-dataplane.yml` (self-hosted): M36, M37, M37+IP, M38, M39, M40, M42, M43
- `interop.yml` (hosted ubuntu-latest): M1, M10, M13, M14, M15, M17, M22, M24, M25, M29, M30, M34, M35, M35b, M35c, M41
Each migrated job drops from ~12 lines of YAML to a single `uses:` block.

What's NOT retried (intentionally)
- `build-image`, `msrv`, `cargo clippy`, `cargo doc`, `security audit`: failures here are real regressions. Adding retry would hide compile failures behind transient noise.
- `Detect TCP-AO kernel support` pre-check step: that's a one-shot probe, not a topology lifecycle.

Timing-gate bump
M38's "PE2 promotes to DF after PE1 shutdown" widened from 60s → 120s. DF re-election depends on BGP hold-timer expiry on PE2 after PE1's session dies, which slips past a tight 60s window under host contention. Widening costs nothing in the success case and absorbs the observed jitter.

Verification
- `python3 -c "import yaml; yaml.safe_load(...)"` on all three modified workflow files
- `bash -n tests/interop/scripts/test-m38-evpn-df-election.sh`

Failure-mode visibility
A successful retry emits `::warning::Mxx passed after 2 attempts (retry absorbed transient failure)`. If a job is genuinely flaky over time, the warning chip accumulates on every successful PR; the issue is visible, not hidden. Per-job retry-rate telemetry is filed as a follow-up.

Follow-ups (separate PRs)