ci: bounded-retry M-series interop jobs #185

Merged
lance0 merged 1 commit into main from chore/ci-interop-retry
May 19, 2026

lance0 (Owner) commented May 19, 2026

Summary

Adds a composite action that wraps the destroy → deploy → run → destroy-on-exit lifecycle for every M-series interop job, with a 2-attempt retry on transient failure. A successful retry emits a workflow `::warning::` annotation so CI flake stays visible in the UI rather than passing silently.
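The lifecycle described above can be sketched in bash roughly as follows. This is a minimal sketch under stated assumptions, not the composite action's actual code: `run_with_retry`, `destroy_topology`, and `deploy_topology` are hypothetical names, and the two topology helpers are stubs that only echo where a real destroy or deploy step would run.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the bounded-retry lifecycle. All function names are
# illustrative and are not taken from the real composite action.

# Stubs standing in for the real topology destroy/deploy steps.
destroy_topology() { echo "destroy $1"; }
deploy_topology()  { echo "deploy $1"; }

# run_with_retry JOB TOPOLOGY CMD...
# Runs CMD against a freshly deployed topology, retrying once on failure.
run_with_retry() {
  local job="$1" topology="$2"
  shift 2
  local max_attempts=2 attempt rc=1

  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    # Destroy before every attempt so stale state cannot leak in.
    destroy_topology "$topology"
    if deploy_topology "$topology" && "$@"; then
      rc=0
      break
    fi
  done

  # Destroy on exit regardless of outcome.
  destroy_topology "$topology"

  # Surface an absorbed failure as a workflow annotation.
  if (( rc == 0 && attempt > 1 )); then
    echo "::warning::${job} passed after ${attempt} attempts (retry absorbed transient failure)"
  fi
  return "$rc"
}
```

A call would look like `run_with_retry M38 <topology-file> bash tests/interop/scripts/test-m38-evpn-df-election.sh`, where the topology file is whatever path the job already uses. Keeping the attempt count fixed at two bounds the worst-case delay and still lets a genuinely broken job fail.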

Today alone, transient failures hit four M-series jobs (M14, M38, M42, and M36 twice), and every one passed on rerun. All were symptoms of self-hosted runner cohabitation, tight 60s timing gates, and stale-clab accumulation rather than real regressions. This converts those single-shot transients from "PR blocker until I notice and run `gh run rerun`" to "30s delay nobody sees, with a warning chip in the UI."

What's migrated

  • kernel-dataplane.yml (self-hosted): M36, M37, M37+IP, M38, M39, M40, M42, M43
  • interop.yml (hosted ubuntu-latest): M1, M10, M13, M14, M15, M17, M22, M24, M25, M29, M30, M34, M35, M35b, M35c, M41

Each migrated job drops from ~12 lines of YAML to a single uses: block.

What's NOT retried (intentionally)

  • build-image, msrv, cargo clippy, cargo doc, security audit — failures here are real regressions. Adding retry would hide compile failures behind transient noise.
  • M43's Detect TCP-AO kernel support pre-check step — that's a one-shot probe, not a topology lifecycle.

Timing-gate bump

M38's "PE2 promotes to DF after PE1 shutdown" widened from 60s → 120s. DF re-election depends on BGP hold-timer expiry on PE2 after PE1's session dies, which slips past a tight 60s window under host contention. Widening costs nothing in the success case and absorbs the observed jitter.
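Why a wider gate is free in the success case can be seen from a generic polling loop. The sketch below is an assumption about how such a gate is typically written, not the contents of the M38 script; `wait_for` and the `pe2_is_df` check named in the comment are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical timing gate: poll a condition until it holds or a deadline passes.

# wait_for TIMEOUT_SECONDS CMD...
# Returns 0 as soon as CMD succeeds, or 1 once TIMEOUT_SECONDS have elapsed.
wait_for() {
  local timeout="$1"
  shift
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Illustrative use, with pe2_is_df standing in for the real DF-role check:
#   wait_for 120 pe2_is_df || { echo "PE2 did not become DF within 120s"; exit 1; }
```

Because the loop returns the moment the condition holds, raising the timeout from 60 to 120 only changes how long a failing run waits before giving up.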

Verification

  • YAML parse: python3 -c "import yaml; yaml.safe_load(...)" on all three modified YAML files
  • Bash syntax: bash -n tests/interop/scripts/test-m38-evpn-df-election.sh
  • The composite is invoked identically in both workflows; topology argument matches the existing destroy/deploy file paths exactly (verified by grep).
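The bash syntax check above extends to several scripts at once. The helper below is a hypothetical sketch (`check_scripts` is not part of the repository); it relies only on `bash -n`, which parses a script without executing it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: syntax-check a list of scripts without running them.

# check_scripts FILE...
# Prints one line per script that fails to parse; returns 1 if any failed.
check_scripts() {
  local failed=0 script
  for script in "$@"; do
    # bash -n reads and parses the file but executes nothing.
    if ! bash -n "$script" 2>/dev/null; then
      echo "syntax error: $script"
      failed=1
    fi
  done
  return "$failed"
}
```

For this PR the equivalent call would be `check_scripts tests/interop/scripts/test-m38-evpn-df-election.sh`.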

Failure-mode visibility

A successful retry emits `::warning::Mxx passed after 2 attempts (retry absorbed transient failure)`. If a job is genuinely flaky over time, the warning chip recurs on every PR where the retry fired, so the issue stays visible rather than hidden. Per-job retry-rate telemetry is filed as a follow-up.

Follow-ups (separate PRs)

  • Periodic stale-clab sweep on the self-hosted runner via systemd timer (prevents the multi-week accumulation we cleaned up today during the kernel upgrade)
  • Per-job retry-rate telemetry to identify creeping fragility over time

Adds a composite action that wraps the destroy → deploy → run →
destroy-on-exit lifecycle with 2-attempt retry on transient failure.
The composite always destroys the topology before each attempt so
stale state from a prior failed run cannot leak in, and always
destroys on exit regardless of outcome.

Successful retry emits a workflow ::warning:: annotation so CI flake
stays visible in the UI rather than silently passing.

Migrated jobs:

  kernel-dataplane.yml: M36, M37, M37+IP, M38, M39, M40, M42, M43
  interop.yml: M1, M10, M13, M14, M15, M17, M22, M24, M25, M29,
               M30, M34, M35, M35b, M35c, M41

Each M-series job drops from ~12 lines of deploy/test/destroy YAML
to one `uses:` block.

Also widens M38's PE2-promotes-to-DF gate from 60s to 120s. DF
re-election depends on PE1's session hold-timer expiry, which can
slip past 60s when the runner is sharing CPU with release-time
concurrent workflows. The 120s gate has no cost in the success case
and absorbs the observed timing-jitter that produced today's two
M38 PR flakes.

Not retried (real regression signal): build-image, msrv, clippy,
doc, security-audit. Adding retry there would hide compile failures
behind transient noise.

Verification:

- python3 -c "import yaml; yaml.safe_load(...)" on all 3 yaml files
- bash -n on the modified M38 script
- Composite is invoked the same way in both workflows; topology
  argument matches the existing destroy/deploy file paths exactly.

Follow-ups (separate issues to file):

- Periodic stale-clab sweep on the self-hosted runner via systemd
  timer (prevents the multi-week accumulation we cleaned up today).
- Per-job retry-rate telemetry to identify creeping fragility.
@lance0 lance0 temporarily deployed to kernel-dataplane-auto May 19, 2026 18:13 — with GitHub Actions Inactive
@lance0 lance0 temporarily deployed to kernel-dataplane-auto May 19, 2026 18:19 — with GitHub Actions Inactive
@lance0 lance0 temporarily deployed to kernel-dataplane-auto May 19, 2026 18:20 — with GitHub Actions Inactive
@lance0 lance0 temporarily deployed to kernel-dataplane-auto May 19, 2026 18:21 — with GitHub Actions Inactive
@lance0 lance0 merged commit 512123b into main May 19, 2026
32 checks passed
@lance0 lance0 deleted the chore/ci-interop-retry branch May 19, 2026 18:22