CI: record per-job interop retry-rate telemetry #187

@lance0


Context

PR #185 made privileged M-series interop jobs more tolerant of transient containerlab failures by retrying each topology in .github/actions/run-interop-test. The action emits warning annotations when a retry succeeds, but there is no durable per-job summary that shows how often retries are being consumed across runs.

Expected direction

Add lightweight retry-rate telemetry for interop jobs so we can distinguish real stability improvements from flake masking. This should stay CI-local and low-risk: job summaries, uploaded JSON/Markdown artifacts, or workflow annotations are enough; no external metrics service is required for the first slice.
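One possible shape for the CI-local reporting step, as a minimal sketch rather than a proposed implementation. It assumes each job has already collected its retry records as plain dictionaries; the keys `topology`, `attempts`, and `result`, the function names, and the artifact filename `interop-retry-telemetry.json` are all hypothetical. The only GitHub-provided piece is the `GITHUB_STEP_SUMMARY` environment variable, which names the file that Actions renders as the job summary.

```python
import json
import os
from pathlib import Path


def render_summary(records):
    """Render retry records as a Markdown table for the job summary."""
    lines = [
        "### Interop retry telemetry",
        "",
        "| Topology | Attempts | Result | Retry absorbed failure |",
        "| --- | --- | --- | --- |",
    ]
    for r in records:
        # A retry "absorbed" a failure when the topology passed on attempt 2+.
        absorbed = "yes" if r["result"] == "pass" and r["attempts"] > 1 else "no"
        lines.append(
            f"| {r['topology']} | {r['attempts']} | {r['result']} | {absorbed} |"
        )
    return "\n".join(lines) + "\n"


def publish(records, artifact_path="interop-retry-telemetry.json"):
    """Write a JSON artifact and append the table to the job summary."""
    # Hypothetical artifact name; a later workflow step would upload this file.
    Path(artifact_path).write_text(
        json.dumps(records, indent=2) + "\n", encoding="utf-8"
    )
    summary_file = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_file:  # unset when running outside GitHub Actions
        with open(summary_file, "a", encoding="utf-8") as fh:
            fh.write(render_summary(records))
```

Because the script only writes files, it cannot affect the job's exit status, which keeps the reporting separate from the retry logic itself.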

Acceptance criteria

  • Each interop job records topology label, attempts used, final result, and whether a retry absorbed a transient failure.
  • The workflow summary exposes retry counts in a way reviewers can see from the Actions UI.
  • Failed jobs still surface the first failing attempt clearly enough for debugging.
  • Telemetry does not change the pass/fail semantics from PR #185 (ci: bounded-retry M-series interop jobs).
  • Add docs or comments for how maintainers should interpret retry-rate trends.
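The record described in the first criterion could be modelled roughly as follows. This is an illustrative sketch, not the action's actual schema: the class name, field names, and the `retry_rate` helper are assumptions. The "retry absorbed a transient failure" flag is derived from the other fields rather than stored, so it cannot drift out of sync with them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TopologyRecord:
    """One interop topology run within a job (hypothetical schema)."""

    topology: str
    attempts_used: int
    final_result: str  # "pass" or "fail"
    # Short excerpt from the first failing attempt, kept for debugging.
    first_failure: Optional[str] = None

    @property
    def retry_absorbed(self) -> bool:
        # True only when the topology eventually passed after at least one retry.
        return self.final_result == "pass" and self.attempts_used > 1


def retry_rate(records: Sequence[TopologyRecord]) -> float:
    """Fraction of topology runs that passed only because of a retry."""
    if not records:
        return 0.0
    return sum(r.retry_absorbed for r in records) / len(records)
```

Under this definition, a rate that stays near zero suggests the retries are rarely needed, while a rising rate alongside a steady pass rate suggests flakes are being masked rather than fixed.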

Related
