ci: add watchdog to kill hung VM test batches before job timeout #12023

Open
tomastigera wants to merge 4 commits into projectcalico:master from tomastigera:ci-watchdog-for-hung-batches

@tomastigera tomastigera commented Mar 7, 2026

Summary

Two fixes for CI test batch reliability:

  • Watchdog timer for hung batches: run-tests-on-vms now starts a watchdog that fires 5 minutes before the Semaphore job timeout (JOB_TIMEOUT_MINUTES). It kills any still-running batch subshells, generates JUnit XML failure reports for the timed-out batches, and lets the job exit cleanly as "failed" instead of being hard-killed by Semaphore. This ensures the epilogue runs, artifacts are uploaded, and test results are visible.

  • Fix BPF UT pipeline hang: startBPFLogging() was starting bpftool prog tracelog with inherited stdout/stderr. When the test binary runs in a pipeline (bpf_ut.test |& gotestsum), bpftool inherits the pipe file descriptors. If the test binary exits without killing bpftool (panic, crash, signal), bpftool keeps the pipe open and gotestsum blocks forever. Fixed by redirecting bpftool output to a file and adding a SIGKILL fallback with timeout in stopBPFLogging().

Test plan

  • Watchdog tested on a CI run where the UT batch hung — job exited as failed with JUnit report instead of timing out silently
  • BPF UT hang root-caused and fix verified: bpftool no longer inherits the pipeline pipe

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings March 7, 2026 00:31
@tomastigera requested a review from a team as a code owner March 7, 2026 00:31
@marvin-tigera added this to the Calico v3.32.0 milestone Mar 7, 2026
@marvin-tigera added the release-note-required (change has user-facing impact, no matter how small) and docs-pr-required (change is not yet documented) labels Mar 7, 2026
@tomastigera added the docs-not-required (docs not required for this change) and release-note-not-required (change has no user-facing impact) labels and removed the release-note-required and docs-pr-required labels Mar 7, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds a watchdog mechanism to the VM-based test runner so CI jobs fail cleanly (with JUnit output and artifacts) instead of being force-killed by Semaphore on timeout, improving debuggability and cleanup behavior.

Changes:

  • Add a background watchdog timer to .semaphore/vms/run-tests-on-vms that terminates still-running batch subshells shortly before the Semaphore job timeout and emits per-batch JUnit failures.
  • Set JOB_TIMEOUT_MINUTES for non-default-timeout jobs (e.g., Node KinD at 120m, Felix BPF program-loading check at 30m).
  • Regenerate/update Semaphore pipeline YAML to include the new environment variable settings.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| .semaphore/vms/run-tests-on-vms | Adds watchdog to preempt job timeout, kill hung batches, and write JUnit failures into artifacts. |
| .semaphore/semaphore.yml.d/blocks/20-node.yml | Passes JOB_TIMEOUT_MINUTES=120 to the KinD VM test job. |
| .semaphore/semaphore.yml.d/blocks/20-felix.yml | Passes JOB_TIMEOUT_MINUTES=30 to the 30-minute Felix VM test job. |
| .semaphore/semaphore.yml | Generated pipeline YAML updated to include the new env var settings. |
| .semaphore/semaphore-scheduled-builds.yml | Scheduled-build pipeline YAML updated to include the new env var settings. |

tomastigera and others added 2 commits March 7, 2026 10:01
When a test batch hangs (e.g. a UT getting stuck), the wait loop in
run-tests-on-vms blocks forever until Semaphore force-kills the job.
This means no results summary, no artifacts uploaded, and no JUnit
reports — the job just shows as "timed out" with no indication of
which batch was stuck.

Add a watchdog timer that fires 5 minutes before the job timeout. It
identifies and kills any still-running batch subshells, generates a
JUnit XML failure report for each timed-out batch, and lets the script
exit cleanly. This ensures:
- The epilogue runs (artifacts uploaded, VMs cleaned up)
- A clear JUnit test failure identifies the hung batch
- The job shows as "failed" instead of "timed out"

Set JOB_TIMEOUT_MINUTES for jobs with non-default timeouts (Node
kind-cluster at 120m, Felix BPF program-loading at 30m).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
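
For readers unfamiliar with the JUnit failure report mentioned in the commit message above, here is a minimal sketch in Go of what such a report could look like. The actual watchdog lives in the run-tests-on-vms script and its report layout is not shown in this PR page; the type names, the function names (`timedOutReport`, `countFailures`), and the failure message wording are all hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Hypothetical, minimal JUnit structures: one suite with one failing case.
type failure struct {
	Message string `xml:"message,attr"`
	Text    string `xml:",chardata"`
}

type testCase struct {
	Name    string   `xml:"name,attr"`
	Failure *failure `xml:"failure,omitempty"`
}

type testSuite struct {
	XMLName  xml.Name   `xml:"testsuite"`
	Name     string     `xml:"name,attr"`
	Tests    int        `xml:"tests,attr"`
	Failures int        `xml:"failures,attr"`
	Cases    []testCase `xml:"testcase"`
}

// timedOutReport builds a JUnit document recording a single failure for a
// batch that was still running when the watchdog fired.
func timedOutReport(batch string, timeoutMinutes int) ([]byte, error) {
	msg := fmt.Sprintf("batch %s was still running when the watchdog fired (job timeout: %d min)",
		batch, timeoutMinutes)
	suite := testSuite{
		Name:     batch,
		Tests:    1,
		Failures: 1,
		Cases: []testCase{{
			Name:    batch + " timed out",
			Failure: &failure{Message: msg, Text: msg},
		}},
	}
	body, err := xml.MarshalIndent(suite, "", "  ")
	if err != nil {
		return nil, err
	}
	return append([]byte(xml.Header), body...), nil
}

// countFailures parses a report back and counts test cases with a failure.
func countFailures(report []byte) (int, error) {
	var s testSuite
	if err := xml.Unmarshal(report, &s); err != nil {
		return 0, err
	}
	n := 0
	for _, c := range s.Cases {
		if c.Failure != nil {
			n++
		}
	}
	return n, nil
}

func main() {
	report, err := timedOutReport("ut", 60)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(report))
}
```

The point of emitting a report per timed-out batch is that the results summary then names the hung batch as an ordinary test failure rather than leaving only a job-level timeout.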
…meouts

- Use _ut_ in filename for ut batch, _fv_ for FV batches, so
  publish-reports classifies them correctly
- Guard against JOB_TIMEOUT_MINUTES <= 5 which would produce a
  negative sleep duration
- Guard watchdog cleanup for when watchdog was disabled

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
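
The guard described in this commit can be illustrated with a short Go sketch. The real logic is in the run-tests-on-vms script, which is not reproduced here; `watchdogDelay` and `watchdogMargin` are hypothetical names, and treating a missing or malformed JOB_TIMEOUT_MINUTES as "watchdog disabled" is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// watchdogMargin is how long before the job timeout the watchdog fires
// (5 minutes, per the PR description).
const watchdogMargin = 5 * time.Minute

// watchdogDelay returns how long to sleep before the watchdog fires.
// ok is false when the value is missing, malformed, or too small to leave
// a positive delay; the caller should then skip starting the watchdog.
func watchdogDelay(jobTimeoutMinutes string) (delay time.Duration, ok bool) {
	minutes, err := strconv.Atoi(jobTimeoutMinutes)
	if err != nil {
		return 0, false
	}
	delay = time.Duration(minutes)*time.Minute - watchdogMargin
	if delay <= 0 {
		return 0, false
	}
	return delay, true
}

func main() {
	delay, ok := watchdogDelay(os.Getenv("JOB_TIMEOUT_MINUTES"))
	if !ok {
		fmt.Println("watchdog disabled")
		return
	}
	fmt.Printf("watchdog fires after %s\n", delay)
}
```

With JOB_TIMEOUT_MINUTES=120 this yields a delay of 115 minutes; with a value of 5 or less the watchdog is not started, which is also why the cleanup path has to tolerate a watchdog that never ran.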
@tomastigera force-pushed the ci-watchdog-for-hung-batches branch from 9970a18 to 5777cfa March 7, 2026 18:05
startBPFLogging() starts "bpftool prog tracelog" to capture BPF trace
output during tests. Previously it set cmd.Stdout/Stderr = os.Stdout/
os.Stderr, which means bpftool inherits the test binary's file
descriptors. When the test binary runs in a pipeline:

  bpf_ut.test |& gotestsum --raw-command -- go tool test2json

bpftool inherits the write end of the pipe to gotestsum. If the test
binary exits without successfully killing bpftool (e.g. due to a panic,
signal, or os.Exit before stopBPFLogging runs), bpftool keeps the pipe
open indefinitely and gotestsum blocks forever waiting for EOF. This
manifests as the entire UT batch hanging until the CI job times out.

Fix by redirecting bpftool output to /tmp/bpf-trace.log instead of
inheriting the process stdout/stderr. This breaks the pipe inheritance
so bpftool can never block the pipeline. Also add a SIGKILL fallback
with timeout in stopBPFLogging, since bpftool may be blocked in
ring_buffer_wait and not respond to SIGTERM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
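
The "SIGKILL fallback with timeout" pattern named in the commit message can be sketched as follows. This is not the PR's stopBPFLogging code, which is not shown on this page; `stopWithFallback` and `startChild` are hypothetical helpers, and `sleep` stands in for bpftool so the example runs without BPF tooling.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// startChild starts a child process and returns its handle.
func startChild(name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// stopWithFallback asks the child to exit with SIGTERM, waits up to grace
// for it to do so, and otherwise sends SIGKILL. It reports whether SIGKILL
// was needed. The child is always reaped before the function returns.
func stopWithFallback(cmd *exec.Cmd, grace time.Duration) (killed bool) {
	done := make(chan struct{})
	go func() {
		_ = cmd.Wait() // an error is expected when the child dies from a signal
		close(done)
	}()

	_ = cmd.Process.Signal(syscall.SIGTERM)
	select {
	case <-done:
		return false
	case <-time.After(grace):
		_ = cmd.Process.Kill() // SIGKILL cannot be caught or ignored
		<-done
		return true
	}
}

func main() {
	cmd, err := startChild("sleep", "30")
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}
	killed := stopWithFallback(cmd, 2*time.Second)
	fmt.Println("needed SIGKILL:", killed)
}
```

Running `Wait` in a goroutine is what makes the timeout possible: the caller can stop waiting on the polite signal without losing the ability to reap the process afterwards.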
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
err := cmd.Start()
cmd.Stdout = f
Member

In the FVs, we use a pipe and capture the logs into the log stream of the test; are you happy with sending to a temp file instead?

In any case, you should make sure this goes into the job artifacts so we have it in CI.

Contributor Author

Reverted the file-redirect approach. Instead, bpftool now keeps writing to stdout/stderr (so trace logs stay inline in the test output) and we set Pdeathsig: syscall.SIGKILL on the child process. This tells the kernel to automatically SIGKILL bpftool when the parent thread exits — so if the test binary panics or crashes without running stopBPFLogging, bpftool is killed immediately and releases the pipe FDs, letting gotestsum get EOF instead of blocking forever.

The stopBPFLogging SIGKILL-with-timeout fallback is kept for clean shutdown.
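
A minimal sketch of the Pdeathsig approach described in this reply, assuming Linux. It is not the PR's startBPFLogging code; `newTraceCmd` is a hypothetical helper and `sleep` stands in for `bpftool prog tracelog` so the example runs without BPF tooling.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
	"syscall"
)

// newTraceCmd builds (but does not start) a child command that keeps
// writing to the parent's stdout/stderr and that the kernel will SIGKILL
// when the parent thread that started it exits. Pdeathsig is Linux-specific.
func newTraceCmd(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Pdeathsig: syscall.SIGKILL,
	}
	return cmd
}

func main() {
	// The death signal is tied to the OS thread that forked the child, not
	// to the whole process, so pin this goroutine to its thread while the
	// child is alive.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	cmd := newTraceCmd("sleep", "1")
	if err := cmd.Start(); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	_ = cmd.Wait()
	fmt.Println("child exited")
}
```

Because the signal is delivered when the forking thread exits, the Go documentation for `syscall.SysProcAttr` recommends `runtime.LockOSThread` when using Pdeathsig; without it, the child could be killed early if the Go runtime retires that thread.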

  value: "bpf-25.10-nft-no-fv-with-ut"
- name: FELIX_FV_BPFATTACHTYPE
  value: "tc"
- name: JOB_TIMEOUT_MINUTES
Member

I think a few more jobs need this setting.

- Write BPF trace log to report/ dir so collect-artifacts picks it up
  and it's available in CI job artifacts (fasaxc)
- Add JOB_TIMEOUT_MINUTES=60 to all Felix VM test jobs so the watchdog
  timeout matches the Semaphore execution_time_limit (fasaxc)
- Regenerate semaphore YAML

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tomastigera force-pushed the ci-watchdog-for-hung-batches branch from 2a0b3ae to ec598e0 March 10, 2026 15:25
4 participants