ci: add watchdog to kill hung VM test batches before job timeout #12023
tomastigera wants to merge 4 commits into projectcalico:master
Conversation
Pull request overview
Adds a watchdog mechanism to the VM-based test runner so CI jobs fail cleanly (with JUnit output and artifacts) instead of being force-killed by Semaphore on timeout, improving debuggability and cleanup behavior.
Changes:
- Add a background watchdog timer to `.semaphore/vms/run-tests-on-vms` that terminates still-running batch subshells shortly before the Semaphore job timeout and emits per-batch JUnit failures.
- Set `JOB_TIMEOUT_MINUTES` for non-default-timeout jobs (e.g., Node KinD at 120m, Felix BPF program-loading check at 30m).
- Regenerate/update Semaphore pipeline YAML to include the new environment variable settings.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `.semaphore/vms/run-tests-on-vms` | Adds watchdog to preempt job timeout, kill hung batches, and write JUnit failures into artifacts. |
| `.semaphore/semaphore.yml.d/blocks/20-node.yml` | Passes `JOB_TIMEOUT_MINUTES=120` to the KinD VM test job. |
| `.semaphore/semaphore.yml.d/blocks/20-felix.yml` | Passes `JOB_TIMEOUT_MINUTES=30` to the 30-minute Felix VM test job. |
| `.semaphore/semaphore.yml` | Generated pipeline YAML updated to include the new env var settings. |
| `.semaphore/semaphore-scheduled-builds.yml` | Scheduled-build pipeline YAML updated to include the new env var settings. |
When a test batch hangs (e.g. a UT getting stuck), the wait loop in run-tests-on-vms blocks forever until Semaphore force-kills the job. This means no results summary, no artifacts uploaded, and no JUnit reports — the job just shows as "timed out" with no indication of which batch was stuck.

Add a watchdog timer that fires 5 minutes before the job timeout. It identifies and kills any still-running batch subshells, generates a JUnit XML failure report for each timed-out batch, and lets the script exit cleanly. This ensures:
- The epilogue runs (artifacts uploaded, VMs cleaned up)
- A clear JUnit test failure identifies the hung batch
- The job shows as "failed" instead of "timed out"

Set JOB_TIMEOUT_MINUTES for jobs with non-default timeouts (Node kind-cluster at 120m, Felix BPF program-loading at 30m).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
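The watchdog flow described above can be sketched roughly as follows. This is a hedged sketch, not the actual `run-tests-on-vms` code: `BATCH_PIDS`, `REPORT_DIR`, and the exact JUnit file naming are illustrative assumptions.

```shell
#!/bin/bash
# Illustrative sketch of a pre-timeout watchdog for a batch runner.
JOB_TIMEOUT_MINUTES="${JOB_TIMEOUT_MINUTES:-60}"
REPORT_DIR="${REPORT_DIR:-report}"

start_watchdog() {
  # Fire 5 minutes before the Semaphore job timeout; guard against
  # small timeouts that would give a non-positive sleep duration.
  local lead_minutes=$(( JOB_TIMEOUT_MINUTES - 5 ))
  if (( lead_minutes <= 0 )); then
    echo "watchdog disabled: JOB_TIMEOUT_MINUTES too small" >&2
    return
  fi
  (
    sleep $(( lead_minutes * 60 ))
    for pid in "${BATCH_PIDS[@]}"; do
      if kill -0 "$pid" 2>/dev/null; then
        kill -TERM "$pid"
        # Emit a JUnit failure so the hung batch shows up in CI.
        # The PR uses _ut_/_fv_ in the filename so publish-reports
        # classifies it; _fv_ is assumed here for illustration.
        cat > "$REPORT_DIR/watchdog_fv_batch_${pid}.xml" <<EOF
<testsuite name="watchdog" tests="1" failures="1">
  <testcase name="batch-${pid}"><failure>batch timed out</failure></testcase>
</testsuite>
EOF
      fi
    done
  ) &
  WATCHDOG_PID=$!
}
```

On a clean run the script would kill `$WATCHDOG_PID` before exiting, with that cleanup guarded for the case where the watchdog was never started.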
…meouts

- Use _ut_ in filename for ut batch, _fv_ for FV batches, so publish-reports classifies them correctly
- Guard against JOB_TIMEOUT_MINUTES <= 5 which would produce a negative sleep duration
- Guard watchdog cleanup for when watchdog was disabled

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9970a18 to 5777cfa
startBPFLogging() starts "bpftool prog tracelog" to capture BPF trace output during tests. Previously it set cmd.Stdout/Stderr = os.Stdout/os.Stderr, which means bpftool inherits the test binary's file descriptors. When the test binary runs in a pipeline:

    bpf_ut.test |& gotestsum --raw-command -- go tool test2json

bpftool inherits the write end of the pipe to gotestsum. If the test binary exits without successfully killing bpftool (e.g. due to a panic, signal, or os.Exit before stopBPFLogging runs), bpftool keeps the pipe open indefinitely and gotestsum blocks forever waiting for EOF. This manifests as the entire UT batch hanging until the CI job times out.

Fix by redirecting bpftool output to /tmp/bpf-trace.log instead of inheriting the process stdout/stderr. This breaks the pipe inheritance so bpftool can never block the pipeline. Also add a SIGKILL fallback with timeout in stopBPFLogging, since bpftool may be blocked in ring_buffer_wait and not respond to SIGTERM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
felix/bpf/ut/bpf_prog_test.go (outdated)

```diff
-cmd.Stdout = os.Stdout
-cmd.Stderr = os.Stderr
 err := cmd.Start()
+cmd.Stdout = f
```
In the FVs, we use a pipe and capture the logs into the log stream of the test; are you happy with sending to a temp file instead?
In any case, you should make sure this goes into the job artifacts so we have it in CI.
Reverted the file-redirect approach. Instead, bpftool now keeps writing to stdout/stderr (so trace logs stay inline in the test output) and we set Pdeathsig: syscall.SIGKILL on the child process. This tells the kernel to automatically SIGKILL bpftool when the parent thread exits — so if the test binary panics or crashes without running stopBPFLogging, bpftool is killed immediately and releases the pipe FDs, letting gotestsum get EOF instead of blocking forever.
The stopBPFLogging SIGKILL-with-timeout fallback is kept for clean shutdown.
```yaml
      value: "bpf-25.10-nft-no-fv-with-ut"
    - name: FELIX_FV_BPFATTACHTYPE
      value: "tc"
    - name: JOB_TIMEOUT_MINUTES
```
- Write BPF trace log to report/ dir so collect-artifacts picks it up and it's available in CI job artifacts (fasaxc)
- Add JOB_TIMEOUT_MINUTES=60 to all Felix VM test jobs so the watchdog timeout matches the Semaphore execution_time_limit (fasaxc)
- Regenerate semaphore YAML

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2a0b3ae to ec598e0
Summary
Two fixes for CI test batch reliability:
Watchdog timer for hung batches: `run-tests-on-vms` now starts a watchdog that fires 5 minutes before the Semaphore job timeout (`JOB_TIMEOUT_MINUTES`). It kills any still-running batch subshells, generates JUnit XML failure reports for the timed-out batches, and lets the job exit cleanly as "failed" instead of being hard-killed by Semaphore. This ensures the epilogue runs, artifacts are uploaded, and test results are visible.

Fix BPF UT pipeline hang: `startBPFLogging()` was starting `bpftool prog tracelog` with inherited stdout/stderr. When the test binary runs in a pipeline (`bpf_ut.test |& gotestsum`), bpftool inherits the pipe file descriptors. If the test binary exits without killing bpftool (panic, crash, signal), bpftool keeps the pipe open and gotestsum blocks forever. Fixed by redirecting bpftool output to a file and adding a SIGKILL fallback with timeout in `stopBPFLogging()`.

Test plan
🤖 Generated with Claude Code