Rationale
Five stress tests (fd-exhaust, rapid-fork, concurrent-io, signal-race, long-running) run via make check-stress, but failures are swallowed with an informational message. Because the tests are permanently non-blocking, regressions go undetected and the tests provide no CI value.
Proposed Changes
- Triage existing failures: run each stress test in isolation on both x86_64 and aarch64, document which tests fail, under what conditions, and whether failures are kbox bugs or environmental noise.
- Selective promotion: promote deterministic tests (fd-exhaust and rapid-fork are candidates) to blocking status. Keep timing-dependent tests (signal-race) as non-blocking with expected-failure annotations.
- Retry after triage: add retry logic (3 attempts) only for tests proven to be timing-sensitive, to avoid masking real instability.
- Revisit LSAN suppressions: lsan-suppressions.txt already suppresses LKL semaphore leaks in posix-host.c. Determine if these are upstream LKL issues or kbox integration bugs, and update suppressions as needed.
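The selective-promotion and retry-after-triage policy above can be sketched as a small shell helper. This is a minimal illustration, not the contents of scripts/run-stress.sh: the names BLOCKING_TESTS, RETRY_TESTS, MAX_ATTEMPTS, is_in, run_with_retry and run_stress_test are all hypothetical, and the list assignments assume triage confirms the candidates named in the proposal.

```shell
#!/bin/sh
# Hypothetical sketch: none of these names come from scripts/run-stress.sh.

# Tests whose failure should fail the run (candidates named in the proposal).
BLOCKING_TESTS="fd-exhaust rapid-fork"
# Tests shown to be timing-sensitive during triage; only these get retries.
RETRY_TESTS="signal-race"
MAX_ATTEMPTS=3

# is_in WORD LIST: succeed if WORD appears in the space-separated LIST.
is_in() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# run_with_retry ATTEMPTS CMD...: run CMD until it succeeds or ATTEMPTS is used up.
run_with_retry() {
    attempts=$1
    shift
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $n/$attempts failed: $*" >&2
        n=$((n + 1))
    done
    return 1
}

# run_stress_test NAME CMD...: apply the policy to one test.
# Returns non-zero only when a blocking test fails.
run_stress_test() {
    name=$1
    shift
    attempts=1
    if is_in "$name" "$RETRY_TESTS"; then
        attempts=$MAX_ATTEMPTS
    fi
    if run_with_retry "$attempts" "$@"; then
        echo "PASS $name"
        return 0
    fi
    if is_in "$name" "$BLOCKING_TESTS"; then
        echo "FAIL $name (blocking)"
        return 1
    fi
    echo "XFAIL $name (non-blocking)"
    return 0
}
```

Keeping the retry list separate from the blocking list means a deterministic test such as fd-exhaust fails on its first bad run, while a test can only gain retries by being added to RETRY_TESTS explicitly after triage.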
Considerations
- STRESS_TIMEOUT is already configurable via environment variable; consider per-test timeouts or CI-specific defaults rather than a global increase
- Stress binaries are built with -O2 -static (no ASAN/UBSAN), testing different codepaths than unit tests
- signal-race is the most likely to be flaky (depends on SIGALRM delivery timing)
- Blocking on all stress tests would slow CI; selective promotion is the pragmatic path
- Triage must come before retries -- adding retries first risks masking real instability
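One way to get per-test timeouts without raising the global value is a lookup that prefers a per-test override and falls back to STRESS_TIMEOUT. The sketch below is an assumption-laden illustration: the STRESS_TIMEOUT_<NAME> override variables, the timeout_for helper and the 60-second fallback are hypothetical and do not reflect what scripts/run-stress.sh does today.

```shell
#!/bin/sh
# Hypothetical sketch: per-test override variables are not an existing feature.

# timeout_for NAME: print the timeout in seconds for one stress test.
# NAME must come from the fixed test list, since it is used to build a
# variable name that is expanded with eval.
timeout_for() {
    # Map e.g. "signal-race" to STRESS_TIMEOUT_SIGNAL_RACE.
    var="STRESS_TIMEOUT_$(printf '%s' "$1" | tr 'a-z-' 'A-Z_')"
    eval "override=\${$var:-}"
    if [ -n "$override" ]; then
        printf '%s\n' "$override"
    else
        # 60 is a placeholder default, not the runner's actual value.
        printf '%s\n' "${STRESS_TIMEOUT:-60}"
    fi
}
```

A CI job could then export a longer limit for a single slow test, for example STRESS_TIMEOUT_LONG_RUNNING, and leave the other tests on the global default.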
References
- tests/stress/: stress test source files
- scripts/run-stress.sh: test runner with per-test timeout and LSAN configuration
- Makefile: check-stress target swallows failures
- scripts/lsan-suppressions.txt: existing LSAN suppressions