Description
Problem Summary
Frontend unit tests in PR workflows are cancelling unexpectedly on the latest commit without any new code pushes.
CRITICAL: This issue is DETERMINISTIC. It fails 100% of the time when frontend tests run (after Dec 23), and 0% of the time when they don't run.
Expected vs Unexpected Cancellations
✅ EXPECTED: Concurrency Control Cancellation
When it happens:
- Developer pushes new commit to PR branch
- Existing workflow run for older commit is automatically cancelled
- New workflow starts for the latest commit
How it appears:
- Job status: "cancelled"
- Reason: Concurrency group policy (cancel-in-progress)
- Message: "This run was cancelled because a newer run was triggered"
❌ UNEXPECTED: Test Hang Cancellation (This Issue)
When it happens:
- No new commits pushed
- Workflow runs on the latest commit
- Tests complete successfully (all pass)
- Process hangs for ~3 minutes with no output
- GitHub Actions detects unresponsiveness
- External shutdown signal sent
How it appears:
- Job status: "cancelled" (not "failed")
- Tests: All pass (e.g., "Tests: 242 passed, 242 total")
- Worker failures: 4-5 occurrences: "A worker process has failed to exit gracefully..."
- ~3 minute silence after tests complete
- Message: "##[error]The runner has received a shutdown signal"
- Message: "##[error]The operation was canceled"
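The two cancellation types can be told apart mechanically from the job log. The following is a minimal sketch, assuming the raw log text is available as a string; the marker strings are the messages quoted in this issue, and `classifyCancellation` is a hypothetical helper name, not part of any existing tooling.

```typescript
// Marker strings are copied from the log messages quoted in this issue.
const NEWER_RUN_MARKER = 'This run was cancelled because a newer run was triggered';
const SHUTDOWN_MARKER = 'The runner has received a shutdown signal';
const TESTS_DONE_MARKER = 'Ran all test suites';

type CancellationKind = 'concurrency' | 'post-test-hang' | 'unknown';

// Hypothetical helper: classify a cancelled job from its raw log text.
function classifyCancellation(log: string): CancellationKind {
    if (log.includes(NEWER_RUN_MARKER)) {
        return 'concurrency';
    }
    const testsDoneAt = log.indexOf(TESTS_DONE_MARKER);
    const shutdownAt = log.indexOf(SHUTDOWN_MARKER);
    // The hang signature is a shutdown signal that appears after Jest reported completion.
    if (testsDoneAt !== -1 && shutdownAt > testsDoneAt) {
        return 'post-test-hang';
    }
    return 'unknown';
}
```

A log that matches neither signature is reported as `unknown` rather than guessed at, so other cancellation causes are not mislabelled.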
Root Cause Analysis
The Deterministic Pattern
The issue ONLY occurs when:
- PR changes core-web test files
- nx affected selects frontend test projects
- "Frontend Unit Tests" job actually runs
The issue NEVER occurs when:
- PRs don't touch core-web
- PRs touch core-web but not test files
- nx affected skips frontend tests
- "Frontend Unit Tests" job doesn't run
Why Dec 15-22 Had "No Issues"
Dec 15: PR #34099 merged (introducing the configuration change)
Dec 15-22: All PRs either:
- Didn't touch core-web at all
- Changed core-web but not test files
- nx affected found no affected tests
- Frontend Unit Tests job never ran
- Builds appeared successful
Dec 23: First PR that changed core-web test files
- nx affected found affected tests
- Frontend Unit Tests job ran for the first time since PR #34099
- Job was cancelled on that first run
- Issue became visible
Breaking Change: PR #34099 (December 15, 2025)
Commit: 6771f7ef98
Title: "34098 task migrate angular projects from browser esbuild to the new application builder"
Merged: December 15, 2025 at 22:25 UTC
Critical Change:
# core-web/nx.json line 7
"options": {
- "parallel": 1
+ "parallel": 3
}
Other Changes in Same PR:
- Build system: webpack → esbuild
- TypeScript: module: es2022, esModuleInterop: true
- HMR: Hot module replacement enabled
- Proxy config: WebSocket changes
- Watch mode: Continuous watch enabled
Note: Analysis shows tests run sequentially despite parallel: 3 setting, but the configuration may still affect Nx's internal process management behavior.
Why PR #34099 Itself Didn't Fail
Critical Discovery: PR #34099 HAD THE EXACT SAME ISSUE but succeeded anyway!
PR #34099 Run 20245075272 (Dec 15, 2025):
19:58 - Frontend tests start
20:08:40 - ⚠️ Worker failure #1: "A worker process has failed to exit gracefully..."
20:09:08 - ⚠️ Worker failure #2
20:10:19 - ⚠️ Worker failure #3
[Multiple worker failures throughout]
20:21:12 - ✅ "Ran all test suites" (LAST TEST OUTPUT)
[3 minutes 32 seconds of silence - NO OUTPUT]
20:24:44 - ✅ "BUILD SUCCESS"
20:25:10 - ✅ Job completes successfully
Result: ✅ SUCCESS - No shutdown signal sent
Failing PRs (Dec 23+):
Tests start
[Multiple worker failures]
XX:XX:XX - ✅ "Ran all test suites" (LAST TEST OUTPUT)
[~3 minutes of silence - NO OUTPUT]
XX:XX:XX - ❌ "##[error]The runner has received a shutdown signal"
XX:XX:XX - ❌ "##[error]The operation was canceled"
Result: ❌ CANCELLED - Shutdown signal sent
Key Difference:
- PR #34099: 3.5 min delay tolerated → Success
- Dec 23+ PRs: 3 min delay NOT tolerated → Cancelled
Most Likely Explanation (unconfirmed): GitHub Actions infrastructure changed between Dec 15 and Dec 23. Possible changes:
- Runner health check timeout may have been reduced
- Health check algorithm may have changed
- Ubuntu 24.04 runner behavior may differ
- Unresponsiveness detection may have become more aggressive
Underlying Technical Debt
Tests have uncleaned resources (existed BEFORE PR #34099):
- Active timers (setTimeout, setInterval) not cleared
- RxJS subscriptions not unsubscribed
- Unresolved promises
- Zone.js async operations
- Angular change detection intervals
- PrimeNG component internal timers
Result:
- Jest workers fail to exit gracefully (4-5 per run)
- Jest parent process hangs waiting for cleanup
- Process becomes unresponsive (no log output)
- GitHub Actions detects hung process
- External shutdown signal sent (after Dec 23)
- Job cancelled (not failed, because tests passed)
Note: This technical debt existed before PR #34099, but GitHub's tolerance for it appears to have changed.
Evidence
Example Cancelled Run: 20622139223 (Dec 31, 2025)
16:14:54 - ✅ ALL TESTS PASS: "Test Suites: 33 passed, 33 total"
16:14:54 - ⚠️ Jest worker fails to exit gracefully (5th occurrence)
16:14:54 - ✅ "Ran all test suites" - TESTS COMPLETE
[~3 minute hang - Jest waiting for workers to cleanup]
16:17:54 - ❌ "The runner has received a shutdown signal"
16:17:55 - ❌ "The operation was canceled"
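The length of the silent window can be computed directly from the timestamps quoted in the two timelines. This is a small worked sketch; `toSeconds` and `silenceSeconds` are hypothetical helper names, and it assumes both timestamps fall on the same day.

```typescript
// Convert an HH:MM:SS timestamp (same day assumed) to seconds since midnight.
function toSeconds(hms: string): number {
    const [h, m, s] = hms.split(':').map(Number);
    return h * 3600 + m * 60 + s;
}

// Seconds of silence between the last test output and the next log line.
function silenceSeconds(lastOutput: string, nextLine: string): number {
    return toSeconds(nextLine) - toSeconds(lastOutput);
}

// PR #34099 run 20245075272: "Ran all test suites" -> "BUILD SUCCESS"
const successfulRunGap = silenceSeconds('20:21:12', '20:24:44'); // 212 s (3 min 32 s)

// Run 20622139223: "Ran all test suites" -> shutdown signal
const cancelledRunGap = silenceSeconds('16:14:54', '16:17:54'); // 180 s (3 min)
```

The successful run was silent for longer (212 s) than the cancelled run (180 s), so the hang duration alone does not appear to explain the different outcomes.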
Job Presence Analysis
Successful Builds (Dec 15-22):
- Run 20341280316 (Dec 18): ❌ NO "Frontend Unit Tests" job → ✅ Success
- Run 20316649587 (Dec 17): ❌ NO "Frontend Unit Tests" job → ✅ Success
Cancelled Builds (Dec 23+):
- Run 20641205755 (Jan 1): ✅ HAS "Frontend Unit Tests" job → ❌ Cancelled
- Run 20622139223 (Dec 31): ✅ HAS "Frontend Unit Tests" job → ❌ Cancelled
Sequential Execution Observation
Despite parallel: 3 configuration, tests execute sequentially:
16:05:08 - nx run global-store:test ✅
16:05:44 - nx run data-access:test ✅ (36s later)
16:06:57 - nx run edit-content-bridge:test ✅ (73s later)
16:07:14 - nx run edit-ema-ui:test ✅ (17s later)
[Each project starts AFTER previous completes]
This suggests actual parallel execution doesn't happen, but the parallel: 3 configuration may still affect Nx's internal process initialization or resource management.
Affected PRs
Heavily affected (all change core-web test files):
- 34029 browser component #34126 (34029-browser-component) - Changes many test files
- refactor: replace CoreWebService with HttpClient in various services #34165 (issue-34166-usage-ui) - Changes test files
- Update usage dashboard UI to match official dotCMS UI #34168, Issue 34149 update dashboard queries #34179 - Change test files
Pattern: Any PR that changes core-web test files triggers nx affected to run frontend tests, which then cancels 100% of the time.
Solutions
Option 1: Revert Parallelism (Immediate, Recommended for Testing)
Change: core-web/nx.json line 7
- "parallel": 3
+ "parallel": 1
Why test this despite PR #34099 working:
- Quick to validate (2-3 runs, issue is deterministic)
- Low risk, easy to revert if ineffective
- May help even if mechanism unclear
- If it works, unblocks PRs immediately
Validation:
- Create PR that touches any core-web test file
- Apply revert
- Test with 2-3 runs (not 20-30, since issue is deterministic)
- If Frontend Unit Tests succeed → Keep revert
- If still cancels → Try other solutions
Pros:
- Simple one-line change
- Easy to test and rollback
- May restore whatever made PR #34099 succeed
Cons:
- Doesn't explain why PR #34099 worked with parallel: 3
- Doesn't fix underlying test quality
- May not help if issue is GitHub infrastructure
Option 2: Force Jest Exit (Bandaid, Not Recommended)
Change: Add to core-web/jest.preset.js:
module.exports = {
...nxPreset,
forceExit: true, // Force Jest to exit after tests complete
// ... existing config
};
Pros:
- Jobs complete successfully
- Keeps parallel: 3
Cons:
- Masks the problem completely
- Open handles remain undetected
- Technical debt accumulates
- May hide real issues
- Not addressing root cause
Option 3: Configurable Parallelism (Long-term Enhancement)
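Allow different settings for local dev vs CI. Because nx.json is static JSON, it cannot evaluate process.env.CI; the local default stays in the file and CI overrides it with the --parallel flag:
// core-web/nx.json (local default)
{
  "tasksRunnerOptions": {
    "default": {
      "options": {
        "parallel": 3
      }
    }
  }
}
# CI workflow step (overrides the default for CI only)
nx affected -t test --base=main --parallel=1
Benefits: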
- Developers get fast parallel execution locally
- CI gets stable sequential execution
- Can increase CI parallelism after tests are fixed
- Flexible for future adjustments
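If the choice between CI and local parallelism should live in one place, a small launcher script could build the nx arguments from the environment. The sketch below is an assumption about how that might look, not existing project code: `resolveParallel` and `buildTestArgs` are hypothetical names, and it relies only on the `CI=true` variable that GitHub Actions sets by default.

```typescript
type Env = Record<string, string | undefined>;

// Assumed policy from Option 3: 1 task at a time in CI, 3 locally.
function resolveParallel(env: Env): number {
    // GitHub Actions sets CI=true for every workflow run.
    return env['CI'] === 'true' ? 1 : 3;
}

// Build the argument list for `nx affected`, overriding parallelism via the CLI flag.
function buildTestArgs(env: Env, base = 'main'): string[] {
    return ['affected', '-t', 'test', `--base=${base}`, `--parallel=${resolveParallel(env)}`];
}
```

A wrapper would pass `process.env` to `buildTestArgs` and hand the result to `nx`; keeping the policy in a pure function makes it easy to raise the CI value later, once the tests are clean.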
Proper Fix (REQUIRED Regardless of Other Solutions)
The underlying issue must be fixed even if revert works:
cd core-web
nx affected -t test --base=main --detectOpenHandles
For each identified test file:
- Add proper cleanup in afterEach() blocks
- Clear timers: clearTimeout(), clearInterval()
- Unsubscribe from RxJS: .unsubscribe()
- Destroy Angular components properly
- Clean up Zone.js async operations
Example:
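import { interval, Subscription } from 'rxjs';

describe('MyComponent', () => {
    // Handles created by each test; released in afterEach
    let timers: ReturnType<typeof setTimeout>[] = [];
    let subscriptions: Subscription[] = [];

    afterEach(() => {
        timers.forEach((timer) => clearTimeout(timer));
        timers = [];
        subscriptions.forEach((sub) => sub.unsubscribe());
        subscriptions = [];
    });

    it('should work', () => {
        const timer = setTimeout(() => { /* work under test */ }, 1000);
        timers.push(timer);
        // observable$ stands in for whatever stream the test subscribes to
        const observable$ = interval(1000);
        const sub = observable$.subscribe();
        subscriptions.push(sub);
    });
});
Why this is required: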
- Fixes actual technical debt
- Prevents issues even if GitHub behavior changes again
- Improves test reliability
- May indicate production issues
- Enables future parallelism if desired
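To avoid repeating the timer and subscription arrays in every spec file, the cleanup could be centralised in a small helper. This is a sketch under assumptions: `CleanupRegistry`, `trackTimeout`, and `trackSubscription` are hypothetical names, and the subscription type is reduced to anything with an `unsubscribe()` method so the helper does not depend on RxJS.

```typescript
type Teardown = () => void;

// Hypothetical helper that collects teardown callbacks and runs them all at once.
class CleanupRegistry {
    private teardowns: Teardown[] = [];

    add(teardown: Teardown): void {
        this.teardowns.push(teardown);
    }

    // Schedule a timeout and remember to clear it on flush.
    trackTimeout(callback: () => void, ms: number): ReturnType<typeof setTimeout> {
        const timer = setTimeout(callback, ms);
        this.add(() => clearTimeout(timer));
        return timer;
    }

    // Works with any object exposing unsubscribe(), such as an RxJS Subscription.
    trackSubscription<T extends { unsubscribe(): void }>(sub: T): T {
        this.add(() => sub.unsubscribe());
        return sub;
    }

    get pending(): number {
        return this.teardowns.length;
    }

    // Release everything, most recently registered first.
    flush(): void {
        const toRun = this.teardowns.reverse();
        this.teardowns = [];
        toRun.forEach((teardown) => teardown());
    }
}
```

In a spec, one registry per describe block with `afterEach(() => registry.flush())` would cover every timer and subscription created through it.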
Recommended Action Plan
Immediate (Today)
- Apply revert: Change parallel: 3 → 1 in core-web/nx.json
- Test on any PR that touches core-web tests (only need 2-3 runs since deterministic)
- If successful: Merge revert to unblock PRs immediately
- If unsuccessful: Try Option 2 (forceExit) as emergency fallback while investigating
Short-Term (This Week)
- Run nx affected -t test --base=main --detectOpenHandles locally
- Identify all tests with open handles
- Create issues for fixing each leaking test file
- Begin fixing highest-impact tests
Medium-Term (Next Sprint)
- Fix all identified leaking tests
- Remove forceExit if it was used
- Add ESLint rules for common leak patterns:
  - Detect uncleaned timers
  - Require cleanup in afterEach
  - Flag missing unsubscribe
- Implement pre-commit hook with --detectOpenHandles
- Document testing best practices
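Before real ESLint rules exist, a rough text scan can give a first list of candidate files. The sketch below is only a heuristic and not an ESLint rule: a real rule would inspect the AST, and this check will misreport specs that use fake timers or clean up through a shared helper. `findLeakCandidates` is a hypothetical name.

```typescript
interface LeakWarning {
    pattern: string;
    message: string;
}

// Each check pairs a resource-creating call with the call expected to release it.
const LEAK_CHECKS = [
    { pattern: 'setInterval(', cleanup: 'clearInterval(', message: 'setInterval without clearInterval' },
    { pattern: 'setTimeout(', cleanup: 'clearTimeout(', message: 'setTimeout without clearTimeout' },
    { pattern: '.subscribe(', cleanup: '.unsubscribe(', message: 'subscribe without unsubscribe' },
];

// Heuristic: flag a spec file's source when a creating call has no matching cleanup call.
function findLeakCandidates(source: string): LeakWarning[] {
    const warnings: LeakWarning[] = [];
    for (const check of LEAK_CHECKS) {
        if (source.includes(check.pattern) && !source.includes(check.cleanup)) {
            warnings.push({ pattern: check.pattern, message: check.message });
        }
    }
    return warnings;
}
```

The output is best treated as a starting list to confirm with --detectOpenHandles, not as proof of a leak.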
Long-Term (Next Quarter)
- Consider restoring parallel: 3 ONLY after all tests are clean
- Implement configurable parallelism (Option 3)
- Add continuous monitoring for worker failures
- Add CI checks to prevent regression
- Monitor GitHub Actions for infrastructure changes
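Monitoring for worker failures could start as a simple count of the Jest warning in each job log. This is a minimal sketch, assuming the log is available as text; the marker is the message quoted earlier in this issue, while `countWorkerFailures`, `exceedsWorkerFailureBudget`, and the default budget of 0 are illustrative assumptions.

```typescript
// Warning text quoted in this issue's logs.
const WORKER_FAILURE_MARKER = 'A worker process has failed to exit gracefully';

// Count how many times the worker-failure warning appears in a log.
function countWorkerFailures(log: string): number {
    return log.split(WORKER_FAILURE_MARKER).length - 1;
}

// True when the log contains more worker failures than the allowed budget.
function exceedsWorkerFailureBudget(log: string, budget = 0): boolean {
    return countWorkerFailures(log) > budget;
}
```

A CI check built on this could start with a budget matching the current 4-5 occurrences per run and lower it as leaking tests are fixed.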
Key Takeaways
- Issue is DETERMINISTIC: Fails 100% when frontend tests run (after Dec 23), 0% when they don't
- Not probabilistic: Only need 2-3 test runs to validate fixes, not 20-30
- Dec 15-22 "success": PRs simply never triggered frontend tests via nx affected
- Dec 23+ failures: PRs changed test files → triggered frontend tests → revealed issue
- PR #34099 paradox: Had same worker failures but succeeded - suggests GitHub infrastructure changed
- Revert worth testing: Quick validation even if we don't understand why PR #34099 worked
- Must fix root cause: Regardless of revert effectiveness, underlying test quality must be addressed