
Frontend test cancellations caused by nx parallel: 3 change exposing uncleaned test resources #34194

@spbolton

Problem Summary

Frontend unit tests in PR workflows are cancelling unexpectedly on the latest commit without any new code pushes.

CRITICAL: This issue is DETERMINISTIC. It fails 100% of the time when frontend tests run (after Dec 23), and 0% of the time when they don't run.

Expected vs Unexpected Cancellations

✅ EXPECTED: Concurrency Control Cancellation

When it happens:

  • Developer pushes new commit to PR branch
  • Existing workflow run for older commit is automatically cancelled
  • New workflow starts for the latest commit

How it appears:

  • Job status: "cancelled"
  • Reason: Concurrency group policy (cancel-in-progress)
  • Message: "This run was cancelled because a newer run was triggered"

❌ UNEXPECTED: Test Hang Cancellation (This Issue)

When it happens:

  • No new commits pushed
  • Workflow runs on the latest commit
  • Tests complete successfully (all pass)
  • Process hangs for ~3 minutes with no output
  • GitHub Actions detects unresponsiveness
  • External shutdown signal sent

How it appears:

  • Job status: "cancelled" (not "failed")
  • Tests: All pass (e.g., "Tests: 242 passed, 242 total")
  • Worker failures: 4-5 occurrences: "A worker process has failed to exit gracefully..."
  • ~3 minute silence after tests complete
  • Message: "##[error]The runner has received a shutdown signal"
  • Message: "##[error]The operation was canceled"
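The two cancellation types above can be told apart from the job log alone. The following is a minimal sketch, assuming the marker strings quoted in this issue appear verbatim in the log; `classifyCancellation` and its types are hypothetical names, not part of any GitHub or Jest API.

```typescript
type CancellationKind = 'concurrency' | 'test-hang' | 'unknown';

interface Classification {
  kind: CancellationKind;
  workerFailures: number;
}

// Marker strings copied from the log excerpts quoted in this issue.
const NEWER_RUN = 'This run was cancelled because a newer run was triggered';
const SHUTDOWN = 'The runner has received a shutdown signal';
const WORKER_LEAK = 'A worker process has failed to exit gracefully';
const TESTS_DONE = 'Ran all test suites';

function classifyCancellation(log: string): Classification {
  const lines = log.split('\n');
  // Count Jest worker warnings regardless of the final classification.
  const workerFailures = lines.filter((l) => l.includes(WORKER_LEAK)).length;

  // Expected case: the concurrency group cancelled an older run.
  if (lines.some((l) => l.includes(NEWER_RUN))) {
    return { kind: 'concurrency', workerFailures };
  }

  // Unexpected case: tests finished, then the runner was shut down.
  const testsFinished = lines.some((l) => l.includes(TESTS_DONE));
  const shutdown = lines.some((l) => l.includes(SHUTDOWN));
  if (testsFinished && shutdown) {
    return { kind: 'test-hang', workerFailures };
  }

  return { kind: 'unknown', workerFailures };
}
```

A run classified as `test-hang` with a non-zero `workerFailures` count matches the pattern described in this issue; anything else needs a manual look.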

Root Cause Analysis

The Deterministic Pattern

The issue ONLY occurs when:

  1. PR changes core-web test files
  2. nx affected selects frontend test projects
  3. "Frontend Unit Tests" job actually runs

The issue NEVER occurs when:

  • PRs don't touch core-web
  • PRs touch core-web but not test files
  • nx affected skips frontend tests
  • "Frontend Unit Tests" job doesn't run

Why Dec 15-22 Had "No Issues"

Dec 15: PR #34099 merged (introducing the configuration change)
Dec 15-22: Every PR fell into one of these cases:

  • Didn't touch core-web at all
  • Changed core-web but not test files
  • nx affected found no affected tests

As a result, the Frontend Unit Tests job never ran and builds appeared successful.

Dec 23: First PR that changed core-web test files

Breaking Change: PR #34099 (December 15, 2025)

Commit: 6771f7ef98
Title: "34098 task migrate angular projects from browser esbuild to the new application builder"
Merged: December 15, 2025 at 22:25 UTC

Critical Change:

# core-web/nx.json line 7
"options": {
-   "parallel": 1
+   "parallel": 3
}

Other Changes in Same PR:

  • Build system: webpack → esbuild
  • TypeScript: module: es2022, esModuleInterop: true
  • HMR: Hot module replacement enabled
  • Proxy config: WebSocket changes
  • Watch mode: Continuous watch enabled

Note: Analysis shows tests run sequentially despite parallel: 3 setting, but the configuration may still affect Nx's internal process management behavior.

Why PR #34099 Itself Didn't Fail

Critical Discovery: PR #34099 HAD THE EXACT SAME ISSUE but succeeded anyway!

PR #34099 Run 20245075272 (Dec 15, 2025):

19:58 - Frontend tests start
20:08:40 - ⚠️  Worker failure #1: "A worker process has failed to exit gracefully..."
20:09:08 - ⚠️  Worker failure #2
20:10:19 - ⚠️  Worker failure #3
[Multiple worker failures throughout]
20:21:12 - ✅ "Ran all test suites" (LAST TEST OUTPUT)
[3 minutes 32 seconds of silence - NO OUTPUT]
20:24:44 - ✅ "BUILD SUCCESS"
20:25:10 - ✅ Job completes successfully

Result: SUCCESS - No shutdown signal sent

Failing PRs (Dec 23+):

Tests start
[Multiple worker failures]
XX:XX:XX - ✅ "Ran all test suites" (LAST TEST OUTPUT)
[~3 minutes of silence - NO OUTPUT]
XX:XX:XX - ❌ "##[error]The runner has received a shutdown signal"
XX:XX:XX - ❌ "##[error]The operation was canceled"

Result: CANCELLED - Shutdown signal sent

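Key Difference: Both runs show the same worker failures and the same ~3 minute silence. The only difference is that the failing runs receive a shutdown signal during the silence and the PR #34099 run did not.

Most Likely Explanation: GitHub Actions infrastructure changed between Dec 15 and Dec 23. Possible changes (all unconfirmed):

  • Runner health check timeout may have been reduced
  • Health check algorithm may have changed
  • Ubuntu 24.04 runner behavior may differ
  • Unresponsiveness detection may have become more aggressive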

Underlying Technical Debt

Tests have uncleaned resources (existed BEFORE PR #34099):

  • Active timers (setTimeout, setInterval) not cleared
  • RxJS subscriptions not unsubscribed
  • Unresolved promises
  • Zone.js async operations
  • Angular change detection intervals
  • PrimeNG component internal timers

Result:

  • Jest workers fail to exit gracefully (4-5 per run)
  • Jest parent process hangs waiting for cleanup
  • Process becomes unresponsive (no log output)
  • GitHub Actions detects hung process
  • External shutdown signal sent (after Dec 23)
  • Job cancelled (not failed, because tests passed)

Note: This technical debt existed before PR #34099, but GitHub's tolerance for it appears to have changed.

Evidence

Example Cancelled Run: 20622139223 (Dec 31, 2025)

16:14:54 - ✅ ALL TESTS PASS: "Test Suites: 33 passed, 33 total"
16:14:54 - ⚠️  Jest worker fails to exit gracefully (5th occurrence)
16:14:54 - ✅ "Ran all test suites" - TESTS COMPLETE
[~3 minute hang - Jest waiting for workers to cleanup]
16:17:54 - ❌ "The runner has received a shutdown signal"
16:17:55 - ❌ "The operation was canceled"
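To compare the silent periods across runs, the gaps can be computed directly from the log timestamps. This is a small sketch that assumes same-day `HH:MM:SS` timestamps as they appear in the excerpts in this issue; the function names are illustrative.

```typescript
// Convert an "HH:MM:SS" timestamp to seconds since midnight.
function toSeconds(timestamp: string): number {
  const [h, m, s] = timestamp.split(':').map(Number);
  return h * 3600 + m * 60 + s;
}

// Seconds elapsed between the last test output and the next log line.
function silenceSeconds(lastOutput: string, nextOutput: string): number {
  return toSeconds(nextOutput) - toSeconds(lastOutput);
}

// PR #34099 run 20245075272: "Ran all test suites" -> "BUILD SUCCESS"
const successGap = silenceSeconds('20:21:12', '20:24:44'); // 212 (3 min 32 s)

// Cancelled run 20622139223: "Ran all test suites" -> shutdown signal
const cancelledGap = silenceSeconds('16:14:54', '16:17:54'); // 180 (3 min 0 s)
```

The successful run stayed silent for 212 seconds, while the cancelled run was stopped at 180 seconds. That is consistent with, though not proof of, a roughly 3 minute threshold that was not enforced on Dec 15.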

Job Presence Analysis

Successful Builds (Dec 15-22):

  • Run 20341280316 (Dec 18): ❌ NO "Frontend Unit Tests" job → ✅ Success
  • Run 20316649587 (Dec 17): ❌ NO "Frontend Unit Tests" job → ✅ Success

Cancelled Builds (Dec 23+):

  • Run 20641205755 (Jan 1): ✅ HAS "Frontend Unit Tests" job → ❌ Cancelled
  • Run 20622139223 (Dec 31): ✅ HAS "Frontend Unit Tests" job → ❌ Cancelled

Sequential Execution Observation

Despite the parallel: 3 configuration, the test projects execute one after another:

16:05:08 - nx run global-store:test ✅
16:05:44 - nx run data-access:test ✅        (36s later)
16:06:57 - nx run edit-content-bridge:test ✅ (73s later)
16:07:14 - nx run edit-ema-ui:test ✅        (17s later)
[Each project starts AFTER previous completes]

This suggests actual parallel execution doesn't happen, but the parallel: 3 configuration may still affect Nx's internal process initialization or resource management.

Affected PRs

Heavily affected (all change core-web test files):

Pattern: Any PR that changes core-web test files triggers nx affected to run frontend tests, which then cancels 100% of the time.

Solutions

Option 1: Revert Parallelism (Immediate, Recommended for Testing)

Change: core-web/nx.json line 7

-   "parallel": 3
+   "parallel": 1

Why test this despite PR #34099 working:

  • Quick to validate (2-3 runs, issue is deterministic)
  • Low risk, easy to revert if ineffective
  • May help even if mechanism unclear
  • If it works, unblocks PRs immediately

Validation:

  • Create PR that touches any core-web test file
  • Apply revert
  • Test with 2-3 runs (not 20-30, since issue is deterministic)
  • If Frontend Unit Tests succeed → Keep revert
  • If still cancels → Try other solutions

Pros:

Cons:

Option 2: Force Jest Exit (Bandaid, Not Recommended)

Change: Add to core-web/jest.preset.js:

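// Assumes the standard Nx Jest preset import; keep whatever the file already uses
const nxPreset = require('@nx/jest/preset').default;

module.exports = {
    ...nxPreset,
    forceExit: true,  // Force Jest to exit after tests complete, even if handles are still open
    // ... existing config
};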

Pros:

  • Jobs complete successfully
  • Keeps parallel: 3

Cons:

  • Masks the problem completely
  • Open handles remain undetected
  • Technical debt accumulates
  • May hide real issues
  • Not addressing root cause

Option 3: Configurable Parallelism (Long-term Enhancement)

Allow different settings for local dev vs CI:

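// core-web/nx.json
// nx.json is static JSON and cannot evaluate process.env.CI, so keep the
// local default here and override it in CI with the --parallel flag.
{
  "tasksRunnerOptions": {
    "default": {
      "options": {
        "parallel": 3
      }
    }
  }
}

# CI workflow step: 1 for CI, 3 for local
nx affected -t test --base=main --parallel=1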

Benefits:

  • Developers get fast parallel execution locally
  • CI gets stable sequential execution
  • Can increase CI parallelism after tests are fixed
  • Flexible for future adjustments
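Since nx.json is plain JSON, the CI/local switch has to live outside the file. One way to sketch it is a small wrapper that builds the nx arguments from the environment. `resolveParallel` and `buildNxTestArgs` are hypothetical helpers, and the values 1 and 3 are the ones proposed above.

```typescript
type Env = Record<string, string | undefined>;

// Pick the parallelism: 1 on CI, 3 locally.
// GitHub Actions sets CI=true; an unset, empty, or "false" value counts as local.
function resolveParallel(env: Env): number {
  const ci = env.CI !== undefined && env.CI !== '' && env.CI !== 'false';
  return ci ? 1 : 3;
}

// Build the argument list for `nx`, using the --parallel CLI flag
// to override whatever nx.json configures.
function buildNxTestArgs(env: Env, base = 'main'): string[] {
  return [
    'affected',
    '-t',
    'test',
    `--base=${base}`,
    `--parallel=${resolveParallel(env)}`,
  ];
}
```

A wrapper script would pass `process.env` to `buildNxTestArgs` and hand the result to `nx`. Setting `--parallel=1` directly in the workflow step achieves the same thing without a script; the helper only pays off if the rule grows more complex.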

Proper Fix (REQUIRED Regardless of Other Solutions)

The underlying issue must be fixed even if revert works:

cd core-web
nx affected -t test --base=main --detectOpenHandles

For each identified test file:

  1. Add proper cleanup in afterEach() blocks
  2. Clear timers: clearTimeout(), clearInterval()
  3. Unsubscribe from RxJS: .unsubscribe()
  4. Destroy Angular components properly
  5. Clean up Zone.js async operations

Example:

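import { Subscription, interval } from 'rxjs';

describe('MyComponent', () => {
  // Track every timer and subscription a test creates so afterEach can release them
  let timers: ReturnType<typeof setTimeout>[] = [];
  let subscriptions: Subscription[] = [];

  afterEach(() => {
    timers.forEach((timer) => clearTimeout(timer));
    timers = [];
    subscriptions.forEach((sub) => sub.unsubscribe());
    subscriptions = [];
  });

  it('should work', () => {
    const timer = setTimeout(() => { /* delayed work under test */ }, 1000);
    timers.push(timer);

    // interval() stands in for whatever observable the test subscribes to
    const sub = interval(1000).subscribe();
    subscriptions.push(sub);
  });
});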
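The manual bookkeeping in the example above gets repetitive across many spec files. A shared helper could collect teardown callbacks in one place. This is a sketch only: `TestCleanup` is a hypothetical class, not an existing utility in core-web, and it relies only on an `unsubscribe()` method so it works with RxJS subscriptions without importing RxJS.

```typescript
// Anything with an unsubscribe() method, e.g. an RxJS Subscription.
interface Unsubscribable {
  unsubscribe(): void;
}

class TestCleanup {
  private teardowns: Array<() => void> = [];

  // Register an arbitrary teardown callback.
  add(teardown: () => void): void {
    this.teardowns.push(teardown);
  }

  // Create a timeout that is cleared automatically on runAll().
  timeout(callback: () => void, ms: number): void {
    const handle = setTimeout(callback, ms);
    this.add(() => clearTimeout(handle));
  }

  // Create an interval that is cleared automatically on runAll().
  interval(callback: () => void, ms: number): void {
    const handle = setInterval(callback, ms);
    this.add(() => clearInterval(handle));
  }

  // Track a subscription and return it for further use in the test.
  subscription<T extends Unsubscribable>(sub: T): T {
    this.add(() => sub.unsubscribe());
    return sub;
  }

  // Run teardowns in reverse registration order; returns how many ran.
  runAll(): number {
    const count = this.teardowns.length;
    while (this.teardowns.length > 0) {
      const teardown = this.teardowns.pop();
      if (teardown) teardown();
    }
    return count;
  }
}
```

In a spec file this would be used as `const cleanup = new TestCleanup();` at the top of the `describe` block with `afterEach(() => cleanup.runAll());`. It does not cover Angular fixtures or Zone.js tasks, which still need their own teardown.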

Why this is required:

  • Fixes actual technical debt
  • Prevents issues even if GitHub behavior changes again
  • Improves test reliability
  • May indicate production issues
  • Enables future parallelism if desired

Recommended Action Plan

Immediate (Today)

  1. Apply revert: Change parallel: 3 → 1 in core-web/nx.json
  2. Test on any PR that touches core-web tests (only need 2-3 runs since deterministic)
  3. If successful: Merge revert to unblock PRs immediately
  4. If unsuccessful: Try Option 2 (forceExit) as emergency fallback while investigating

Short-Term (This Week)

  1. Run nx affected -t test --base=main --detectOpenHandles locally
  2. Identify all tests with open handles
  3. Create issues for fixing each leaking test file
  4. Begin fixing highest-impact tests

Medium-Term (Next Sprint)

  1. Fix all identified leaking tests
  2. Remove forceExit if it was used
  3. Add ESLint rules for common leak patterns:
    • Detect uncleaned timers
    • Require cleanup in afterEach
    • Flag missing unsubscribe
  4. Implement pre-commit hook with --detectOpenHandles
  5. Document testing best practices
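Before investing in real ESLint rules, a crude text scan can produce a first list of suspicious spec files. The sketch below is a heuristic, not an ESLint rule: it only compares how often a handle is created versus released in a file's source text. `findLeakSuspects` is a hypothetical helper, and it will miss cleanup done through other means (for example `takeUntil` or `fakeAsync`).

```typescript
interface LeakSuspect {
  pattern: string;
  created: number;
  released: number;
}

// Pairs of "creates a handle" / "releases a handle" call patterns.
const HANDLE_PAIRS: Array<{ name: string; create: RegExp; release: RegExp }> = [
  { name: 'setTimeout', create: /\bsetTimeout\s*\(/g, release: /\bclearTimeout\s*\(/g },
  { name: 'setInterval', create: /\bsetInterval\s*\(/g, release: /\bclearInterval\s*\(/g },
  { name: 'subscribe', create: /\.subscribe\s*\(/g, release: /\.unsubscribe\s*\(/g },
];

function countMatches(source: string, pattern: RegExp): number {
  return (source.match(pattern) ?? []).length;
}

// Report every pattern that is created more often than it is released.
function findLeakSuspects(source: string): LeakSuspect[] {
  return HANDLE_PAIRS.map(({ name, create, release }) => ({
    pattern: name,
    created: countMatches(source, create),
    released: countMatches(source, release),
  })).filter((suspect) => suspect.created > suspect.released);
}
```

Running this over the spec files of an affected project gives a rough priority list; the files it flags still need confirmation with --detectOpenHandles.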

Long-Term (Next Quarter)

  1. Consider restoring parallel: 3 ONLY after all tests are clean
  2. Implement configurable parallelism (Option 3)
  3. Add continuous monitoring for worker failures
  4. Add CI checks to prevent regression
  5. Monitor GitHub Actions for infrastructure changes

Key Takeaways

  1. Issue is DETERMINISTIC: Fails 100% when frontend tests run (after Dec 23), 0% when they don't
  2. Not probabilistic: Only need 2-3 test runs to validate fixes, not 20-30
  3. Dec 15-22 "success": PRs simply never triggered frontend tests via nx affected
  4. Dec 23+ failures: PRs changed test files → triggered frontend tests → revealed issue
  5. PR #34099 paradox: Had same worker failures but succeeded - suggests GitHub infrastructure changed
  6. Revert worth testing: Quick validation even if we don't understand why PR #34099 worked
  7. Must fix root cause: Regardless of revert effectiveness, underlying test quality must be addressed
