Frontend test cancellations caused by nx parallel: 3 change exposing uncleaned test resources #34194

## Problem Summary

Frontend unit tests in PR workflows are **cancelling unexpectedly** on the latest commit **without any new code pushes**.

**CRITICAL: This issue is DETERMINISTIC.** The job is cancelled **100% of the time** when frontend tests run (since Dec 23), and **0% of the time** when they don't run.

## Expected vs Unexpected Cancellations

### ✅ EXPECTED: Concurrency Control Cancellation
**When it happens:**
- Developer pushes new commit to PR branch
- Existing workflow run for older commit is automatically cancelled
- New workflow starts for the latest commit

**How it appears:**
- Job status: "cancelled"
- Reason: Concurrency group policy (cancel-in-progress)
- Message: "This run was cancelled because a newer run was triggered"

### ❌ UNEXPECTED: Test Hang Cancellation (This Issue)
**When it happens:**
- No new commits pushed
- Workflow runs on the **latest commit**
- Tests complete successfully (all pass)
- Process hangs for ~3 minutes with no output
- GitHub Actions detects unresponsiveness
- External shutdown signal sent

**How it appears:**
- Job status: "cancelled" (not "failed")
- Tests: All pass (e.g., "Tests: 242 passed, 242 total")
- Worker failures: 4-5 occurrences: "A worker process has failed to exit gracefully..."
- ~3 minute silence after tests complete
- Message: "##[error]The runner has received a shutdown signal"
- Message: "##[error]The operation was canceled"
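
The two cancellation types can be told apart from the job log alone. The following is a minimal sketch, assuming the log is available as plain text; the marker strings are the messages quoted above, while `classifyCancellation` and its labels are hypothetical names introduced here for illustration.

```typescript
// Marker strings are the log messages quoted in this issue.
// The function name and the classification labels are hypothetical.
type CancellationKind = 'concurrency' | 'test-hang' | 'unknown';

const CONCURRENCY_MARKER = 'This run was cancelled because a newer run was triggered';
const SHUTDOWN_MARKER = 'The runner has received a shutdown signal';
const TESTS_DONE_MARKER = 'Ran all test suites';
const WORKER_FAILURE_MARKER = 'A worker process has failed to exit gracefully';

interface CancellationReport {
  kind: CancellationKind;
  workerFailures: number;
}

function classifyCancellation(log: string): CancellationReport {
  // Count marker occurrences by splitting the log on the marker text.
  const workerFailures = log.split(WORKER_FAILURE_MARKER).length - 1;

  if (log.indexOf(CONCURRENCY_MARKER) !== -1) {
    return { kind: 'concurrency', workerFailures };
  }

  // A test-hang cancellation shows the shutdown signal only after Jest
  // has already reported that every suite finished.
  const testsDoneAt = log.indexOf(TESTS_DONE_MARKER);
  const shutdownAt = log.indexOf(SHUTDOWN_MARKER);
  if (testsDoneAt !== -1 && shutdownAt > testsDoneAt) {
    return { kind: 'test-hang', workerFailures };
  }

  return { kind: 'unknown', workerFailures };
}
```

A report with `kind: 'test-hang'` and a non-zero `workerFailures` count matches the pattern described in this issue; anything else needs manual inspection.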

## Root Cause Analysis

### The Deterministic Pattern

**The issue ONLY occurs when:**
1. PR changes core-web test files
2. `nx affected` selects frontend test projects
3. "Frontend Unit Tests" job actually runs

**The issue NEVER occurs when:**
- PRs don't touch core-web
- PRs touch core-web but not test files
- `nx affected` skips frontend tests
- "Frontend Unit Tests" job doesn't run

### Why Dec 15-22 Had "No Issues"

**Dec 15:** PR #34099 merged (introducing the configuration change)
**Dec 15-22:** Every PR in this window either:
- Didn't touch core-web at all, or
- Changed core-web but not test files

As a result:
- `nx affected` found no affected tests
- **Frontend Unit Tests job never ran**
- Builds appeared successful

**Dec 23:** First PR that changed core-web test files
- `nx affected` found affected tests
- **Frontend Unit Tests job ran for first time since PR #34099**
- **Job was cancelled on that first run**
- Issue became visible

### Breaking Change: PR #34099 (December 15, 2025)

**Commit:** `6771f7ef98`
**Title:** "34098 task migrate angular projects from browser esbuild to the new application builder"
**Merged:** December 15, 2025 at 22:25 UTC

**Critical Change:**
```diff
# core-web/nx.json line 7
"options": {
-   "parallel": 1
+   "parallel": 3
}
```

**Other Changes in Same PR:**
- Build system: webpack → esbuild
- TypeScript: `module: es2022`, `esModuleInterop: true`
- HMR: Hot module replacement enabled
- Proxy config: WebSocket changes
- Watch mode: Continuous watch enabled

**Note:** Analysis shows tests run sequentially despite `parallel: 3` setting, but the configuration may still affect Nx's internal process management behavior.

### Why PR #34099 Itself Didn't Fail

**Critical Discovery:** PR #34099 **HAD THE EXACT SAME ISSUE** but succeeded anyway!

**PR #34099 Run 20245075272 (Dec 15, 2025):**
```
19:58 - Frontend tests start
20:08:40 - ⚠️  Worker failure #1: "A worker process has failed to exit gracefully..."
20:09:08 - ⚠️  Worker failure #2
20:10:19 - ⚠️  Worker failure #3
[Multiple worker failures throughout]
20:21:12 - ✅ "Ran all test suites" (LAST TEST OUTPUT)
[3 minutes 32 seconds of silence - NO OUTPUT]
20:24:44 - ✅ "BUILD SUCCESS"
20:25:10 - ✅ Job completes successfully
```
**Result:** ✅ **SUCCESS** - No shutdown signal sent

**Failing PRs (Dec 23+):**
```
Tests start
[Multiple worker failures]
XX:XX:XX - ✅ "Ran all test suites" (LAST TEST OUTPUT)
[~3 minutes of silence - NO OUTPUT]
XX:XX:XX - ❌ "##[error]The runner has received a shutdown signal"
XX:XX:XX - ❌ "##[error]The operation was canceled"
```
**Result:** ❌ **CANCELLED** - Shutdown signal sent

**Key Difference:**
- PR #34099: 3.5 min delay **tolerated** → Success
- Dec 23+ PRs: 3 min delay **NOT tolerated** → Cancelled

**Most Likely Explanation (unverified):** GitHub Actions runner behavior may have changed between Dec 15 and Dec 23. Possible changes include:
- A shorter runner health-check timeout
- A different health-check algorithm
- Different behavior on Ubuntu 24.04 runners
- More aggressive unresponsiveness detection

### Underlying Technical Debt

**Tests have uncleaned resources (existed BEFORE PR #34099):**
- Active timers (setTimeout, setInterval) not cleared
- RxJS subscriptions not unsubscribed
- Unresolved promises
- Zone.js async operations
- Angular change detection intervals
- PrimeNG component internal timers

**Result:**
- Jest workers fail to exit gracefully (4-5 per run)
- Jest parent process hangs waiting for cleanup
- Process becomes unresponsive (no log output)
- GitHub Actions detects hung process
- External shutdown signal sent (since Dec 23)
- Job cancelled (not failed, because tests passed)

**Note:** This technical debt existed before PR #34099, but GitHub's tolerance for it appears to have changed.

## Evidence

### Example Cancelled Run: 20622139223 (Dec 31, 2025)

```
16:14:54 - ✅ ALL TESTS PASS: "Test Suites: 33 passed, 33 total"
16:14:54 - ⚠️  Jest worker fails to exit gracefully (5th occurrence)
16:14:54 - ✅ "Ran all test suites" - TESTS COMPLETE
[~3 minute hang - Jest waiting for workers to cleanup]
16:17:54 - ❌ "The runner has received a shutdown signal"
16:17:55 - ❌ "The operation was canceled"
```
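
The silence windows quoted in this issue can be checked with simple timestamp arithmetic. This is a small sketch using the `HH:MM:SS` values from the two timelines in this issue (PR #34099's run and cancelled run 20622139223); the helper names are hypothetical, and both timestamps are assumed to fall on the same day.

```typescript
// Convert an "HH:MM:SS" log timestamp to seconds since midnight.
function toSeconds(timestamp: string): number {
  const [hours, minutes, seconds] = timestamp.split(':').map(Number);
  return hours * 3600 + minutes * 60 + seconds;
}

// Seconds of silence between the last test output and the next log line.
// Assumes both timestamps fall on the same day.
function silenceSeconds(lastTestOutput: string, nextLogLine: string): number {
  return toSeconds(nextLogLine) - toSeconds(lastTestOutput);
}

function formatGap(totalSeconds: number): string {
  return `${Math.floor(totalSeconds / 60)}m ${totalSeconds % 60}s`;
}

// PR #34099 run: "Ran all test suites" at 20:21:12, "BUILD SUCCESS" at 20:24:44.
const toleratedGap = silenceSeconds('20:21:12', '20:24:44');
// Cancelled run 20622139223: tests done at 16:14:54, shutdown signal at 16:17:54.
const cancelledGap = silenceSeconds('16:14:54', '16:17:54');

console.log(formatGap(toleratedGap)); // 3m 32s
console.log(formatGap(cancelledGap)); // 3m 0s
```

The tolerated hang was longer than the cancelled one, which is why hang duration alone does not explain the difference between the two outcomes.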

### Job Presence Analysis

**Successful Builds (Dec 15-22):**
- Run 20341280316 (Dec 18): ❌ NO "Frontend Unit Tests" job → ✅ Success
- Run 20316649587 (Dec 17): ❌ NO "Frontend Unit Tests" job → ✅ Success

**Cancelled Builds (Dec 23+):**
- Run 20641205755 (Jan 1): ✅ HAS "Frontend Unit Tests" job → ❌ Cancelled
- Run 20622139223 (Dec 31): ✅ HAS "Frontend Unit Tests" job → ❌ Cancelled

### Sequential Execution Observation

Despite `parallel: 3` configuration, tests execute sequentially:
```
16:05:08 - nx run global-store:test ✅
16:05:44 - nx run data-access:test ✅        (36s later)
16:06:57 - nx run edit-content-bridge:test ✅ (73s later)
16:07:14 - nx run edit-ema-ui:test ✅        (17s later)
[Each project starts AFTER previous completes]
```

This suggests actual parallel execution doesn't happen, but the `parallel: 3` configuration may still affect Nx's internal process initialization or resource management.

## Affected PRs

**Heavily affected (all change core-web test files):**
- #34126 (34029-browser-component) - Changes many test files
- #34165 (issue-34166-usage-ui) - Changes test files
- #34168, #34179 - Change test files

**Pattern:** Any PR that changes core-web test files causes `nx affected` to run the frontend tests, and that job is then cancelled 100% of the time.

## Solutions

### Option 1: Revert Parallelism (Immediate, Recommended for Testing)

**Change:** `core-web/nx.json` line 7
```diff
-   "parallel": 3
+   "parallel": 1
```

**Why test this despite PR #34099 working:**
- Quick to validate (2-3 runs, issue is deterministic)
- Low risk, easy to revert if ineffective
- May help even if the mechanism is unclear
- If it works, unblocks PRs immediately

**Validation:**
- Create PR that touches any core-web test file
- Apply revert
- Test with 2-3 runs (not 20-30, since issue is deterministic)
- If Frontend Unit Tests succeed → Keep revert
- If still cancels → Try other solutions

**Pros:**
- Simple one-line change
- Easy to test and rollback
- Restores the configuration that was in place before PR #34099

**Cons:**
- Doesn't explain why PR #34099 worked with `parallel: 3`
- Doesn't fix underlying test quality
- May not help if issue is GitHub infrastructure

### Option 2: Force Jest Exit (Bandaid, Not Recommended)

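**Change:** Add to `core-web/jest.preset.js`:
```javascript
// Import as in the Nx-generated preset; keep whatever import the file already has.
const nxPreset = require('@nx/jest/preset').default;

module.exports = {
    ...nxPreset,
    forceExit: true,  // Force Jest to exit after tests complete
    // ... existing config
};
```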

**Pros:**
- Jobs complete successfully
- Keeps parallel: 3

**Cons:**
- Masks the problem completely
- Open handles remain undetected
- Technical debt accumulates
- May hide real issues
- Not addressing root cause

### Option 3: Configurable Parallelism (Long-term Enhancement)

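**Allow different settings for local dev vs CI.** `nx.json` is static JSON and cannot evaluate `process.env.CI`, so the CI value has to be supplied on the command line (for example `nx affected -t test --parallel=1`) while `core-web/nx.json` keeps the local default:

```json
{
  "tasksRunnerOptions": {
    "default": {
      "options": {
        "parallel": 3
      }
    }
  }
}
```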

**Benefits:**
- Developers get fast parallel execution locally
- CI gets stable sequential execution
- Can increase CI parallelism after tests are fixed
- Flexible for future adjustments
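
One way to wire this up is a small wrapper that picks the `--parallel` value from the environment before invoking Nx. This is a sketch only: `resolveParallel` and `buildTestCommand` are hypothetical helpers, the values 1 and 3 come from this issue, and the environment is passed in as a plain object so the logic can be exercised without touching the real process.

```typescript
// Hypothetical helpers for a CI wrapper script; only the values 1 and 3
// and the nx command itself come from this issue.
type Env = Record<string, string | undefined>;

function resolveParallel(env: Env): number {
  // GitHub Actions sets CI=true; treat any non-empty, non-"false" value as CI.
  const ci = env['CI'];
  const isCi = ci !== undefined && ci !== '' && ci !== 'false';
  return isCi ? 1 : 3;
}

function buildTestCommand(env: Env): string[] {
  return ['nx', 'affected', '-t', 'test', `--parallel=${resolveParallel(env)}`];
}

console.log(buildTestCommand({ CI: 'true' }).join(' '));
// nx affected -t test --parallel=1
console.log(buildTestCommand({}).join(' '));
// nx affected -t test --parallel=3
```

In a real script, `process.env` would be passed as `env` and the resulting array handed to whatever process launcher the workflow already uses.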

### Proper Fix (REQUIRED Regardless of Other Solutions)

**The underlying issue must be fixed even if revert works:**

```bash
cd core-web
nx affected -t test --base=main --detectOpenHandles
```

**For each identified test file:**
1. Add proper cleanup in `afterEach()` blocks
2. Clear timers: `clearTimeout()`, `clearInterval()`
3. Unsubscribe from RxJS: `.unsubscribe()`
4. Destroy Angular components properly
5. Clean up Zone.js async operations

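**Example:**
```typescript
import { interval, Subscription } from 'rxjs';

describe('MyComponent', () => {
  // Track every timer and subscription a test creates so afterEach can release them.
  let timers: ReturnType<typeof setTimeout>[] = [];
  let subscriptions: Subscription[] = [];

  afterEach(() => {
    timers.forEach((timer) => clearTimeout(timer));
    timers = [];
    subscriptions.forEach((sub) => sub.unsubscribe());
    subscriptions = [];
  });

  it('should work', () => {
    const timer = setTimeout(() => { /* delayed work under test */ }, 1000);
    timers.push(timer);

    // interval() stands in for whatever observable the test subscribes to.
    const sub = interval(1000).subscribe();
    subscriptions.push(sub);
  });
});
```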

**Why this is required:**
- Fixes actual technical debt
- Prevents issues even if GitHub behavior changes again
- Improves test reliability
- May indicate production issues
- Enables future parallelism if desired
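
To avoid repeating per-test bookkeeping arrays in every spec file, the cleanup pattern could be factored into a shared helper. The following is a minimal sketch, not existing core-web code: `CleanupRegistry` is a hypothetical class, and the fake subscription object stands in for an RxJS `Subscription` so the example runs without extra dependencies.

```typescript
// Hypothetical helper: collects teardown callbacks during a test and runs
// them all at once, so no timer or subscription outlives the test.
type Teardown = () => void;

class CleanupRegistry {
  private teardowns: Teardown[] = [];

  // Register a resource with its teardown and return it for inline use.
  add<T>(resource: T, teardown: (resource: T) => void): T {
    this.teardowns.push(() => teardown(resource));
    return resource;
  }

  get pending(): number {
    return this.teardowns.length;
  }

  // Run teardowns in reverse registration order, then leave the registry empty.
  flush(): void {
    while (this.teardowns.length > 0) {
      const teardown = this.teardowns.pop();
      if (teardown) {
        teardown();
      }
    }
  }
}

const registry = new CleanupRegistry();

// Stand-in for an RxJS Subscription (assumption for illustration only).
const fakeSubscription = {
  closed: false,
  unsubscribe() {
    this.closed = true;
  },
};

registry.add(fakeSubscription, (sub) => sub.unsubscribe());
registry.add(setTimeout(() => {}, 60000), (timer) => clearTimeout(timer));

const pendingBeforeFlush = registry.pending;
registry.flush();
console.log(pendingBeforeFlush, registry.pending, fakeSubscription.closed);
```

In a spec file, a single `afterEach(() => registry.flush())` would replace the separate timer and subscription arrays.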

## Recommended Action Plan

### Immediate (Today)
1. **Apply revert:** Change `parallel: 3 → 1` in `core-web/nx.json`
2. **Test on any PR that touches core-web tests** (only need 2-3 runs since deterministic)
3. **If successful:** Merge revert to unblock PRs immediately
4. **If unsuccessful:** Try Option 2 (forceExit) as emergency fallback while investigating

### Short-Term (This Week)
1. Run `nx affected -t test --base=main --detectOpenHandles` locally
2. Identify all tests with open handles
3. Create issues for fixing each leaking test file
4. Begin fixing highest-impact tests

### Medium-Term (Next Sprint)
1. Fix all identified leaking tests
2. Remove `forceExit` if it was used
3. Add ESLint rules for common leak patterns:
   - Detect uncleaned timers
   - Require cleanup in afterEach
   - Flag missing unsubscribe
4. Implement pre-commit hook with `--detectOpenHandles`
5. Document testing best practices
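
Until proper lint rules exist, a rough text scan can help triage which spec files to inspect first. This is a crude heuristic sketch, not an ESLint rule: `scanSpecSource` is a hypothetical function, it only checks whether creation and cleanup calls both appear somewhere in the same file, and it will produce false positives and false negatives.

```typescript
// Crude text heuristic, not an ESLint rule: flags spec sources that create
// timers or subscriptions with no matching cleanup call in the same file.
interface LeakSuspicion {
  uncleanedTimers: boolean;
  missingUnsubscribe: boolean;
  noAfterEach: boolean;
}

function scanSpecSource(source: string): LeakSuspicion {
  const has = (pattern: RegExp): boolean => pattern.test(source);

  const createsTimer = has(/\bset(Timeout|Interval)\s*\(/);
  // Fake timers are treated as cleanup because no real timer is scheduled.
  const clearsTimer = has(/\bclear(Timeout|Interval)\s*\(/) || has(/\buseFakeTimers\s*\(/);
  const subscribes = has(/\.subscribe\s*\(/);
  const unsubscribes = has(/\.unsubscribe\s*\(/) || has(/\btakeUntil\s*\(/);

  return {
    uncleanedTimers: createsTimer && !clearsTimer,
    missingUnsubscribe: subscribes && !unsubscribes,
    noAfterEach: (createsTimer || subscribes) && !has(/\bafterEach\s*\(/),
  };
}

const leakySpec = "it('x', () => { setInterval(() => {}, 10); source$.subscribe(); });";
console.log(scanSpecSource(leakySpec));
```

Files flagged by a scan like this are candidates for a `--detectOpenHandles` run, which remains the authoritative check.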

### Long-Term (Next Quarter)
1. Consider restoring `parallel: 3` ONLY after all tests are clean
2. Implement configurable parallelism (Option 3)
3. Add continuous monitoring for worker failures
4. Add CI checks to prevent regression
5. Monitor GitHub Actions for infrastructure changes

## Key Takeaways

1. **Issue is DETERMINISTIC:** The job is cancelled 100% of the time when frontend tests run (since Dec 23), 0% when they don't
2. **Not probabilistic:** Only need 2-3 test runs to validate fixes, not 20-30
3. **Dec 15-22 "success":** PRs simply never triggered frontend tests via `nx affected`
4. **Dec 23+ failures:** PRs changed test files → triggered frontend tests → revealed issue
5. **PR #34099 paradox:** Had same worker failures but succeeded - suggests GitHub infrastructure changed
6. **Revert worth testing:** Quick validation even if we don't understand why PR #34099 worked
7. **Must fix root cause:** Regardless of revert effectiveness, underlying test quality must be addressed

