Skip to content

Deflake CI tests#2139

Open
cconstable wants to merge 4 commits into
mainfrom
test-deflaking-the-sequel
Open

Deflake CI tests#2139
cconstable wants to merge 4 commits into
mainfrom
test-deflaking-the-sequel

Conversation

@cconstable

@cconstable cconstable commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

tl;dr the way we concurrently run tests can intermittently OOM runners and we have tests that compare stack traces (which can change).

What changed

  1. Split test-integration-workflows.ts (55 tests) into 6 themed files, dropping peak within-file concurrency from 53 to ≤12:

    • test-integration-cancellation-scopes.ts (12)
    • test-integration-activities.ts (9)
    • test-integration-workflow-start.ts (9)
    • test-integration-workflow-info.ts (12)
    • test-integration-replay-and-flags.ts (8)
    • test-integration-reserved-prefixes.ts (5)

    Shared workflow/activity/interceptor definitions moved to test-integration-workflows-common.ts, which each file re-exports (export *) so every file's bundle stays identical to the
    original (workflowsPath: __filename + dynamic interceptor dispatch unchanged).

  2. Heap usage guardrail: threaded-vm.ts reads TEMPORAL_WORKER_THREAD_MAX_HEAP_MB and applies it as the worker thread's resourceLimits.maxOldGenerationSizeMb, so a
    single runaway Workflow fails fast with a clear error instead of dragging the process into memory pressure. CI sets it to 1024 for Node jobs; unset (no cap) in production.

  3. Stack trace tests now ignore the "header" line of the stack trace since different engines output different lines.

More details

Integration-test CI jobs intermittently failed with ERR_WORKER_OUT_OF_MEMORY / AVA timeouts (and a native Bun crash) even when assertions passed. Root cause: AVA's concurrency: 1
only serializes test files — within a file all non-serial tests run at once. test-integration-workflows.ts ran 53 tests concurrently, each spinning up its own Worker (worker
thread + V8 context), exhausting memory.

Profiling revealed that peak RSS scales ~linearly with concurrently-live Workers (~47 MB each; ~334 MB at 1, ~1.8 GB at 32, with trivial workflows).

Verification

All 55 tests pass locally across the 6 files (including the time-skipping/interceptor test). eslint/prettier clean; @temporalio/worker and @temporalio/test build.

@cconstable cconstable force-pushed the test-deflaking-the-sequel branch from e6db15b to b3e9d60 Compare June 24, 2026 20:15
@cconstable cconstable force-pushed the test-deflaking-the-sequel branch from b3e9d60 to 5ae34a2 Compare June 24, 2026 20:29
@cconstable cconstable changed the title De-flaking tests by relieving memory pressure Deflake CI tests Jun 24, 2026
@cconstable cconstable marked this pull request as ready for review June 24, 2026 21:33
@cconstable cconstable requested a review from a team as a code owner June 24, 2026 21:33

@chris-olszewski chris-olszewski left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test splitting is great, unsure about the other 2 changes.

registeredActivityNames,
logger,
}: ThreadedVMWorkflowCreatorOptions): Promise<ThreadedVMWorkflowCreator> {
const maxHeapMb = Number(process.env.TEMPORAL_WORKER_THREAD_MAX_HEAP_MB);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little hesitant about having a public facing escape hatch. People/AI can/will find and use this. I would prefer if we keep this PR focused on testing/CI and not alter published packages.

Comment on lines +50 to +55
* The header line of a stack trace (`<ErrorName>: <message>`) is the most engine- and version-dependent
* part (V8, JSC and Deno all render it differently; JSC may even drop the name and message), and it is
* redundant with the callers' own `instanceof`/`message` assertions. We therefore match it loosely and
* only assert the meaningful call frames. This applies only to multi-line stacks; single-line values
* (e.g. a function name compared against `$CLASS.all`) are matched as-is.
*

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super comfortable with making these assertions less strict. They have caught regressions in the past. We test against a set matrix of these engines, is there a specific one that was flaking?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants