[OPIK-4911] [INFRA] feat: shard E2E TypeScript tests in CI by JetoPistola · Pull Request #6763 · comet-ml/opik

JetoPistola · 2026-05-19T05:13:34Z

Details

Status: parked / empty. Branch reset to an empty commit on top of main pending the GHCR build-once initiative. See OPIK-4911 for the full sharding plan.

What was attempted first

The original commit on this branch bumped workers: 1 → 4 in playwright.config.ts (intra-job parallelism) plus threaded=True on the Flask test helper. The CI run regressed on every axis:

39m52s wall time (vs ~30 min baseline)
31 min test phase (vs ~20 min baseline)
40 failed / 18 flaky / 43 passed

Failures cluster around list-view helpers (ProjectsPage.checkProjectExistsWithRetry → page.reload()) reporting net::ERR_ABORTED; maybe frame was detached? under concurrent load. Even with random per-test names, the test environment has shared state — one Opik backend, one workspace (default), one ClickHouse, one Flask helper — that intra-job workers cannot escape. Bumping workers cannot fix this; sharding can.

Sharding plan (this PR's eventual content)

Split the suite across N=4 GitHub matrix jobs, each running an isolated Opik backend:

strategy:
  fail-fast: false
  matrix:
    shard: [1, 2, 3, 4]

npx playwright test --shard=${{ matrix.shard }}/4
workers: 1 per shard — sharding is the parallelism axis, not intra-shard workers
Each shard pulls a pre-built Opik image from GHCR (depends on the GHCR build-once initiative), avoiding the 4× ./opik.sh --build tax
--reporter=blob per shard + a follow-up npx playwright merge-reports job uploading one consolidated HTML artifact
ALLURE_JOB_RUN_ID: ${{ github.event.inputs.ALLURE_JOB_RUN_ID || github.run_id }} so all shards land in one TestOps launch (PR triggers currently leave this blank)

Expected wall time after GHCR build-once lands: ~5–7 min tests + ~2 min setup = 8–10 min total, comfortably under the 15 min AC.

Change checklist

Matrix added to end2end_suites_typescript.yml (blocked on GHCR build-once)
--reporter=blob + merge-reports aggregation job
ALLURE_JOB_RUN_ID defaulted from github.run_id
fullregression green 3 consecutive runs

Issues

Resolves OPIK-4911
Blocked on: GHCR build-once initiative

Testing

Manual workflow_dispatch with fullregression; verify wall time
TestOps shows one consolidated Allure launch (not 4 fragments)
Single combined Playwright HTML report artifact

Documentation

No user-facing changes; CI-only.

🤖 Generated with Claude Code

baz-reviewer · 2026-05-19T05:20:30Z

  forbidOnly: !!process.env.CI,
  retries: process.env.CI ? 2 : 0,
-  workers: process.env.CI ? 1 : 1,
+  workers: process.env.CI ? 4 : 1,


Since CI runs with 4 workers, fixtures using test-*-${Date.now()} can collide in the same millisecond and helpers that create/delete by name may clobber each other; should we add a worker/test-specific suffix like workerIndex before parallel CI?

_{Finding type: Logical Bugs | Severity: 🟠 Medium}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In tests_end_to_end/typescript-tests/playwright.config.ts around line 13 (the `workers` setting), CI enables 4 parallel workers but the shared fixture helpers generate resource names using millisecond timestamps (e.g., `test-categorical-${Date.now()}` etc.), so two workers starting in the same ms can collide and one test deletes/overwrites the other’s prompt/feedback/experiment. Refactor the fixture/helper naming logic to append a worker-unique suffix (for example using Playwright’s `testInfo.workerIndex` or a per-worker env/pid value) and ensure all create/update/delete paths use the same composed name. After that change, re-enable `workers: process.env.CI ? 4 : 1` safely by keeping uniqueness across workers.

Skipping — Date.now() isn't used for project/dataset names in this PR. Project names use getRandomString(5) (see helpers/random.ts), giving ~12M variants per worker. The Date.now() usage exists only in feedback-experiments-prompts.fixture.ts (unchanged by this PR), and that fixture is exercised by a single shard at a time post-sharding, so cross-millisecond collisions across workers aren't possible. The broader question of fixture name uniqueness is tracked separately.

🤖 Reply posted via /address-github-pr-comments

Thanks for the clarification — that makes sense. I'll save this to memory once the PR is merged.

baz-reviewer · 2026-05-19T05:20:30Z

    print(f"🚀 Starting Opik Test Helper Service on port {port}")
    print(f"📍 Health check available at: http://localhost:{port}/health")
-    app.run(host="0.0.0.0", port=port, debug=False)
+    app.run(host="0.0.0.0", port=port, debug=False, threaded=True)


With threaded=True, concurrent Flask requests can clobber process-global os.environ used by OPIK_WORKSPACE/OPIK_URL_OVERRIDE/OPIK_API_KEY—routes/projects.py mutates it per request while helpers read it at call time. Should we keep the handler single-threaded or move the client config to request-local state instead of mutating os.environ?

_{Finding type: Logical Bugs | Severity: 🟠 Medium}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In tests_end_to_end/test-helper-service/app.py around lines 138-138 where you call `app.run(..., threaded=True)`, fix the race condition: Flask will handle overlapping requests, but the helper routes/utils rely on process-global `os.environ` values for OPik config that are mutated per request. Refactor by either removing `threaded=True` so the server is single-threaded, or (better) eliminate per-request mutation of `os.environ` by moving OPik workspace/URL/API key into request-local state (e.g., Flask request context like `flask.g`) and pass that config to downstream helpers so requests don’t clobber each other. Ensure routes/projects.py (81-87) and routes/utils.py (19-25) stop reading/writing shared environment variables for request-specific values.

Skipping — concern doesn't apply with the final sharding architecture. Each CI shard runs its own isolated Flask helper process (separate runner, separate container), so there's no inter-process concurrency on os.environ to worry about. Within a single shard workers: 1 means one Playwright worker drives one Flask process, so the previous concern (4 Playwright workers hitting one Flask) is moot. The threaded=True flag is harmless in the current single-worker-per-shard setup.

🤖 Reply posted via /address-github-pr-comments

Thanks for the clarification — that makes sense with the shard-isolated Flask helper and single-worker setup. I'll save this to memory once the PR is merged.

…build-once Cut Application E2E Tests (TypeScript) from ~30 min to a target ~8-10 min by sharding across 4 GitHub matrix jobs that each pull a pre-built Opik image from GHCR rather than rebuilding with ./opik.sh --build. Workflow shape (end2end_suites_typescript.yml): build-opik (calls build_e2e_docker.yaml with include_frontend: true) -> run_suite matrix [1..4] (--shard=$i/4, workers: 1 per shard) -> merge-reports (downloads 4 blob artifacts, emits one HTML) Each shard runs an isolated Opik stack (backend + python-backend + frontend pulled from ghcr.io/comet-ml/opik/...:ci-e2e-<pr>), so the single-backend contention that broke a previous workers:4 attempt (40 failed / 18 flaky on PR #6763 history) is structurally impossible. ALLURE_JOB_RUN_ID defaults to github.run_id so PR triggers aggregate all 4 shards into one TestOps launch (previously env was only set on workflow_dispatch input, leaving PR runs ungrouped). build_e2e_docker.yaml: adds include_frontend boolean input (default false). Builds and pushes ghcr.io/comet-ml/opik/opik-frontend:ci-e2e-<pr> when set. Pattern mirrors include_guardrails. cleanup_e2e_docker.yaml: opik-frontend added to IMAGES so the cleanup-pr and cleanup-stale jobs sweep frontend tags too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ards The previous attempt defaulted ALLURE_JOB_RUN_ID to github.run_id, which allurectl interprets as "join this pre-existing job run." TestOps had never heard of github.run_id, so every shard got 404 from allurectl watch and the workflow exit 1'd before tests ran. Switch to the documented parallel-jobs pattern: setup-allure -> allurectl launch create, captures launch_id run_suite[N] -> ALLURE_LAUNCH_ID env from needs.setup-allure outputs close-allure -> allurectl launch close on always() Drops the github.run_id fallback on ALLURE_JOB_RUN_ID (which still wins when workflow_dispatch is invoked with an explicit input). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-results Two small follow-ups after the first green sharded run: 1. Drop allure-results-{1..4} artifact uploads. Per-shard Allure JSON was redundant with the consolidated TestOps launch that all 4 shards already upload into via ALLURE_LAUNCH_ID. Saves ~300KB/run and declutters the artifacts list. 2. Upload tests_end_to_end/typescript-tests/test-results per shard when the shard fails. Without this, the merged Playwright HTML report's "Trace" / "Video" / "Screenshot" links point at files that were never uploaded, so reviewers debugging a failed shard see 404s. Guarded by if: failure() + if-no-files-found: ignore to avoid bloating storage on green runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The first sharded run hit 20m59s with N=4. Per-shard runtime analysis showed severe imbalance (shard 1: 506s, shard 2: 157s) caused by Playwright's deterministic file-grouping landing fat specs together rather than by lack of shard count. The math says N=8 saves at most ~30s on the long-pole (still bounded by the longest single spec file: trace-attachments.spec.ts at 298.5s). Cheap to verify empirically: one CI run, then we have data to compare. If the savings are modest, this confirms the shard-balance ticket is the real next step. If they're material, we revisit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The build job's push step pushes images sequentially (~92s total: backend 20s + python-backend 66s + frontend 6s observed on PR #6763). Switch to background pushes with wait + accumulated failure check. Expected: wall time = max(push_time_per_image) instead of sum. On the observed numbers that's ~66s instead of ~92s, saving ~26s per CI run across all 6 consumer workflows + the sharded TS E2E suite. Tag step kept sequential (metadata op, no measurable benefit from parallelizing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@sanity

The N=8 experiment confirmed shard count isn't the bottleneck: 19m13s wall (vs N=4's 20m21s) with two empty runners burning CI minutes. The real long-pole is two fat parametrized specs that each fully occupy a single shard: online-scoring.spec.ts 12 parametrized tests x ~24s = 290s trace-attachments.spec.ts 32 parametrized tests x ~9s = 298s Together they account for 45% of total test CPU time. Playwright shards at file granularity, so no shard count can split them; the only path forward is breaking each into smaller files. Splits: online-scoring.spec.ts (290s) -> online-scoring-rule-creation.spec.ts (~92s, 6 parametrized tests) online-scoring-full-flow.spec.ts (~198s, 6 parametrized tests) trace-attachments.spec.ts (298s) -> trace-attachments-low-level.spec.ts (~90s, 10 parametrized) trace-attachments-decorator.spec.ts (~99s, 10 parametrized + 1 sanity) trace-attachments-span.spec.ts (~99s, 10 parametrized + 1 sanity) trace-attachments-shared.ts (extracted ATTACHMENTS const) Test tags (@fullregression / @sanity / @HappyPaths / @Tracing / @attachments / @onlinescoring) preserved verbatim so grep-based suite selection is unchanged. Predicted long-pole drops from 506s -> 446s; predicted wall time drops from ~20 min -> ~16 min. Shard 2 still owns both online-scoring halves (alphabetically adjacent under online-scoring/) which is the new floor. Also: revert N=8 -> N=4. The experiment showed N=8 wastes ~6 runner-minutes on empty shards 6 and 7 for only 1m08s wall savings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

baz-reviewer · 2026-05-20T05:52:36Z

-      console.log(`Moderation rule creation test completed successfully for ${providerConfig.display_name} - ${modelConfig.name}`);
-    });
-
    test(`Online scoring full moderation flow with ${providerConfig.display_name} ${modelConfig.name} @sanity @happypaths @fullregression @onlinescoring`, async ({


In online-scoring-full-flow.spec.ts, the parametrized test(�0nline scoring full moderation flow with ...�1, async (...) => { ... }) seems to do timeout/setup before the required test.info().annotations.push({ type: 'description', ... }), so the scenario may show up in reports without description metadata—should we move that annotation to the first statement?

_{Finding type: AI Coding Guidelines | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-full-flow.spec.ts around line 21 (the parametrized `test('Online scoring full moderation flow with ...', async (...) => { ... })` callback), ensure the test body starts by pushing the required description annotation via `test.info().annotations.push({ type: 'description', ... })` as the very first statement. Move any timeout/setup/waits that currently run before this annotation to after the annotation so the metadata is always present in reports. Verify the annotation payload matches the convention expected by `.agents/skills/playwright-e2e/test-conventions.md`.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Thanks for the context — understood that this is a pre-existing test pattern and out of scope for this PR. I’ll save this to memory after the PR is merged.

baz-reviewer · 2026-05-20T05:52:36Z

+    test(`Create moderation rule with ${providerConfig.display_name} ${modelConfig.name} @fullregression @onlinescoring`, async ({
+      page,
+      projectName,
+      helperClient,
+      createProjectApi,


The loop-generated online-scoring tests run work before the first test.info().annotations.push({ type: 'description', ... }), but the convention in .agents/skills/playwright-e2e/test-conventions.md expects that at the top—should we move/guarantee the description push before any other statements to avoid missing metadata in CI?

_{Finding type: AI Coding Guidelines | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-rule-creation.spec.ts around lines 15-19 inside the `for (const { providerName, modelConfig, providerConfig } of enabledModels)` generated `test(...)` callback, the test body starts work immediately and does not push the required `test.info().annotations.push({ type: 'description', ... })` as the first statement. Refactor by adding a description annotation call as the very first line inside the async test function (before creating `AIProviderSetupHelper` / `RulesPage` or calling `test.step`), using the provider/model values to populate the message. Verify that the annotation is present for every loop-generated test and that CI sees the description metadata.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Thanks, that makes sense — this is a pre-existing test pattern and outside this PR’s scope. I’ll save that to memory once the PR is merged.

baz-reviewer · 2026-05-20T05:52:36Z

+      await test.step('Setup AI provider if needed', async () => {
+        await providerSetupHelper.setupProviderIfNeeded(providerName, providerConfig);
+      });
+
+      await test.step('Navigate to project and rules tab', async () => {


Consider adding the required “step text” comment right before each test.step() in .agents/skills/playwright-e2e, since the steps are emitted directly and the spec no longer matches the readability convention.

_{Finding type: AI Coding Guidelines | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-rule-creation.spec.ts around lines 26-47, inside the `test( Create moderation rule with ... )` body, the multiple `test.step()` calls (“Setup AI provider if needed”, “Navigate to project and rules tab”, “Create moderation rule”, “Cleanup provider”) are missing the required guideline comment. Refactor by adding a comment immediately before each `await test.step('...')` with the exact same step text so the emitted step text is traceable. Ensure every `test.step()` in that block has the preceding comment.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Thanks, that context helps. I’ll remember this for similar specific-code findings once the PR is merged.

baz-reviewer · 2026-05-20T05:52:36Z

+        const ruleName = `Test Moderation Rule - ${modelConfig.name}`;
+        await rulesPage.createModerationRule(ruleName, providerConfig.display_name, modelConfig.ui_selector);
+
+        await expect(page.getByText(ruleName)).toBeVisible();


This test bypasses the page-object by calling page.getByText(ruleName) directly, and RulesPage only exposes navigation/creation helpers—can we add a RulesPage assertion like checkRuleExists and use that instead?

_{Finding type: AI Coding Guidelines | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-rule-creation.spec.ts around lines 37-43 (the “Create moderation rule” test.step), remove the raw locator assertion `page.getByText(ruleName)` and instead assert rule presence through the RulesPage page object. Refactor by adding a new RulesPage method (e.g., `checkRuleExists(ruleName)` or `expectRuleVisible(ruleName)`) inside the corresponding page object, using whatever locator/selector the page object catalog allows for rules. Then update this spec to call that new RulesPage assertion so selector ownership remains in RulesPage and the test only orchestrates actions/assertions via page objects.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Thanks, that makes sense — I’ll treat this as out of scope for PRs like this when it’s a pre-existing test-pattern issue already tracked separately. I’ll save this to memory once the PR is merged.

baz-reviewer · 2026-05-20T05:52:36Z

+    base(`Trace attachment ${attachment.name} is correctly displayed via low-level client @fullregression @tracing @attachments`, async ({
+      page,
+      projectName,
+      helperClient,
+    }) => {


Can we ensure each per-attachment low-level-client test pushes test.info().annotations.push({ type: 'description', ... }) before trace/navigate? Otherwise the convention in .agents/skills/playwright-e2e/test-conventions.md will be missed for looped cases and their reports may lack description.

_{Finding type: AI Coding Guidelines | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-low-level.spec.ts around lines 8-12, inside the per-attachment `base(...)` test callback for `Trace attachment ${attachment.name}`, add the required `test.info().annotations.push({ type: 'description', ... })` at the very start of the async test body (before the first `base.step`). Populate the description with what the test is doing (e.g., creating a trace with the given attachment via the low-level client and verifying the attachment appears). This ensures all looped low-level cases comply with `.agents/skills/playwright-e2e/test-conventions.md` report requirements.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Thanks for the context — understood that this is a pre-existing test pattern and out of scope for this PR, with follow-up tracked in OPIK-6610. I'll save this to memory once the PR is merged.

baz-reviewer · 2026-05-20T05:52:36Z

+      await base.step('Create trace with span attachment', async () => {
+        const result = await helperClient.createTraceWithSpanAttachment(projectName, attachment.path);
+        attachmentName = result.attachmentName;
+        spanName = result.spanName;
+      });


helperClient.createTraceWithSpanAttachment(projectName, ...) runs before project setup — should we add an explicit "create project" step mirroring tracing.fixture.ts before invoking it?

_{Finding type: Logical Bugs | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

In tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-span.spec.ts around lines 18-22 inside the first step "Create trace with span attachment", ensure the project is created and ready before calling helperClient.createTraceWithSpanAttachment(projectName, ...). The current code assumes the backend auto-creates missing projects, but the helper-service route only forwards project_name and does not create/wait for the project. Refactor by adding an explicit "create project / wait for project exists" step (mirroring what tracing.fixture.ts does) right after entering the test and before invoking createTraceWithSpanAttachment, so the trace creation and subsequent span/attachment UI assertions run reliably.

Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.

🤖 Reply posted via /address-github-pr-comments

Thanks for the clarification — that makes sense. I’ll save this to memory once the PR is merged.

baz-reviewer · 2026-05-20T05:52:36Z

+    await base.step('Create trace with span attachment', async () => {
+      const result = await helperClient.createTraceWithSpanAttachment(projectName, firstAttachment.path);
+      attachmentName = result.attachmentName;
+      spanName = result.spanName;
+    });


The sanity test uses helperClient.createTraceWithSpanAttachment(projectName, ...) before the project exists, while tracing.fixture.ts creates/waits for the project; can we create the project first in this test too?

_{Finding type: Logical Bugs | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-span.spec.ts around lines 53-112, inside the sanity test `Span attachments are correctly displayed within traces (sanity)`, update the setup so a project exists before calling `helperClient.createTraceWithSpanAttachment(projectName, firstAttachment.path)` in the “Create trace with span attachment” step. Right now, `projectName` comes only from the base fixture and no explicit project creation/wait happens before trace creation, so trace creation can fail before the attachment assertions. Refactor by adding a project creation/check-and-wait step (using the existing fixture utilities or the ProjectsPage flow) before the trace creation step, ensuring the project is persisted and ready for the helper-service call.

Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.

🤖 Reply posted via /address-github-pr-comments

Thanks for the clarification — that makes sense. I’ll save this to memory once the PR is merged.

baz-reviewer · 2026-05-20T05:52:36Z

+    base(`Trace attachment ${attachment.name} is correctly displayed via decorator @fullregression @tracing @attachments`, async ({
+      page,
+      projectName,
+      helperClient,
+    }) => {


Per-attachment decorator tests start with trace creation/navigation before test.info().annotations.push({ type: 'description', ... }), but .agents/skills/playwright-e2e/test-conventions.md wants that description annotation at the start of each test body—should we push it first for all looped decorator cases?

_{Finding type: AI Coding Guidelines | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-decorator.spec.ts around lines 6-43 inside the looped `base(...)` test for each `ATTACHMENTS` item, add the required `test.info().annotations.push({ type: 'description', description: ... })` at the very start of the test body (right after the async callback begins), before creating the trace or navigating pages. Refactor the description to mention the specific attachment being tested (e.g., attachment.name/path) and the intended behavior (trace-level attachment displayed via decorator). Ensure this annotation is pushed for every loop iteration so the reports include the required `description` field.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Thanks, that makes sense — this is a pre-existing test pattern and outside this PR’s scope. I’ll save this to memory once the PR is merged.

baz-reviewer · 2026-05-20T05:52:36Z

+      await base.step('Create trace with attachment via decorator', async () => {
+        attachmentName = await helperClient.createTraceWithAttachmentDecorator(projectName, attachment.path);
+      });
+
+      await base.step('Navigate to project traces page', async () => {


helperClient.createTraceWithAttachmentDecorator is called before the project exists — should we add a create+wait step (like tracing.fixture.ts does) before trace creation, and also add the required step-text comment before each base.step() call to match the documented readability convention?

_{Finding types: Logical Bugs AI Coding Guidelines | Severity: 🟠 Medium}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. 1. In tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-decorator.spec.ts around lines 16-18, refactor by adding an explicit "create project and wait until ready" step before the trace-creation step, reusing the same approach/logic as the existing tracing.fixture.ts (the one that calls createProject and waits for it). Do the same for the sanity test around lines 68-70 so both scenarios reliably create the project before creating the trace. 2. In the same file, add a comment line immediately before each `base.step(...)` call that exactly contains the step text passed to `base.step`. This applies to all `base.step(...)` calls in the first base.describe loop (lines 16-41) and the sanity test block (lines 68-93).

Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.

🤖 Reply posted via /address-github-pr-comments

Thanks for the clarification — that makes sense. I’ll remember this pattern and save it to memory once the PR is merged.

baz-reviewer · 2026-05-20T05:52:36Z

+    await base.step('Create trace with attachment via decorator', async () => {
+      attachmentName = await helperClient.createTraceWithAttachmentDecorator(projectName, firstAttachment.path);
+    });


The sanity test calls helperClient.createTraceWithAttachmentDecorator(projectName, ...) without first creating/waiting for the project like tracing.fixture.ts does, so if the SDK doesn’t auto-create missing projects the spec can fail at trace creation—should we add the same project setup here too?

_{Finding type: Logical Bugs | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-decorator.spec.ts around lines 68-70 within the sanity test `Trace attachments are correctly displayed for traces created via decorator (sanity)`, the test calls `helperClient.createTraceWithAttachmentDecorator(projectName, ...)` without ensuring the project exists first (unlike the looped test which navigates and checks the project before loading traces). Update this sanity test to explicitly create or wait for the project using the existing fixtures/page objects (e.g., add a `ProjectsPage` flow to `goto()` + `checkProjectExistsWithRetry(projectName, ...)` and/or call the same project-setup helper that `tracing.fixture.ts` uses) before creating the trace. Then keep the rest of the steps the same so trace creation won’t fail when the SDK/helper service doesn’t auto-create missing projects.

Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.

🤖 Reply posted via /address-github-pr-comments

Thanks for the clarification — that makes sense. I’ll remember this pattern and save it to memory once the PR is merged.

# Conflicts: # .github/workflows/build_e2e_docker.yaml

…wall Modeling against the post-split spec runtimes shows N=8 is the sweet spot: long-pole drops from 446s (N=4) to 252s, no idle shards (vs the 2-empty-shard waste from the pre-split N=8 experiment). N=10+ caps at 223s long-pole — the unavoidable feedback-scores + online-scoring-full-flow alphabetical pair — so N=16 would burn 60% more runner-minutes for zero additional speedup vs N=10. Predicted N=8 wall: ~15 min (vs the post-split N=4 measured 19 min). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

JetoPistola · 2026-05-20T20:10:54Z

Fern PRs related too, FYI @thiagohora @andrescrz , similar to BE tests sharding, now for FE

👋 @alexkuzmik @aadereiko @AndreiCautisanu — looking for a sign-off to merge at N=8 and iterate from there.

N=8 run came in

Run 26185619407 — full success, 8 parallel shards on fullregression.

Config	Wall	Long-pole shard	Spread (fastest → slowest)	Runner-min
N=4 + splits (currently shipped)	19m00s	10m20s	n/a	~36
N=8 + splits (this run)	18m48s	9m38s (shard 2)	4m38s → 9m38s	~58

So N=8 is ~12s faster than N=4, at the cost of ~22 extra runner-min per PR. The win is smaller than predicted: Playwright's contiguous-block sharder reclustered online-scoring-full-flow with neighbours onto shard 2, while shard 4 finished in 4m38s and idled.

Why N≥10 doesn't help

The bottleneck is one spec inside the long-pole shard, not the shard count. online-scoring-full-flow = 199s of sequential LLM-moderation tests; adding shards doesn't subdivide it. We'd just pay for more idle runners. Math:

N	Long-pole	Wall	Runner-min	vs N=8
8	~9m+	~18 min	~58	—
10	~9m+	~18 min	~67	+9 runner-min, 0 wall
16	~9m+	~18 min	~107	+49 runner-min, 0 wall

Suggested path forward

Instead of more shards, I'd suggest we merge N=8 now and then iterate. The big remaining levers (in order):

Parallelize online-scoring-full-flow — 6 sequential LLM-moderation tests inside one spec. test.describe.parallel or a faster moderation stub drops the spec from 199s → ~70-100s. Expected wall: ~14 min. Tracked in OPIK-6610 (Option 4).
Split / parallelize trace-spans (173s) — same pattern. Would become the next long-pole if we don't.
Cut the 8.5-min Docker build — runs on every PR with mostly cold caches. Layer-cache reuse from main, or build-once-per-day cached tags. Expected wall: ~10 min. No ticket yet — I can file one.
Past that — ~5 min floor, but it requires rewriting individual tests or moving to faster runners.

We're already 30 → 18-20 min on PR. With (1)+(2) we can hit ~14, with (3) ~10, and the ~5 min goal is reachable but each step is its own project. Better to ship each rung, prove the number, then move on.

TL;DR

N=8 is the right place to land for now.
N=10/12/16 isn't worth the runner-min until the slowest spec is fixed.
Slow-test work tracked in OPIK-6610; Docker-build optimization needs a new ticket.

OK to merge as-is? Will follow up on OPIK-6610 next.

Diagram of the ladder + the per-shard time composition is at diagrams/opik-4911-next-steps-diagram.html in the repo — I'll paste an image of it in a follow-up comment.

JetoPistola requested review from a team as code owners May 19, 2026 05:13

github-actions Bot added python Pull requests that update Python code tests Including test files, or tests related like configuration. typescript *.ts *.tsx labels May 19, 2026

github-actions Bot assigned JetoPistola May 19, 2026

baz-reviewer Bot reviewed May 19, 2026

View reviewed changes

JetoPistola marked this pull request as draft May 19, 2026 05:52

JetoPistola force-pushed the danield/OPIK-4911-parallelize-e2e-typescript-tests-in-ci branch from ae9d21f to f00cf7d Compare May 19, 2026 06:01

JetoPistola changed the title ~~[OPIK-4911] [INFRA] feat: parallelize TS E2E tests with 4 workers in CI~~ [OPIK-4911] [INFRA] feat: shard E2E TypeScript tests in CI May 19, 2026

JetoPistola force-pushed the danield/OPIK-4911-parallelize-e2e-typescript-tests-in-ci branch from f00cf7d to b386da8 Compare May 19, 2026 16:17

JetoPistola changed the base branch from main to danield/OPIK-6028-build-docker-images-once-parallelize-tests May 19, 2026 16:17

JetoPistola force-pushed the danield/OPIK-4911-parallelize-e2e-typescript-tests-in-ci branch from b386da8 to fcefc8f Compare May 19, 2026 16:26

github-actions Bot added Infrastructure and removed python Pull requests that update Python code tests Including test files, or tests related like configuration. typescript *.ts *.tsx labels May 19, 2026

JetoPistola force-pushed the danield/OPIK-4911-parallelize-e2e-typescript-tests-in-ci branch from fcefc8f to 25d475f Compare May 19, 2026 16:30

JetoPistola and others added 3 commits May 19, 2026 21:37

github-actions Bot added tests Including test files, or tests related like configuration. typescript *.ts *.tsx labels May 20, 2026

baz-reviewer Bot reviewed May 20, 2026

View reviewed changes

JetoPistola and others added 2 commits May 20, 2026 10:51

merge: pull GHCR build-once branch including parallel push

b332466

# Conflicts: # .github/workflows/build_e2e_docker.yaml

JetoPistola marked this pull request as ready for review May 20, 2026 20:12

JetoPistola requested a review from a team as a code owner May 20, 2026 20:12

Conversation

JetoPistola commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Details

What was attempted first

Sharding plan (this PR's eventual content)

Change checklist

Issues

Testing

Documentation

Uh oh!

baz-reviewer Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JetoPistola commented May 19, 2026 •

edited

Loading

JetoPistola commented May 20, 2026 •

edited

Loading