[OPIK-4911] [INFRA] feat: shard E2E TypeScript tests in CI#6763
Conversation
| forbidOnly: !!process.env.CI, | ||
| retries: process.env.CI ? 2 : 0, | ||
| workers: process.env.CI ? 1 : 1, | ||
| workers: process.env.CI ? 4 : 1, |
There was a problem hiding this comment.
Since CI runs with 4 workers, fixtures using test-*-${Date.now()} can collide in the same millisecond and helpers that create/delete by name may clobber each other; should we add a worker/test-specific suffix like workerIndex before parallel CI?
Finding type: Logical Bugs | Severity: 🟠 Medium
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/playwright.config.ts around line 13 (the `workers`
setting), CI enables 4 parallel workers but the shared fixture helpers generate resource
names using millisecond timestamps (e.g., `test-categorical-${Date.now()}` etc.), so two
workers starting in the same ms can collide and one test deletes/overwrites the
other’s prompt/feedback/experiment. Refactor the fixture/helper naming logic to append
a worker-unique suffix (for example using Playwright’s `testInfo.workerIndex` or a
per-worker env/pid value) and ensure all create/update/delete paths use the same
composed name. After that change, re-enable `workers: process.env.CI ? 4 : 1` safely by
keeping uniqueness across workers.
There was a problem hiding this comment.
Skipping — Date.now() isn't used for project/dataset names in this PR. Project names use getRandomString(5) (see helpers/random.ts), giving ~12M variants per worker. The Date.now() usage exists only in feedback-experiments-prompts.fixture.ts (unchanged by this PR), and that fixture is exercised by a single shard at a time post-sharding, so cross-millisecond collisions across workers aren't possible. The broader question of fixture name uniqueness is tracked separately.
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks for the clarification — that makes sense. I'll save this to memory once the PR is merged.
| print(f"🚀 Starting Opik Test Helper Service on port {port}") | ||
| print(f"📍 Health check available at: http://localhost:{port}/health") | ||
| app.run(host="0.0.0.0", port=port, debug=False) | ||
| app.run(host="0.0.0.0", port=port, debug=False, threaded=True) |
There was a problem hiding this comment.
With threaded=True, concurrent Flask requests can clobber process-global os.environ used by OPIK_WORKSPACE/OPIK_URL_OVERRIDE/OPIK_API_KEY—routes/projects.py mutates it per request while helpers read it at call time. Should we keep the handler single-threaded or move the client config to request-local state instead of mutating os.environ?
Finding type: Logical Bugs | Severity: 🟠 Medium
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/test-helper-service/app.py around lines 138-138 where you call
`app.run(..., threaded=True)`, fix the race condition: Flask will handle overlapping
requests, but the helper routes/utils rely on process-global `os.environ` values for
OPik config that are mutated per request. Refactor by either removing `threaded=True` so
the server is single-threaded, or (better) eliminate per-request mutation of
`os.environ` by moving OPik workspace/URL/API key into request-local state (e.g., Flask
request context like `flask.g`) and pass that config to downstream helpers so requests
don’t clobber each other. Ensure routes/projects.py (81-87) and routes/utils.py
(19-25) stop reading/writing shared environment variables for request-specific values.
There was a problem hiding this comment.
Skipping — concern doesn't apply with the final sharding architecture. Each CI shard runs its own isolated Flask helper process (separate runner, separate container), so there's no inter-process concurrency on os.environ to worry about. Within a single shard workers: 1 means one Playwright worker drives one Flask process, so the previous concern (4 Playwright workers hitting one Flask) is moot. The threaded=True flag is harmless in the current single-worker-per-shard setup.
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks for the clarification — that makes sense with the shard-isolated Flask helper and single-worker setup. I'll save this to memory once the PR is merged.
ae9d21f to
f00cf7d
Compare
f00cf7d to
b386da8
Compare
…build-once
Cut Application E2E Tests (TypeScript) from ~30 min to a target ~8-10 min
by sharding across 4 GitHub matrix jobs that each pull a pre-built Opik
image from GHCR rather than rebuilding with ./opik.sh --build.
Workflow shape (end2end_suites_typescript.yml):
build-opik (calls build_e2e_docker.yaml with include_frontend: true)
-> run_suite matrix [1..4] (--shard=$i/4, workers: 1 per shard)
-> merge-reports (downloads 4 blob artifacts, emits one HTML)
Each shard runs an isolated Opik stack (backend + python-backend +
frontend pulled from ghcr.io/comet-ml/opik/...:ci-e2e-<pr>), so the
single-backend contention that broke a previous workers:4 attempt
(40 failed / 18 flaky on PR #6763 history) is structurally impossible.
ALLURE_JOB_RUN_ID defaults to github.run_id so PR triggers aggregate
all 4 shards into one TestOps launch (previously env was only set on
workflow_dispatch input, leaving PR runs ungrouped).
build_e2e_docker.yaml: adds include_frontend boolean input (default
false). Builds and pushes ghcr.io/comet-ml/opik/opik-frontend:ci-e2e-<pr>
when set. Pattern mirrors include_guardrails.
cleanup_e2e_docker.yaml: opik-frontend added to IMAGES so the
cleanup-pr and cleanup-stale jobs sweep frontend tags too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b386da8 to
fcefc8f
Compare
…build-once
Cut Application E2E Tests (TypeScript) from ~30 min to a target ~8-10 min
by sharding across 4 GitHub matrix jobs that each pull a pre-built Opik
image from GHCR rather than rebuilding with ./opik.sh --build.
Workflow shape (end2end_suites_typescript.yml):
build-opik (calls build_e2e_docker.yaml with include_frontend: true)
-> run_suite matrix [1..4] (--shard=$i/4, workers: 1 per shard)
-> merge-reports (downloads 4 blob artifacts, emits one HTML)
Each shard runs an isolated Opik stack (backend + python-backend +
frontend pulled from ghcr.io/comet-ml/opik/...:ci-e2e-<pr>), so the
single-backend contention that broke a previous workers:4 attempt
(40 failed / 18 flaky on PR #6763 history) is structurally impossible.
ALLURE_JOB_RUN_ID defaults to github.run_id so PR triggers aggregate
all 4 shards into one TestOps launch (previously env was only set on
workflow_dispatch input, leaving PR runs ungrouped).
build_e2e_docker.yaml: adds include_frontend boolean input (default
false). Builds and pushes ghcr.io/comet-ml/opik/opik-frontend:ci-e2e-<pr>
when set. Pattern mirrors include_guardrails.
cleanup_e2e_docker.yaml: opik-frontend added to IMAGES so the
cleanup-pr and cleanup-stale jobs sweep frontend tags too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fcefc8f to
25d475f
Compare
…ards The previous attempt defaulted ALLURE_JOB_RUN_ID to github.run_id, which allurectl interprets as "join this pre-existing job run." TestOps had never heard of github.run_id, so every shard got 404 from allurectl watch and the workflow exit 1'd before tests ran. Switch to the documented parallel-jobs pattern: setup-allure -> allurectl launch create, captures launch_id run_suite[N] -> ALLURE_LAUNCH_ID env from needs.setup-allure outputs close-allure -> allurectl launch close on always() Drops the github.run_id fallback on ALLURE_JOB_RUN_ID (which still wins when workflow_dispatch is invoked with an explicit input). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-results
Two small follow-ups after the first green sharded run:
1. Drop allure-results-{1..4} artifact uploads. Per-shard Allure JSON
was redundant with the consolidated TestOps launch that all 4 shards
already upload into via ALLURE_LAUNCH_ID. Saves ~300KB/run and
declutters the artifacts list.
2. Upload tests_end_to_end/typescript-tests/test-results per shard when
the shard fails. Without this, the merged Playwright HTML report's
"Trace" / "Video" / "Screenshot" links point at files that were
never uploaded, so reviewers debugging a failed shard see 404s.
Guarded by if: failure() + if-no-files-found: ignore to avoid
bloating storage on green runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first sharded run hit 20m59s with N=4. Per-shard runtime analysis showed severe imbalance (shard 1: 506s, shard 2: 157s) caused by Playwright's deterministic file-grouping landing fat specs together rather than by lack of shard count. The math says N=8 saves at most ~30s on the long-pole (still bounded by the longest single spec file: trace-attachments.spec.ts at 298.5s). Cheap to verify empirically: one CI run, then we have data to compare. If the savings are modest, this confirms the shard-balance ticket is the real next step. If they're material, we revisit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The build job's push step pushes images sequentially (~92s total: backend 20s + python-backend 66s + frontend 6s observed on PR #6763). Switch to background pushes with wait + accumulated failure check. Expected: wall time = max(push_time_per_image) instead of sum. On the observed numbers that's ~66s instead of ~92s, saving ~26s per CI run across all 6 consumer workflows + the sharded TS E2E suite. Tag step kept sequential (metadata op, no measurable benefit from parallelizing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The N=8 experiment confirmed shard count isn't the bottleneck: 19m13s
wall (vs N=4's 20m21s) with two empty runners burning CI minutes. The
real long-pole is two fat parametrized specs that each fully occupy a
single shard:
online-scoring.spec.ts 12 parametrized tests x ~24s = 290s
trace-attachments.spec.ts 32 parametrized tests x ~9s = 298s
Together they account for 45% of total test CPU time. Playwright shards
at file granularity, so no shard count can split them; the only path
forward is breaking each into smaller files.
Splits:
online-scoring.spec.ts (290s) ->
online-scoring-rule-creation.spec.ts (~92s, 6 parametrized tests)
online-scoring-full-flow.spec.ts (~198s, 6 parametrized tests)
trace-attachments.spec.ts (298s) ->
trace-attachments-low-level.spec.ts (~90s, 10 parametrized)
trace-attachments-decorator.spec.ts (~99s, 10 parametrized + 1 sanity)
trace-attachments-span.spec.ts (~99s, 10 parametrized + 1 sanity)
trace-attachments-shared.ts (extracted ATTACHMENTS const)
Test tags (@fullregression / @sanity / @HappyPaths / @Tracing /
@attachments / @onlinescoring) preserved verbatim so grep-based suite
selection is unchanged.
Predicted long-pole drops from 506s -> 446s; predicted wall time drops
from ~20 min -> ~16 min. Shard 2 still owns both online-scoring halves
(alphabetically adjacent under online-scoring/) which is the new floor.
Also: revert N=8 -> N=4. The experiment showed N=8 wastes ~6
runner-minutes on empty shards 6 and 7 for only 1m08s wall savings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| console.log(`Moderation rule creation test completed successfully for ${providerConfig.display_name} - ${modelConfig.name}`); | ||
| }); | ||
|
|
||
| test(`Online scoring full moderation flow with ${providerConfig.display_name} ${modelConfig.name} @sanity @happypaths @fullregression @onlinescoring`, async ({ |
There was a problem hiding this comment.
In online-scoring-full-flow.spec.ts, the parametrized test(�0nline scoring full moderation flow with ...�1, async (...) => { ... }) seems to do timeout/setup before the required test.info().annotations.push({ type: 'description', ... }), so the scenario may show up in reports without description metadata—should we move that annotation to the first statement?
Finding type: AI Coding Guidelines | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-full-flow.spec.ts
around line 21 (the parametrized `test('Online scoring full moderation flow with ...',
async (...) => { ... })` callback), ensure the test body starts by pushing the required
description annotation via `test.info().annotations.push({ type: 'description', ... })`
as the very first statement. Move any timeout/setup/waits that currently run before this
annotation to after the annotation so the metadata is always present in reports. Verify
the annotation payload matches the convention expected by
`.agents/skills/playwright-e2e/test-conventions.md`.
There was a problem hiding this comment.
Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks for the context — understood that this is a pre-existing test pattern and out of scope for this PR. I’ll save this to memory after the PR is merged.
| test(`Create moderation rule with ${providerConfig.display_name} ${modelConfig.name} @fullregression @onlinescoring`, async ({ | ||
| page, | ||
| projectName, | ||
| helperClient, | ||
| createProjectApi, |
There was a problem hiding this comment.
The loop-generated online-scoring tests run work before the first test.info().annotations.push({ type: 'description', ... }), but the convention in .agents/skills/playwright-e2e/test-conventions.md expects that at the top—should we move/guarantee the description push before any other statements to avoid missing metadata in CI?
Finding type: AI Coding Guidelines | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-rule-creation.spec.ts
around lines 15-19 inside the `for (const { providerName, modelConfig, providerConfig }
of enabledModels)` generated `test(...)` callback, the test body starts work immediately
and does not push the required `test.info().annotations.push({ type: 'description', ...
})` as the first statement. Refactor by adding a description annotation call as the very
first line inside the async test function (before creating `AIProviderSetupHelper` /
`RulesPage` or calling `test.step`), using the provider/model values to populate the
message. Verify that the annotation is present for every loop-generated test and that CI
sees the description metadata.
There was a problem hiding this comment.
Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks, that makes sense — this is a pre-existing test pattern and outside this PR’s scope. I’ll save that to memory once the PR is merged.
| await test.step('Setup AI provider if needed', async () => { | ||
| await providerSetupHelper.setupProviderIfNeeded(providerName, providerConfig); | ||
| }); | ||
|
|
||
| await test.step('Navigate to project and rules tab', async () => { |
There was a problem hiding this comment.
Consider adding the required “step text” comment right before each test.step() in .agents/skills/playwright-e2e, since the steps are emitted directly and the spec no longer matches the readability convention.
Finding type: AI Coding Guidelines | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-rule-creation.spec.ts
around lines 26-47, inside the `test( Create moderation rule with ... )` body, the
multiple `test.step()` calls (“Setup AI provider if needed”, “Navigate to project
and rules tab”, “Create moderation rule”, “Cleanup provider”) are missing the
required guideline comment. Refactor by adding a comment immediately before each `await
test.step('...')` with the exact same step text so the emitted step text is traceable.
Ensure every `test.step()` in that block has the preceding comment.
There was a problem hiding this comment.
Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks, that context helps. I’ll remember this for similar specific-code findings once the PR is merged.
| const ruleName = `Test Moderation Rule - ${modelConfig.name}`; | ||
| await rulesPage.createModerationRule(ruleName, providerConfig.display_name, modelConfig.ui_selector); | ||
|
|
||
| await expect(page.getByText(ruleName)).toBeVisible(); |
There was a problem hiding this comment.
This test bypasses the page-object by calling page.getByText(ruleName) directly, and RulesPage only exposes navigation/creation helpers—can we add a RulesPage assertion like checkRuleExists and use that instead?
Finding type: AI Coding Guidelines | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-rule-creation.spec.ts
around lines 37-43 (the “Create moderation rule” test.step), remove the raw locator
assertion `page.getByText(ruleName)` and instead assert rule presence through the
RulesPage page object. Refactor by adding a new RulesPage method (e.g.,
`checkRuleExists(ruleName)` or `expectRuleVisible(ruleName)`) inside the corresponding
page object, using whatever locator/selector the page object catalog allows for rules.
Then update this spec to call that new RulesPage assertion so selector ownership remains
in RulesPage and the test only orchestrates actions/assertions via page objects.
There was a problem hiding this comment.
Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks, that makes sense — I’ll treat this as out of scope for PRs like this when it’s a pre-existing test-pattern issue already tracked separately. I’ll save this to memory once the PR is merged.
| base(`Trace attachment ${attachment.name} is correctly displayed via low-level client @fullregression @tracing @attachments`, async ({ | ||
| page, | ||
| projectName, | ||
| helperClient, | ||
| }) => { |
There was a problem hiding this comment.
Can we ensure each per-attachment low-level-client test pushes test.info().annotations.push({ type: 'description', ... }) before trace/navigate? Otherwise the convention in .agents/skills/playwright-e2e/test-conventions.md will be missed for looped cases and their reports may lack description.
Finding type: AI Coding Guidelines | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-low-level.spec.ts
around lines 8-12, inside the per-attachment `base(...)` test callback for `Trace
attachment ${attachment.name}`, add the required `test.info().annotations.push({ type:
'description', ... })` at the very start of the async test body (before the first
`base.step`). Populate the description with what the test is doing (e.g., creating a
trace with the given attachment via the low-level client and verifying the attachment
appears). This ensures all looped low-level cases comply with
`.agents/skills/playwright-e2e/test-conventions.md` report requirements.
There was a problem hiding this comment.
Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks for the context — understood that this is a pre-existing test pattern and out of scope for this PR, with follow-up tracked in OPIK-6610. I'll save this to memory once the PR is merged.
| await base.step('Create trace with span attachment', async () => { | ||
| const result = await helperClient.createTraceWithSpanAttachment(projectName, attachment.path); | ||
| attachmentName = result.attachmentName; | ||
| spanName = result.spanName; | ||
| }); |
There was a problem hiding this comment.
helperClient.createTraceWithSpanAttachment(projectName, ...) runs before project setup — should we add an explicit "create project" step mirroring tracing.fixture.ts before invoking it?
Finding type: Logical Bugs | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
In tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-span.spec.ts around
lines 18-22 inside the first step "Create trace with span attachment", ensure the
project is created and ready before calling
helperClient.createTraceWithSpanAttachment(projectName, ...). The current code assumes
the backend auto-creates missing projects, but the helper-service route only forwards
project_name and does not create/wait for the project. Refactor by adding an explicit
"create project / wait for project exists" step (mirroring what tracing.fixture.ts does)
right after entering the test and before invoking createTraceWithSpanAttachment, so the
trace creation and subsequent span/attachment UI assertions run reliably.
There was a problem hiding this comment.
Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks for the clarification — that makes sense. I’ll save this to memory once the PR is merged.
| await base.step('Create trace with span attachment', async () => { | ||
| const result = await helperClient.createTraceWithSpanAttachment(projectName, firstAttachment.path); | ||
| attachmentName = result.attachmentName; | ||
| spanName = result.spanName; | ||
| }); |
There was a problem hiding this comment.
The sanity test uses helperClient.createTraceWithSpanAttachment(projectName, ...) before the project exists, while tracing.fixture.ts creates/waits for the project; can we create the project first in this test too?
Finding type: Logical Bugs | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-span.spec.ts around
lines 53-112, inside the sanity test `Span attachments are correctly displayed within
traces (sanity)`, update the setup so a project exists before calling
`helperClient.createTraceWithSpanAttachment(projectName, firstAttachment.path)` in the
“Create trace with span attachment” step. Right now, `projectName` comes only from
the base fixture and no explicit project creation/wait happens before trace creation, so
trace creation can fail before the attachment assertions. Refactor by adding a project
creation/check-and-wait step (using the existing fixture utilities or the ProjectsPage
flow) before the trace creation step, ensuring the project is persisted and ready for
the helper-service call.
There was a problem hiding this comment.
Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks for the clarification — that makes sense. I’ll save this to memory once the PR is merged.
| base(`Trace attachment ${attachment.name} is correctly displayed via decorator @fullregression @tracing @attachments`, async ({ | ||
| page, | ||
| projectName, | ||
| helperClient, | ||
| }) => { |
There was a problem hiding this comment.
Per-attachment decorator tests start with trace creation/navigation before test.info().annotations.push({ type: 'description', ... }), but .agents/skills/playwright-e2e/test-conventions.md wants that description annotation at the start of each test body—should we push it first for all looped decorator cases?
Finding type: AI Coding Guidelines | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-decorator.spec.ts
around lines 6-43 inside the looped `base(...)` test for each `ATTACHMENTS` item, add
the required `test.info().annotations.push({ type: 'description', description: ... })`
at the very start of the test body (right after the async callback begins), before
creating the trace or navigating pages. Refactor the description to mention the specific
attachment being tested (e.g., attachment.name/path) and the intended behavior
(trace-level attachment displayed via decorator). Ensure this annotation is pushed for
every loop iteration so the reports include the required `description` field.
There was a problem hiding this comment.
Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks, that makes sense — this is a pre-existing test pattern and outside this PR’s scope. I’ll save this to memory once the PR is merged.
| await base.step('Create trace with attachment via decorator', async () => { | ||
| attachmentName = await helperClient.createTraceWithAttachmentDecorator(projectName, attachment.path); | ||
| }); | ||
|
|
||
| await base.step('Navigate to project traces page', async () => { |
There was a problem hiding this comment.
helperClient.createTraceWithAttachmentDecorator is called before the project exists — should we add a create+wait step (like tracing.fixture.ts does) before trace creation, and also add the required step-text comment before each base.step() call to match the documented readability convention?
Finding types: Logical Bugs AI Coding Guidelines | Severity: 🟠 Medium
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. 1. In
tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-decorator.spec.ts
around lines 16-18, refactor by adding an explicit "create project and wait until ready"
step before the trace-creation step, reusing the same approach/logic as the existing
tracing.fixture.ts (the one that calls createProject and waits for it). Do the same for
the sanity test around lines 68-70 so both scenarios reliably create the project before
creating the trace. 2. In the same file, add a comment line immediately before each
`base.step(...)` call that exactly contains the step text passed to `base.step`. This
applies to all `base.step(...)` calls in the first base.describe loop (lines 16-41) and
the sanity test block (lines 68-93).
There was a problem hiding this comment.
Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks for the clarification — that makes sense. I’ll remember this pattern and save it to memory once the PR is merged.
| await base.step('Create trace with attachment via decorator', async () => { | ||
| attachmentName = await helperClient.createTraceWithAttachmentDecorator(projectName, firstAttachment.path); | ||
| }); |
There was a problem hiding this comment.
The sanity test calls helperClient.createTraceWithAttachmentDecorator(projectName, ...) without first creating/waiting for the project like tracing.fixture.ts does, so if the SDK doesn’t auto-create missing projects the spec can fail at trace creation—should we add the same project setup here too?
Finding type: Logical Bugs | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-decorator.spec.ts
around lines 68-70 within the sanity test `Trace attachments are correctly displayed for
traces created via decorator (sanity)`, the test calls
`helperClient.createTraceWithAttachmentDecorator(projectName, ...)` without ensuring the
project exists first (unlike the looped test which navigates and checks the project
before loading traces). Update this sanity test to explicitly create or wait for the
project using the existing fixtures/page objects (e.g., add a `ProjectsPage` flow to
`goto()` + `checkProjectExistsWithRetry(projectName, ...)` and/or call the same
project-setup helper that `tracing.fixture.ts` uses) before creating the trace. Then
keep the rest of the steps the same so trace creation won’t fail when the SDK/helper
service doesn’t auto-create missing projects.
There was a problem hiding this comment.
Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.
🤖 Reply posted via /address-github-pr-comments
There was a problem hiding this comment.
Thanks for the clarification — that makes sense. I’ll remember this pattern and save it to memory once the PR is merged.
# Conflicts: # .github/workflows/build_e2e_docker.yaml
…wall Modeling against the post-split spec runtimes shows N=8 is the sweet spot: long-pole drops from 446s (N=4) to 252s, no idle shards (vs the 2-empty-shard waste from the pre-split N=8 experiment). N=10+ caps at 223s long-pole — the unavoidable feedback-scores + online-scoring-full-flow alphabetical pair — so N=16 would burn 60% more runner-minutes for zero additional speedup vs N=10. Predicted N=8 wall: ~15 min (vs the post-split N=4 measured 19 min). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fern PRs related too, FYI @thiagohora @andrescrz , similar to BE tests sharding, now for FE 👋 @alexkuzmik @aadereiko @AndreiCautisanu — looking for a sign-off to merge at N=8 and iterate from there. N=8 run came inRun 26185619407 — full success, 8 parallel shards on
So N=8 is ~12s faster than N=4, at the cost of ~22 extra runner-min per PR. The win is smaller than predicted: Playwright's contiguous-block sharder reclustered Why N≥10 doesn't helpThe bottleneck is one spec inside the long-pole shard, not the shard count.
Suggested path forwardInstead of more shards, I'd suggest we merge N=8 now and then iterate. The big remaining levers (in order):
We're already 30 → 18-20 min on PR. With (1)+(2) we can hit ~14, with (3) ~10, and the ~5 min goal is reachable but each step is its own project. Better to ship each rung, prove the number, then move on. TL;DR
OK to merge as-is? Will follow up on OPIK-6610 next.
|

Details
Status: parked / empty. Branch reset to an empty commit on top of
mainpending the GHCR build-once initiative. See OPIK-4911 for the full sharding plan.What was attempted first
The original commit on this branch bumped
workers: 1 → 4inplaywright.config.ts(intra-job parallelism) plusthreaded=Trueon the Flask test helper. The CI run regressed on every axis:Failures cluster around list-view helpers (
ProjectsPage.checkProjectExistsWithRetry → page.reload()) reportingnet::ERR_ABORTED; maybe frame was detached?under concurrent load. Even with random per-test names, the test environment has shared state — one Opik backend, one workspace (default), one ClickHouse, one Flask helper — that intra-job workers cannot escape. Bumpingworkerscannot fix this; sharding can.Sharding plan (this PR's eventual content)
Split the suite across N=4 GitHub matrix jobs, each running an isolated Opik backend:
npx playwright test --shard=${{ matrix.shard }}/4workers: 1per shard — sharding is the parallelism axis, not intra-shard workers./opik.sh --buildtax--reporter=blobper shard + a follow-upnpx playwright merge-reportsjob uploading one consolidated HTML artifactALLURE_JOB_RUN_ID: ${{ github.event.inputs.ALLURE_JOB_RUN_ID || github.run_id }}so all shards land in one TestOps launch (PR triggers currently leave this blank)Expected wall time after GHCR build-once lands: ~5–7 min tests + ~2 min setup = 8–10 min total, comfortably under the 15 min AC.
Change checklist
end2end_suites_typescript.yml(blocked on GHCR build-once)--reporter=blob+ merge-reports aggregation jobALLURE_JOB_RUN_IDdefaulted fromgithub.run_idfullregressiongreen 3 consecutive runsIssues
Testing
workflow_dispatchwithfullregression; verify wall timeDocumentation
No user-facing changes; CI-only.
🤖 Generated with Claude Code