Skip to content

[OPIK-4911] [INFRA] feat: shard E2E TypeScript tests in CI#6763

Open
JetoPistola wants to merge 7 commits into
danield/OPIK-6028-build-docker-images-once-parallelize-testsfrom
danield/OPIK-4911-parallelize-e2e-typescript-tests-in-ci
Open

[OPIK-4911] [INFRA] feat: shard E2E TypeScript tests in CI#6763
JetoPistola wants to merge 7 commits into
danield/OPIK-6028-build-docker-images-once-parallelize-testsfrom
danield/OPIK-4911-parallelize-e2e-typescript-tests-in-ci

Conversation

@JetoPistola
Copy link
Copy Markdown
Contributor

@JetoPistola JetoPistola commented May 19, 2026

Details

image image

Status: parked / empty. Branch reset to an empty commit on top of main pending the GHCR build-once initiative. See OPIK-4911 for the full sharding plan.

What was attempted first

The original commit on this branch bumped workers: 1 → 4 in playwright.config.ts (intra-job parallelism) plus threaded=True on the Flask test helper. The CI run regressed on every axis:

  • 39m52s wall time (vs ~30 min baseline)
  • 31 min test phase (vs ~20 min baseline)
  • 40 failed / 18 flaky / 43 passed

Failures cluster around list-view helpers (ProjectsPage.checkProjectExistsWithRetry → page.reload()) reporting net::ERR_ABORTED; maybe frame was detached? under concurrent load. Even with random per-test names, the test environment has shared state — one Opik backend, one workspace (default), one ClickHouse, one Flask helper — that intra-job workers cannot escape. Bumping workers cannot fix this; sharding can.

Sharding plan (this PR's eventual content)

Split the suite across N=4 GitHub matrix jobs, each running an isolated Opik backend:

strategy:
  fail-fast: false
  matrix:
    shard: [1, 2, 3, 4]
  • npx playwright test --shard=${{ matrix.shard }}/4
  • workers: 1 per shard — sharding is the parallelism axis, not intra-shard workers
  • Each shard pulls a pre-built Opik image from GHCR (depends on the GHCR build-once initiative), avoiding the 4× ./opik.sh --build tax
  • --reporter=blob per shard + a follow-up npx playwright merge-reports job uploading one consolidated HTML artifact
  • ALLURE_JOB_RUN_ID: ${{ github.event.inputs.ALLURE_JOB_RUN_ID || github.run_id }} so all shards land in one TestOps launch (PR triggers currently leave this blank)

Expected wall time after GHCR build-once lands: ~5–7 min tests + ~2 min setup = 8–10 min total, comfortably under the 15 min AC.

Change checklist

  • Matrix added to end2end_suites_typescript.yml (blocked on GHCR build-once)
  • --reporter=blob + merge-reports aggregation job
  • ALLURE_JOB_RUN_ID defaulted from github.run_id
  • fullregression green 3 consecutive runs

Issues

  • Resolves OPIK-4911
  • Blocked on: GHCR build-once initiative

Testing

  • Manual workflow_dispatch with fullregression; verify wall time
  • TestOps shows one consolidated Allure launch (not 4 fragments)
  • Single combined Playwright HTML report artifact

Documentation

No user-facing changes; CI-only.

🤖 Generated with Claude Code

@JetoPistola JetoPistola requested review from a team as code owners May 19, 2026 05:13
@github-actions github-actions Bot added python Pull requests that update Python code tests Including test files, or tests related like configuration. typescript *.ts *.tsx labels May 19, 2026
forbidOnly: !!process.env.CI,
retries: process.env.CI ? 2 : 0,
workers: process.env.CI ? 1 : 1,
workers: process.env.CI ? 4 : 1,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since CI runs with 4 workers, fixtures using test-*-${Date.now()} can collide in the same millisecond and helpers that create/delete by name may clobber each other; should we add a worker/test-specific suffix like workerIndex before parallel CI?

Finding type: Logical Bugs | Severity: 🟠 Medium


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/playwright.config.ts around line 13 (the `workers`
setting), CI enables 4 parallel workers but the shared fixture helpers generate resource
names using millisecond timestamps (e.g., `test-categorical-${Date.now()}` etc.), so two
workers starting in the same ms can collide and one test deletes/overwrites the
other’s prompt/feedback/experiment. Refactor the fixture/helper naming logic to append
a worker-unique suffix (for example using Playwright’s `testInfo.workerIndex` or a
per-worker env/pid value) and ensure all create/update/delete paths use the same
composed name. After that change, re-enable `workers: process.env.CI ? 4 : 1` safely by
keeping uniqueness across workers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — Date.now() isn't used for project/dataset names in this PR. Project names use getRandomString(5) (see helpers/random.ts), giving ~12M variants per worker. The Date.now() usage exists only in feedback-experiments-prompts.fixture.ts (unchanged by this PR), and that fixture is exercised by a single shard at a time post-sharding, so cross-millisecond collisions across workers aren't possible. The broader question of fixture name uniqueness is tracked separately.

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification — that makes sense. I'll save this to memory once the PR is merged.

print(f"🚀 Starting Opik Test Helper Service on port {port}")
print(f"📍 Health check available at: http://localhost:{port}/health")
app.run(host="0.0.0.0", port=port, debug=False)
app.run(host="0.0.0.0", port=port, debug=False, threaded=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With threaded=True, concurrent Flask requests can clobber process-global os.environ used by OPIK_WORKSPACE/OPIK_URL_OVERRIDE/OPIK_API_KEYroutes/projects.py mutates it per request while helpers read it at call time. Should we keep the handler single-threaded or move the client config to request-local state instead of mutating os.environ?

Finding type: Logical Bugs | Severity: 🟠 Medium


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/test-helper-service/app.py around lines 138-138 where you call
`app.run(..., threaded=True)`, fix the race condition: Flask will handle overlapping
requests, but the helper routes/utils rely on process-global `os.environ` values for
OPik config that are mutated per request. Refactor by either removing `threaded=True` so
the server is single-threaded, or (better) eliminate per-request mutation of
`os.environ` by moving OPik workspace/URL/API key into request-local state (e.g., Flask
request context like `flask.g`) and pass that config to downstream helpers so requests
don’t clobber each other. Ensure routes/projects.py (81-87) and routes/utils.py
(19-25) stop reading/writing shared environment variables for request-specific values.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — concern doesn't apply with the final sharding architecture. Each CI shard runs its own isolated Flask helper process (separate runner, separate container), so there's no inter-process concurrency on os.environ to worry about. Within a single shard workers: 1 means one Playwright worker drives one Flask process, so the previous concern (4 Playwright workers hitting one Flask) is moot. The threaded=True flag is harmless in the current single-worker-per-shard setup.

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification — that makes sense with the shard-isolated Flask helper and single-worker setup. I'll save this to memory once the PR is merged.

@JetoPistola JetoPistola marked this pull request as draft May 19, 2026 05:52
@JetoPistola JetoPistola force-pushed the danield/OPIK-4911-parallelize-e2e-typescript-tests-in-ci branch from ae9d21f to f00cf7d Compare May 19, 2026 06:01
@JetoPistola JetoPistola changed the title [OPIK-4911] [INFRA] feat: parallelize TS E2E tests with 4 workers in CI [OPIK-4911] [INFRA] feat: shard E2E TypeScript tests in CI May 19, 2026
@JetoPistola JetoPistola force-pushed the danield/OPIK-4911-parallelize-e2e-typescript-tests-in-ci branch from f00cf7d to b386da8 Compare May 19, 2026 16:17
@JetoPistola JetoPistola changed the base branch from main to danield/OPIK-6028-build-docker-images-once-parallelize-tests May 19, 2026 16:17
JetoPistola added a commit that referenced this pull request May 19, 2026
…build-once

Cut Application E2E Tests (TypeScript) from ~30 min to a target ~8-10 min
by sharding across 4 GitHub matrix jobs that each pull a pre-built Opik
image from GHCR rather than rebuilding with ./opik.sh --build.

Workflow shape (end2end_suites_typescript.yml):
  build-opik (calls build_e2e_docker.yaml with include_frontend: true)
    -> run_suite matrix [1..4] (--shard=$i/4, workers: 1 per shard)
    -> merge-reports (downloads 4 blob artifacts, emits one HTML)

Each shard runs an isolated Opik stack (backend + python-backend +
frontend pulled from ghcr.io/comet-ml/opik/...:ci-e2e-<pr>), so the
single-backend contention that broke a previous workers:4 attempt
(40 failed / 18 flaky on PR #6763 history) is structurally impossible.

ALLURE_JOB_RUN_ID defaults to github.run_id so PR triggers aggregate
all 4 shards into one TestOps launch (previously env was only set on
workflow_dispatch input, leaving PR runs ungrouped).

build_e2e_docker.yaml: adds include_frontend boolean input (default
false). Builds and pushes ghcr.io/comet-ml/opik/opik-frontend:ci-e2e-<pr>
when set. Pattern mirrors include_guardrails.

cleanup_e2e_docker.yaml: opik-frontend added to IMAGES so the
cleanup-pr and cleanup-stale jobs sweep frontend tags too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JetoPistola JetoPistola force-pushed the danield/OPIK-4911-parallelize-e2e-typescript-tests-in-ci branch from b386da8 to fcefc8f Compare May 19, 2026 16:26
@github-actions github-actions Bot added Infrastructure and removed python Pull requests that update Python code tests Including test files, or tests related like configuration. typescript *.ts *.tsx labels May 19, 2026
…build-once

Cut Application E2E Tests (TypeScript) from ~30 min to a target ~8-10 min
by sharding across 4 GitHub matrix jobs that each pull a pre-built Opik
image from GHCR rather than rebuilding with ./opik.sh --build.

Workflow shape (end2end_suites_typescript.yml):
  build-opik (calls build_e2e_docker.yaml with include_frontend: true)
    -> run_suite matrix [1..4] (--shard=$i/4, workers: 1 per shard)
    -> merge-reports (downloads 4 blob artifacts, emits one HTML)

Each shard runs an isolated Opik stack (backend + python-backend +
frontend pulled from ghcr.io/comet-ml/opik/...:ci-e2e-<pr>), so the
single-backend contention that broke a previous workers:4 attempt
(40 failed / 18 flaky on PR #6763 history) is structurally impossible.

ALLURE_JOB_RUN_ID defaults to github.run_id so PR triggers aggregate
all 4 shards into one TestOps launch (previously env was only set on
workflow_dispatch input, leaving PR runs ungrouped).

build_e2e_docker.yaml: adds include_frontend boolean input (default
false). Builds and pushes ghcr.io/comet-ml/opik/opik-frontend:ci-e2e-<pr>
when set. Pattern mirrors include_guardrails.

cleanup_e2e_docker.yaml: opik-frontend added to IMAGES so the
cleanup-pr and cleanup-stale jobs sweep frontend tags too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JetoPistola JetoPistola force-pushed the danield/OPIK-4911-parallelize-e2e-typescript-tests-in-ci branch from fcefc8f to 25d475f Compare May 19, 2026 16:30
JetoPistola and others added 3 commits May 19, 2026 21:37
…ards

The previous attempt defaulted ALLURE_JOB_RUN_ID to github.run_id, which
allurectl interprets as "join this pre-existing job run." TestOps had
never heard of github.run_id, so every shard got 404 from allurectl
watch and the workflow exit 1'd before tests ran.

Switch to the documented parallel-jobs pattern:
  setup-allure  -> allurectl launch create, captures launch_id
  run_suite[N]  -> ALLURE_LAUNCH_ID env from needs.setup-allure outputs
  close-allure  -> allurectl launch close on always()

Drops the github.run_id fallback on ALLURE_JOB_RUN_ID (which still wins
when workflow_dispatch is invoked with an explicit input).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-results

Two small follow-ups after the first green sharded run:

1. Drop allure-results-{1..4} artifact uploads. Per-shard Allure JSON
   was redundant with the consolidated TestOps launch that all 4 shards
   already upload into via ALLURE_LAUNCH_ID. Saves ~300KB/run and
   declutters the artifacts list.

2. Upload tests_end_to_end/typescript-tests/test-results per shard when
   the shard fails. Without this, the merged Playwright HTML report's
   "Trace" / "Video" / "Screenshot" links point at files that were
   never uploaded, so reviewers debugging a failed shard see 404s.
   Guarded by if: failure() + if-no-files-found: ignore to avoid
   bloating storage on green runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first sharded run hit 20m59s with N=4. Per-shard runtime analysis
showed severe imbalance (shard 1: 506s, shard 2: 157s) caused by
Playwright's deterministic file-grouping landing fat specs together
rather than by lack of shard count.

The math says N=8 saves at most ~30s on the long-pole (still bounded by
the longest single spec file: trace-attachments.spec.ts at 298.5s).
Cheap to verify empirically: one CI run, then we have data to compare.

If the savings are modest, this confirms the shard-balance ticket is
the real next step. If they're material, we revisit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JetoPistola added a commit that referenced this pull request May 20, 2026
The build job's push step pushes images sequentially (~92s total: backend
20s + python-backend 66s + frontend 6s observed on PR #6763). Switch to
background pushes with wait + accumulated failure check.

Expected: wall time = max(push_time_per_image) instead of sum. On the
observed numbers that's ~66s instead of ~92s, saving ~26s per CI run
across all 6 consumer workflows + the sharded TS E2E suite.

Tag step kept sequential (metadata op, no measurable benefit from
parallelizing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The N=8 experiment confirmed shard count isn't the bottleneck: 19m13s
wall (vs N=4's 20m21s) with two empty runners burning CI minutes. The
real long-pole is two fat parametrized specs that each fully occupy a
single shard:

  online-scoring.spec.ts    12 parametrized tests x ~24s = 290s
  trace-attachments.spec.ts 32 parametrized tests x ~9s  = 298s

Together they account for 45% of total test CPU time. Playwright shards
at file granularity, so no shard count can split them; the only path
forward is breaking each into smaller files.

Splits:

  online-scoring.spec.ts (290s) ->
    online-scoring-rule-creation.spec.ts   (~92s, 6 parametrized tests)
    online-scoring-full-flow.spec.ts       (~198s, 6 parametrized tests)

  trace-attachments.spec.ts (298s) ->
    trace-attachments-low-level.spec.ts    (~90s, 10 parametrized)
    trace-attachments-decorator.spec.ts    (~99s, 10 parametrized + 1 sanity)
    trace-attachments-span.spec.ts         (~99s, 10 parametrized + 1 sanity)
    trace-attachments-shared.ts            (extracted ATTACHMENTS const)

Test tags (@fullregression / @sanity / @HappyPaths / @Tracing /
@attachments / @onlinescoring) preserved verbatim so grep-based suite
selection is unchanged.

Predicted long-pole drops from 506s -> 446s; predicted wall time drops
from ~20 min -> ~16 min. Shard 2 still owns both online-scoring halves
(alphabetically adjacent under online-scoring/) which is the new floor.

Also: revert N=8 -> N=4. The experiment showed N=8 wastes ~6
runner-minutes on empty shards 6 and 7 for only 1m08s wall savings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added tests Including test files, or tests related like configuration. typescript *.ts *.tsx labels May 20, 2026
console.log(`Moderation rule creation test completed successfully for ${providerConfig.display_name} - ${modelConfig.name}`);
});

test(`Online scoring full moderation flow with ${providerConfig.display_name} ${modelConfig.name} @sanity @happypaths @fullregression @onlinescoring`, async ({
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In online-scoring-full-flow.spec.ts, the parametrized test(�0nline scoring full moderation flow with ...�1, async (...) => { ... }) seems to do timeout/setup before the required test.info().annotations.push({ type: 'description', ... }), so the scenario may show up in reports without description metadata—should we move that annotation to the first statement?

Finding type: AI Coding Guidelines | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-full-flow.spec.ts
around line 21 (the parametrized `test('Online scoring full moderation flow with ...',
async (...) => { ... })` callback), ensure the test body starts by pushing the required
description annotation via `test.info().annotations.push({ type: 'description', ... })`
as the very first statement. Move any timeout/setup/waits that currently run before this
annotation to after the annotation so the metadata is always present in reports. Verify
the annotation payload matches the convention expected by
`.agents/skills/playwright-e2e/test-conventions.md`.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context — understood that this is a pre-existing test pattern and out of scope for this PR. I’ll save this to memory after the PR is merged.

Comment on lines +15 to +19
test(`Create moderation rule with ${providerConfig.display_name} ${modelConfig.name} @fullregression @onlinescoring`, async ({
page,
projectName,
helperClient,
createProjectApi,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop-generated online-scoring tests run work before the first test.info().annotations.push({ type: 'description', ... }), but the convention in .agents/skills/playwright-e2e/test-conventions.md expects that at the top—should we move/guarantee the description push before any other statements to avoid missing metadata in CI?

Finding type: AI Coding Guidelines | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-rule-creation.spec.ts
around lines 15-19 inside the `for (const { providerName, modelConfig, providerConfig }
of enabledModels)` generated `test(...)` callback, the test body starts work immediately
and does not push the required `test.info().annotations.push({ type: 'description', ...
})` as the first statement. Refactor by adding a description annotation call as the very
first line inside the async test function (before creating `AIProviderSetupHelper` /
`RulesPage` or calling `test.step`), using the provider/model values to populate the
message. Verify that the annotation is present for every loop-generated test and that CI
sees the description metadata.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense — this is a pre-existing test pattern and outside this PR’s scope. I’ll save that to memory once the PR is merged.

Comment on lines +26 to +30
await test.step('Setup AI provider if needed', async () => {
await providerSetupHelper.setupProviderIfNeeded(providerName, providerConfig);
});

await test.step('Navigate to project and rules tab', async () => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding the required “step text” comment right before each test.step() in .agents/skills/playwright-e2e, since the steps are emitted directly and the spec no longer matches the readability convention.

Finding type: AI Coding Guidelines | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-rule-creation.spec.ts
around lines 26-47, inside the `test( Create moderation rule with ... )` body, the
multiple `test.step()` calls (“Setup AI provider if needed”, “Navigate to project
and rules tab”, “Create moderation rule”, “Cleanup provider”) are missing the
required guideline comment. Refactor by adding a comment immediately before each `await
test.step('...')` with the exact same step text so the emitted step text is traceable.
Ensure every `test.step()` in that block has the preceding comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that context helps. I’ll remember this for similar specific-code findings once the PR is merged.

const ruleName = `Test Moderation Rule - ${modelConfig.name}`;
await rulesPage.createModerationRule(ruleName, providerConfig.display_name, modelConfig.ui_selector);

await expect(page.getByText(ruleName)).toBeVisible();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test bypasses the page-object by calling page.getByText(ruleName) directly, and RulesPage only exposes navigation/creation helpers—can we add a RulesPage assertion like checkRuleExists and use that instead?

Finding type: AI Coding Guidelines | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/online-scoring/online-scoring-rule-creation.spec.ts
around lines 37-43 (the “Create moderation rule” test.step), remove the raw locator
assertion `page.getByText(ruleName)` and instead assert rule presence through the
RulesPage page object. Refactor by adding a new RulesPage method (e.g.,
`checkRuleExists(ruleName)` or `expectRuleVisible(ruleName)`) inside the corresponding
page object, using whatever locator/selector the page object catalog allows for rules.
Then update this spec to call that new RulesPage assertion so selector ownership remains
in RulesPage and the test only orchestrates actions/assertions via page objects.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense — I’ll treat this as out of scope for PRs like this when it’s a pre-existing test-pattern issue already tracked separately. I’ll save this to memory once the PR is merged.

Comment on lines +8 to +12
base(`Trace attachment ${attachment.name} is correctly displayed via low-level client @fullregression @tracing @attachments`, async ({
page,
projectName,
helperClient,
}) => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we ensure each per-attachment low-level-client test pushes test.info().annotations.push({ type: 'description', ... }) before trace/navigate? Otherwise the convention in .agents/skills/playwright-e2e/test-conventions.md will be missed for looped cases and their reports may lack description.

Finding type: AI Coding Guidelines | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-low-level.spec.ts
around lines 8-12, inside the per-attachment `base(...)` test callback for `Trace
attachment ${attachment.name}`, add the required `test.info().annotations.push({ type:
'description', ... })` at the very start of the async test body (before the first
`base.step`). Populate the description with what the test is doing (e.g., creating a
trace with the given attachment via the low-level client and verifying the attachment
appears). This ensures all looped low-level cases comply with
`.agents/skills/playwright-e2e/test-conventions.md` report requirements.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context — understood that this is a pre-existing test pattern and out of scope for this PR, with follow-up tracked in OPIK-6610. I'll save this to memory once the PR is merged.

Comment on lines +18 to +22
await base.step('Create trace with span attachment', async () => {
const result = await helperClient.createTraceWithSpanAttachment(projectName, attachment.path);
attachmentName = result.attachmentName;
spanName = result.spanName;
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

helperClient.createTraceWithSpanAttachment(projectName, ...) runs before project setup — should we add an explicit "create project" step mirroring tracing.fixture.ts before invoking it?

Finding type: Logical Bugs | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
In tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-span.spec.ts around
lines 18-22 inside the first step "Create trace with span attachment", ensure the
project is created and ready before calling
helperClient.createTraceWithSpanAttachment(projectName, ...). The current code assumes
the backend auto-creates missing projects, but the helper-service route only forwards
project_name and does not create/wait for the project. Refactor by adding an explicit
"create project / wait for project exists" step (mirroring what tracing.fixture.ts does)
right after entering the test and before invoking createTraceWithSpanAttachment, so the
trace creation and subsequent span/attachment UI assertions run reliably.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification — that makes sense. I’ll save this to memory once the PR is merged.

Comment on lines +80 to +84
await base.step('Create trace with span attachment', async () => {
const result = await helperClient.createTraceWithSpanAttachment(projectName, firstAttachment.path);
attachmentName = result.attachmentName;
spanName = result.spanName;
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sanity test uses helperClient.createTraceWithSpanAttachment(projectName, ...) before the project exists, while tracing.fixture.ts creates/waits for the project; can we create the project first in this test too?

Finding type: Logical Bugs | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-span.spec.ts around
lines 53-112, inside the sanity test `Span attachments are correctly displayed within
traces (sanity)`, update the setup so a project exists before calling
`helperClient.createTraceWithSpanAttachment(projectName, firstAttachment.path)` in the
“Create trace with span attachment” step. Right now, `projectName` comes only from
the base fixture and no explicit project creation/wait happens before trace creation, so
trace creation can fail before the attachment assertions. Refactor by adding a project
creation/check-and-wait step (using the existing fixture utilities or the ProjectsPage
flow) before the trace creation step, ensuring the project is persisted and ready for
the helper-service call.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification — that makes sense. I’ll save this to memory once the PR is merged.

Comment on lines +8 to +12
base(`Trace attachment ${attachment.name} is correctly displayed via decorator @fullregression @tracing @attachments`, async ({
page,
projectName,
helperClient,
}) => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per-attachment decorator tests start with trace creation/navigation before test.info().annotations.push({ type: 'description', ... }), but .agents/skills/playwright-e2e/test-conventions.md wants that description annotation at the start of each test body—should we push it first for all looped decorator cases?

Finding type: AI Coding Guidelines | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-decorator.spec.ts
around lines 6-43 inside the looped `base(...)` test for each `ATTACHMENTS` item, add
the required `test.info().annotations.push({ type: 'description', description: ... })`
at the very start of the test body (right after the async callback begins), before
creating the trace or navigating pages. Refactor the description to mention the specific
attachment being tested (e.g., attachment.name/path) and the intended behavior
(trace-level attachment displayed via decorator). Ensure this annotation is pushed for
every loop iteration so the reports include the required `description` field.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — this is a pre-existing pattern in the codebase, not introduced by this PR. The original online-scoring.spec.ts / trace-attachments.spec.ts had the same shape (description annotations on @sanity tests only, not on loop-generated parametrized tests). This PR's scope is CI sharding + GHCR build-once, not test-convention compliance. Will be tracked in a follow-up ticket alongside the broader test-quality cleanup (see OPIK-6610).

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense — this is a pre-existing test pattern and outside this PR’s scope. I’ll save this to memory once the PR is merged.

Comment on lines +16 to +20
await base.step('Create trace with attachment via decorator', async () => {
attachmentName = await helperClient.createTraceWithAttachmentDecorator(projectName, attachment.path);
});

await base.step('Navigate to project traces page', async () => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

helperClient.createTraceWithAttachmentDecorator is called before the project exists — should we add a create+wait step (like tracing.fixture.ts does) before trace creation, and also add the required step-text comment before each base.step() call to match the documented readability convention?

Finding types: Logical Bugs AI Coding Guidelines | Severity: 🟠 Medium


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. 1. In
tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-decorator.spec.ts
around lines 16-18, refactor by adding an explicit "create project and wait until ready"
step before the trace-creation step, reusing the same approach/logic as the existing
tracing.fixture.ts (the one that calls createProject and waits for it). Do the same for
the sanity test around lines 68-70 so both scenarios reliably create the project before
creating the trace. 2. In the same file, add a comment line immediately before each
`base.step(...)` call that exactly contains the step text passed to `base.step`. This
applies to all `base.step(...)` calls in the first base.describe loop (lines 16-41) and
the sanity test block (lines 68-93).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification — that makes sense. I’ll remember this pattern and save it to memory once the PR is merged.

Comment on lines +68 to +70
await base.step('Create trace with attachment via decorator', async () => {
attachmentName = await helperClient.createTraceWithAttachmentDecorator(projectName, firstAttachment.path);
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sanity test calls helperClient.createTraceWithAttachmentDecorator(projectName, ...) without first creating/waiting for the project like tracing.fixture.ts does, so if the SDK doesn’t auto-create missing projects the spec can fail at trace creation—should we add the same project setup here too?

Finding type: Logical Bugs | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
tests_end_to_end/typescript-tests/tests/tracing/trace-attachments-decorator.spec.ts
around lines 68-70 within the sanity test `Trace attachments are correctly displayed for
traces created via decorator (sanity)`, the test calls
`helperClient.createTraceWithAttachmentDecorator(projectName, ...)` without ensuring the
project exists first (unlike the looped test which navigates and checks the project
before loading traces). Update this sanity test to explicitly create or wait for the
project using the existing fixtures/page objects (e.g., add a `ProjectsPage` flow to
`goto()` + `checkProjectExistsWithRetry(projectName, ...)` and/or call the same
project-setup helper that `tracing.fixture.ts` uses) before creating the trace. Then
keep the rest of the steps the same so trace creation won’t fail when the SDK/helper
service doesn’t auto-create missing projects.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — false positive. The test as base import comes from fixtures/tracing.fixture.ts, which extends base.fixture.ts that provides projectName via generateProjectName(). The Opik SDK auto-creates the project on first trace ingest, so an explicit pre-create step isn't required — this is the same pattern as the pre-split trace-attachments.spec.ts which passed CI consistently.

🤖 Reply posted via /address-github-pr-comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification — that makes sense. I’ll remember this pattern and save it to memory once the PR is merged.

JetoPistola and others added 2 commits May 20, 2026 10:51
# Conflicts:
#	.github/workflows/build_e2e_docker.yaml
…wall

Modeling against the post-split spec runtimes shows N=8 is the sweet
spot: long-pole drops from 446s (N=4) to 252s, no idle shards (vs the
2-empty-shard waste from the pre-split N=8 experiment). N=10+ caps at
223s long-pole — the unavoidable feedback-scores + online-scoring-full-flow
alphabetical pair — so N=16 would burn 60% more runner-minutes for zero
additional speedup vs N=10.

Predicted N=8 wall: ~15 min (vs the post-split N=4 measured 19 min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JetoPistola
Copy link
Copy Markdown
Contributor Author

JetoPistola commented May 20, 2026

image

Fern PRs related too, FYI @thiagohora @andrescrz , similar to BE tests sharding, now for FE

👋 @alexkuzmik @aadereiko @AndreiCautisanu — looking for a sign-off to merge at N=8 and iterate from there.

N=8 run came in

Run 26185619407 — full success, 8 parallel shards on fullregression.

Config Wall Long-pole shard Spread (fastest → slowest) Runner-min
N=4 + splits (currently shipped) 19m00s 10m20s n/a ~36
N=8 + splits (this run) 18m48s 9m38s (shard 2) 4m38s → 9m38s ~58

So N=8 is ~12s faster than N=4, at the cost of ~22 extra runner-min per PR. The win is smaller than predicted: Playwright's contiguous-block sharder reclustered online-scoring-full-flow with neighbours onto shard 2, while shard 4 finished in 4m38s and idled.

Why N≥10 doesn't help

The bottleneck is one spec inside the long-pole shard, not the shard count. online-scoring-full-flow = 199s of sequential LLM-moderation tests; adding shards doesn't subdivide it. We'd just pay for more idle runners. Math:

N Long-pole Wall Runner-min vs N=8
8 ~9m+ ~18 min ~58
10 ~9m+ ~18 min ~67 +9 runner-min, 0 wall
16 ~9m+ ~18 min ~107 +49 runner-min, 0 wall

Suggested path forward

Instead of more shards, I'd suggest we merge N=8 now and then iterate. The big remaining levers (in order):

  1. Parallelize online-scoring-full-flow — 6 sequential LLM-moderation tests inside one spec. test.describe.parallel or a faster moderation stub drops the spec from 199s → ~70-100s. Expected wall: ~14 min. Tracked in OPIK-6610 (Option 4).
  2. Split / parallelize trace-spans (173s) — same pattern. Would become the next long-pole if we don't.
  3. Cut the 8.5-min Docker build — runs on every PR with mostly cold caches. Layer-cache reuse from main, or build-once-per-day cached tags. Expected wall: ~10 min. No ticket yet — I can file one.
  4. Past that — ~5 min floor, but it requires rewriting individual tests or moving to faster runners.

We're already 30 → 18-20 min on PR. With (1)+(2) we can hit ~14, with (3) ~10, and the ~5 min goal is reachable but each step is its own project. Better to ship each rung, prove the number, then move on.

TL;DR

  • N=8 is the right place to land for now.
  • N=10/12/16 isn't worth the runner-min until the slowest spec is fixed.
  • Slow-test work tracked in OPIK-6610; Docker-build optimization needs a new ticket.

OK to merge as-is? Will follow up on OPIK-6610 next.

Diagram of the ladder + the per-shard time composition is at diagrams/opik-4911-next-steps-diagram.html in the repo — I'll paste an image of it in a follow-up comment.

@JetoPistola JetoPistola marked this pull request as ready for review May 20, 2026 20:12
@JetoPistola JetoPistola requested a review from a team as a code owner May 20, 2026 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Infrastructure tests Including test files, or tests related like configuration. typescript *.ts *.tsx

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant