Worker Performance Testing on ECS #274

THardy98 · 2025-12-18T09:53:19Z

What was changed

Add run-scenario-ecs.yml GHA workflow.

Can be used to manually dispatch worker performance tests that run in the cloud (AWS Fargate).

Some characteristics:

Worker runs in a separate Fargate task for isolation (separate Firecracker VM)
Server instance runs in the cloud
Can provide a GH commit to run against a specific version of the worker
Can specify is-experiment to distinguish a run from regular benchmark testing (useful for ad-hoc testing/experimenting, e.g. to test local perf-sensitive changes without disrupting the signal from nightlies)
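
For reference, a manually dispatched workflow can also be started from a terminal with the GitHub CLI. The sketch below only builds and prints the command. It assumes the workflow declares workflow_dispatch inputs; is-experiment is the input named above, while sdk-commit is a hypothetical input name standing in for the GH commit input.

```shell
#!/usr/bin/env bash
# Build the `gh workflow run` command for a manual dispatch.
# "is-experiment" comes from the PR description; "sdk-commit" is a
# hypothetical input name used only for illustration.
build_dispatch_cmd() {
  local commit="$1" experiment="${2:-true}"
  printf 'gh workflow run run-scenario-ecs.yml -f sdk-commit=%s -f is-experiment=%s\n' \
    "$commit" "$experiment"
}

# Print the command for review; it is not executed here.
build_dispatch_cmd "abc1234" true
```

Printing rather than executing keeps the sketch safe to run anywhere; piping the output to a shell would perform the actual dispatch.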

Some improvements to be made in the future:

the workflow is quite long (though ~80 lines are inputs/env vars); it could be translated to code/scripts and made triggerable from the CLI
formalize the ECS setup with IaC
handle worker crashes mid-way through the scenario (orphans the client)

NOTE:
Leaving as draft, since some changes need to be reverted before merging:

DURATION should default to 5h (shortened for testing)
TIMEOUT should default to 5h30m (shortened for testing)
remove hook that triggers workflow on push
change the Checkout Omes branch (currently per-version-taskdefs, will need to be main)

Why?

Useful to run worker performance tests

How was this tested:
Numerous trial runs.

Sushisource · 2025-12-18T18:03:14Z

.github/workflows/run-scenario-ecs.yml

+          docker push $ECR_REPO:$IMAGE_TAG
+          echo "WORKER_IMAGE=$ECR_REPO:$IMAGE_TAG" >> $GITHUB_ENV
+
+      - name: Register worker task definition


Are we going to leave a bajillion "task definitions" lying around?

I'm no AWS expert but I do know one thing that's always a problem is cleaning stuff up.

Task definitions belong to a "task definition family". When you register new task definitions under a family, they are called "task definition revisions". A family is like a namespace for a bunch of different versions ("revisions") of a task definition. For example:

Task definition family: go-omes-worker
Task definition revisions:
go-omes-worker:1 (initial)
go-omes-worker:2 (registered a new task definition under the "go-omes-worker" family)
go-omes-worker:3 (...)

As I understand it, there's no monetary cost to having many task definition revisions (they are free). AWS imposes a hard limit of 1M revisions per task definition family. So there's no immediate concern here.

We could likely add a hook to clean up the task definitions in the same trap we use to stop the worker ECS task.
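
A minimal sketch of what that could look like, extending the cleanup function from cli-prometheus-entrypoint.sh. WORKER_TASK_DEF_ARN is a hypothetical variable that the workflow would have to pass in; it is not part of the current script. Note that deregistering marks a revision INACTIVE rather than deleting it.

```shell
#!/usr/bin/env bash
# Sketch: stop the worker task and deregister its task definition revision
# in the same trap. WORKER_TASK_DEF_ARN is a hypothetical variable.
cleanup() {
  local region="${AWS_REGION:-us-west-2}"
  if [ -n "${WORKER_TASK_ARN:-}" ] && [ -n "${ECS_CLUSTER:-}" ]; then
    echo "Stopping worker task: $WORKER_TASK_ARN"
    aws ecs stop-task --cluster "$ECS_CLUSTER" --task "$WORKER_TASK_ARN" \
      --region "$region" --reason "Client exited" || true
  fi
  if [ -n "${WORKER_TASK_DEF_ARN:-}" ]; then
    echo "Deregistering task definition: $WORKER_TASK_DEF_ARN"
    # Discard the JSON response; failures must not mask the client's exit.
    aws ecs deregister-task-definition --task-definition "$WORKER_TASK_DEF_ARN" \
      --region "$region" >/dev/null || true
  fi
}
# Run cleanup whenever the entrypoint exits.
trap cleanup EXIT
```

Both calls end in `|| true` so a failed cleanup never changes the exit status of the scenario run.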

Works for me.

Sushisource · 2025-12-18T18:06:47Z

.github/workflows/run-scenario-ecs.yml

+            -t $ECR_REPO:$IMAGE_TAG \
+            .
+
+          docker push $ECR_REPO:$IMAGE_TAG


We probably also need to clean up old images at some point. We get charged for those I'd assume?

We have lifecycle policies in our ECR instance, they are:

Delete untagged after 1 day (failed/partial pushes)

Keep only the 2 most recent python-worker-* images

Keep only the 2 most recent java-worker-* images

Keep only the 2 most recent go-worker-* images

Keep only the 2 most recent typescript-worker-* images

Keep only the 2 most recent dotnet-worker-* images

Keep only the 2 most recent omes-client* images

In principle, we could be really aggressive and delete all images after a day, since we're pushing them on each run anyway (I'm open to changing it if we want). The cost is pretty low regardless (100GB of storage is about $10 per month).
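
For illustration, the rules above could be expressed as an ECR lifecycle policy document along these lines. This is a sketch, not the policy actually deployed: it assumes the patterns are image tag prefixes within a single repository, and the rule descriptions and priorities are made up.

```shell
#!/usr/bin/env bash
# Emit an ECR lifecycle policy as JSON: expire untagged images after 1 day,
# then keep only the 2 newest images for each tag prefix given as an argument.
# Assumes the prefixes are image tag prefixes (an assumption, see above).
lifecycle_policy() {
  local priority=2 prefix
  printf '{"rules":[\n'
  printf '  {"rulePriority":1,"description":"Expire untagged after 1 day",'
  printf '"selection":{"tagStatus":"untagged","countType":"sinceImagePushed",'
  printf '"countUnit":"days","countNumber":1},"action":{"type":"expire"}}'
  for prefix in "$@"; do
    printf ',\n  {"rulePriority":%d,"description":"Keep 2 newest %s images",' \
      "$priority" "$prefix"
    printf '"selection":{"tagStatus":"tagged","tagPrefixList":["%s"],' "$prefix"
    printf '"countType":"imageCountMoreThan","countNumber":2},'
    printf '"action":{"type":"expire"}}'
    priority=$((priority + 1))
  done
  printf '\n]}\n'
}

# Print the policy for the prefixes listed above.
lifecycle_policy python-worker- java-worker- go-worker- \
  typescript-worker- dotnet-worker- omes-client
```

The output could be saved to a file and applied with `aws ecr put-lifecycle-policy --repository-name <repo> --lifecycle-policy-text file://<file>`, where the repository name is left as a placeholder.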

Dope. That works.

Sushisource · 2025-12-18T18:07:15Z

.github/workflows/run-scenario-ecs.yml

+            -t $ECR_REPO:$IMAGE_TAG \
+            .
+
+          docker push $ECR_REPO:$IMAGE_TAG


Is the fact that there's only one client image we re-write potentially going to cause issues? Should we tag this with omes' git hash?

No. The previous image will still exist, but only the latest omes-client image will carry the tag.
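
If per-commit tags were wanted anyway, one option is to push each client image under both a hash-specific tag and a moving tag. A rough sketch follows; both tag formats are hypothetical and not what the workflow currently does.

```shell
#!/usr/bin/env bash
# Print the tags to apply to a client image: one tag carrying the Omes
# commit hash and one moving tag. Both tag formats are hypothetical.
client_image_tags() {
  local repo="$1" sha="$2"
  printf '%s:omes-client-%s\n' "$repo" "$sha"
  printf '%s:omes-client-latest\n' "$repo"
}

# Example wiring (not executed here):
#   sha="$(git rev-parse --short HEAD)"
#   for tag in $(client_image_tags "$ECR_REPO" "$sha"); do
#     docker tag omes-client "$tag" && docker push "$tag"
#   done
client_image_tags "my-registry/omes" "abc1234"
```

Both tags share the omes-client prefix, so a prefix-based lifecycle rule would still apply to them.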

Sushisource · 2025-12-18T18:10:34Z

dockerfiles/cli-prometheus-entrypoint.sh

+cleanup() {
+  if [ -n "$WORKER_TASK_ARN" ] && [ -n "$ECS_CLUSTER" ]; then
+    echo "Stopping worker task: $WORKER_TASK_ARN"
+    aws ecs stop-task --cluster "$ECS_CLUSTER" --task "$WORKER_TASK_ARN" --region "${AWS_REGION:-us-west-2}" --reason "Client exited" || true
+  fi
+}


This file name is about Prometheus, but the script also runs omes in general and cleans up the worker. Probably want a more generic file name.

Sushisource · 2025-12-18T18:12:03Z

dockerfiles/cli-prometheus.Dockerfile

@@ -0,0 +1,81 @@
+# CLI image with Prometheus for scenarios that need worker metric scraping


This is mostly duped from the other Dockerfile. Can this one just use that one as a layer, or otherwise dedupe?

THardy98 · 2025-12-19T00:43:18Z

@picatz would you mind taking a look at this to double-check that I'm not leaking anything?

THardy98 · 2025-12-19T14:52:50Z

(^ long commit blurb - rebased)

THardy98 · 2025-12-20T04:24:09Z

Closing, moved to a private repo
(deleting branch)

THardy98 requested a review from Sushisource December 18, 2025 09:53

semgrep-managed-scans bot reviewed Dec 18, 2025

Sushisource reviewed Dec 18, 2025

THardy98 requested a review from picatz December 19, 2025 00:42

THardy98 added 25 commits December 19, 2025 09:49

added dirs

fd96bfa

removed containers dir - can reuse existing worker dockerfiles

e55b89c

add dockerfile for omes-client that includes prometheus sidecar (separate dockerfile used because existing cli.Dockerfile uses a distroless image)

edd1434

add ecs tasks

7aa8fd6

add github action to run omes scenarios on ECS

4d35a1e

fix --prom-instance-config flag, was not reading arg correctly before

3f049aa

bump prom version in cli-prometheus.Dockerfile

26ce86b

Add workflow to run worker perf tests on ECS. Add logic in cli-prometheus endpoint to optionally push metrics to provided S3 bucket

e7df26f

Add secret variable (correct naming)

cc1e7ad

Allow for push/nightly triggers by making input not required. Provide fallback default values

8707222

use ubuntu-latest

0c075df

checkout sdk debugging

f0d9992

use inputs directly in Checkout SDK

e6cd8f3

disable recursive

a900c02

dynamically determine SDK_GITREF (necessary because the default for go/java is master and for core-based languages its main)

ec71fea

add gh token

564722d

use ecs-perf-test for testing (has cli-prometheus Dockerfile)

cf52739

checkout Omes recursive (forgot upstream proto submodule)

dfce1bf

add missing brackets around string concat in jq

99ee33c

generate task definitions per worker image

fd382f3

handle authHeader as apikey for python worker

0f7674d

add task role arn to task definition, misc fixes

482db36

test non-Go workers, add fixes to read --auth-header flag as api key

3f1e89b

fixes for TS worker, included a PREBUILD_STEP due to avoid downstream sdkbuild issue (npm ci with no package-lock.json), remove previous npx pnpm change (wrong - that was for the runner...)

688c3d5

long-running test on Go SDK v1.38.0

1fc55ad

THardy98 added 6 commits December 19, 2025 09:51

revert changes for testing

c11fd24

fix for semgrep scan

cf7b6eb

(python) trigger long-running test

3aab2d6

(ts) trigger long-running test

95b1332

(dotnet) trigger long-running test

1161858

rebase on main

6d67a08

THardy98 force-pushed the per-version-taskdefs branch from ecaaea1 to 6d67a08 Compare December 19, 2025 14:52

THardy98 added 4 commits December 19, 2025 09:56

copy ./metrics for cli-prometheus dockerfile

4c42cf3

linting / formatting

43bdeee

remove redundant flag set overriding flags

52270af

rebase overwrote fix for prom-config.yml path

1f0b2c7

THardy98 requested a review from mjameswh December 19, 2025 20:50

THardy98 closed this Dec 20, 2025

THardy98 deleted the per-version-taskdefs branch December 20, 2025 04:24
