Skip to content

Conversation

@THardy98
Copy link
Contributor

@THardy98 THardy98 commented Dec 18, 2025

What was changed

Add run-scenario-ecs.yml GHA workflow.

Can be used to manually dispatch worker performance tests that run on cloud (AWS Fargate).

Some characteristics:

  • Worker runs in a separate Fargate task for isolation (separate firecracker vm)
  • Server instance runs in cloud
  • Can provide a GH commit to run against a specific version of worker
  • Can specify is-experiment to distinguish from regular benchmark testing (useful for ad-hoc testing/experimenting, i.e. to test local perf-sensitive changes without disrupting signal from nightlies)

Some improvements to be made in the future:

  • the workflow is quite long (though ~80 lines are inputs/env vars), could translate to code/scripts, and make it trigger-able from the CLI
  • formalize the ECS setup with IaC
  • handle worker crashes mid-way through the scenario (orphans the client)

NOTE:
Leaving draft, to revert some changes before merging:

  • DURATION should default to 5h (shortened for testing)
  • TIMEOUT should default to 5h30m (shortened for testing)
  • remove hook that triggers workflow on push
  • change the Checkout Omes branch (currently per-version-taskdefs, will need to be main)

Why?

Useful to run worker performance tests

  1. How was this tested:
    Numerous trial runs.

@THardy98 THardy98 requested a review from Sushisource December 18, 2025 09:53
docker push $ECR_REPO:$IMAGE_TAG
echo "WORKER_IMAGE=$ECR_REPO:$IMAGE_TAG" >> $GITHUB_ENV

- name: Register worker task definition
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to leave a bajillion "task definitions" lying around?

I'm no AWS expert but I do know one thing that's always a problem is cleaning stuff up.

Copy link
Contributor Author

@THardy98 THardy98 Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Task definitions belong to a "task definition family". When you register new task definitions under a family, they are called "task definition revisions". A family is like a namespace for a bunch of different versions ("revisions") of a task definition. For example:

Task definition family: go-omes-worker

Task definition revisions:
go-omes-worker:1 (initial)
go-omes-worker:2 (registered a new task definition under the "go-omes-worker" family)
go-omes-worker:3 (...)

As I understand it, there's no monetary cost to having many task definition revisions (they are free). AWS imposes a hard limit of 1M per task definition family. So there's no immediate concern here.

We could likely add a hook to clean up the task definitions in the same trap we use to stop the worker ECS task.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me.

-t $ECR_REPO:$IMAGE_TAG \
.

docker push $ECR_REPO:$IMAGE_TAG
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably also need to clean up old images at some point. We get charged for those I'd assume?

Copy link
Contributor Author

@THardy98 THardy98 Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have lifecycle policies in our ECR instance, they are:

  • Delete untagged after 1 day (failed/partial pushes)
  • Keep only the 2 most recent python-worker-* images
  • Keep only the 2 most recent java-worker-* images
  • Keep only the 2 most recent go-worker-* images
  • Keep only the 2 most recent typescript-worker-* images
  • Keep only the 2 most recent dotnet-worker-* images
  • Keep only the 2 most recent omes-client* images

In principle, we could be really aggressive and delete all images after a day, since we're pushing them on each run anyway (i'm open to changing it if we want). The cost is pretty cheap regardless (100GB of storage is $10)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dope. That works.

-t $ECR_REPO:$IMAGE_TAG \
.

docker push $ECR_REPO:$IMAGE_TAG
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the fact that there's only one client image we re-write potentially going to cause issues? Should we tag this with omes' git hash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the existing image will still exist, but only the latest omes-client image will be tagged

Comment on lines +5 to +10
cleanup() {
if [ -n "$WORKER_TASK_ARN" ] && [ -n "$ECS_CLUSTER" ]; then
echo "Stopping worker task: $WORKER_TASK_ARN"
aws ecs stop-task --cluster "$ECS_CLUSTER" --task "$WORKER_TASK_ARN" --region "${AWS_REGION:-us-west-2}" --reason "Client exited" || true
fi
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file name is about prometheus, but it's also just generally running omes and cleaning up the worker. Probably want a more generic file name

@@ -0,0 +1,81 @@
# CLI image with Prometheus for scenarios that need worker metric scraping
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly duped with the other docker file. Can this one just use that one as a layer, or otherwise dedupe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes likely

@THardy98 THardy98 requested a review from picatz December 19, 2025 00:42
@THardy98
Copy link
Contributor Author

@picatz would you mind taking a look at this, double check i'm not leaking anything?

…rate dockerfile used because existing cli.Dockerfile uses a distroless image)
…heus endpoint to optionally push metrics to provided S3 bucket
…o/java is master and for core-based languages its main)
… sdkbuild issue (npm ci with no package-lock.json), remove previous npx pnpm change (wrong - that was for the runner...)
@THardy98 THardy98 force-pushed the per-version-taskdefs branch from ecaaea1 to 6d67a08 Compare December 19, 2025 14:52
@THardy98
Copy link
Contributor Author

(^ long commit blurb - rebased)

@THardy98 THardy98 requested a review from mjameswh December 19, 2025 20:50
@THardy98
Copy link
Contributor Author

THardy98 commented Dec 20, 2025

Closing, moved to a private repo
(deleting branch)

@THardy98 THardy98 closed this Dec 20, 2025
@THardy98 THardy98 deleted the per-version-taskdefs branch December 20, 2025 04:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants