Worker Performance Testing on ECS #274
    docker push $ECR_REPO:$IMAGE_TAG
    echo "WORKER_IMAGE=$ECR_REPO:$IMAGE_TAG" >> $GITHUB_ENV

- name: Register worker task definition
Are we going to leave a bajillion "task definitions" lying around?
I'm no AWS expert, but I do know that one thing that's always a problem is cleaning stuff up.
Task definitions belong to a "task definition family". When you register new task definitions under a family, they are called "task definition revisions". A family is like a namespace for a bunch of different versions ("revisions") of a task definition. For example:
Task definition family: go-omes-worker
Task definition revisions:
- go-omes-worker:1 (initial)
- go-omes-worker:2 (registered a new task definition under the "go-omes-worker" family)
- go-omes-worker:3 (...)
As I understand it, there's no monetary cost to having many task definition revisions (they are free). AWS imposes a hard limit of 1M per task definition family. So there's no immediate concern here.
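For example, a rough sketch of how you could watch the revisions pile up under one family (the family name here is just the example from above):

```bash
# List the active revisions registered under the example task definition family.
aws ecs list-task-definitions \
  --family-prefix go-omes-worker \
  --status ACTIVE \
  --sort DESC \
  --query 'taskDefinitionArns[]' \
  --output text
```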
We could likely add a hook to clean up the task definitions in the same trap we use to stop the worker ECS task.
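Something like this, roughly (untested sketch; `WORKER_TASK_DEF_ARN` is a made-up variable name for whatever we capture when we register the revision):

```bash
# Extend the existing cleanup trap: stop the worker task, then deregister the
# task definition revision we registered for this run. Note that deregistering
# only marks the revision INACTIVE; it does not delete the family.
cleanup() {
  if [ -n "$WORKER_TASK_ARN" ] && [ -n "$ECS_CLUSTER" ]; then
    aws ecs stop-task --cluster "$ECS_CLUSTER" --task "$WORKER_TASK_ARN" \
      --region "${AWS_REGION:-us-west-2}" --reason "Client exited" || true
  fi
  if [ -n "$WORKER_TASK_DEF_ARN" ]; then
    aws ecs deregister-task-definition \
      --task-definition "$WORKER_TASK_DEF_ARN" \
      --region "${AWS_REGION:-us-west-2}" || true
  fi
}
trap cleanup EXIT
```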
Works for me.
      -t $ECR_REPO:$IMAGE_TAG \
      .

    docker push $ECR_REPO:$IMAGE_TAG
We probably also need to clean up old images at some point. We get charged for those I'd assume?
We have lifecycle policies in our ECR registry; they are:
- Delete untagged after 1 day (failed/partial pushes)
- Keep only the 2 most recent python-worker-* images
- Keep only the 2 most recent java-worker-* images
- Keep only the 2 most recent go-worker-* images
- Keep only the 2 most recent typescript-worker-* images
- Keep only the 2 most recent dotnet-worker-* images
- Keep only the 2 most recent omes-client* images
In principle, we could be really aggressive and delete all images after a day, since we're pushing them on each run anyway (I'm open to changing it if we want). The cost is pretty cheap regardless (100 GB of storage is about $10/month).
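For reference, the untagged rule and one of the worker-image rules look roughly like this in ECR lifecycle-policy terms (illustrative only; the repository name is made up and the JSON isn't copied verbatim from our config):

```bash
# Apply a lifecycle policy: expire untagged images after 1 day, and keep only
# the 2 most recent images tagged with the "go-worker-" prefix.
aws ecr put-lifecycle-policy \
  --repository-name omes-workers \
  --lifecycle-policy-text '{
    "rules": [
      {
        "rulePriority": 1,
        "description": "Delete untagged images (failed/partial pushes) after 1 day",
        "selection": {
          "tagStatus": "untagged",
          "countType": "sinceImagePushed",
          "countUnit": "days",
          "countNumber": 1
        },
        "action": { "type": "expire" }
      },
      {
        "rulePriority": 2,
        "description": "Keep only the 2 most recent go-worker-* images",
        "selection": {
          "tagStatus": "tagged",
          "tagPrefixList": ["go-worker-"],
          "countType": "imageCountMoreThan",
          "countNumber": 2
        },
        "action": { "type": "expire" }
      }
    ]
  }'
```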
Dope. That works.
      -t $ECR_REPO:$IMAGE_TAG \
      .

    docker push $ECR_REPO:$IMAGE_TAG
Is the fact that there's only one client image tag we overwrite potentially going to cause issues? Should we tag this with omes' git hash?
No, the existing image will still exist, but only the latest omes-client image will be tagged
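If we ever do want older client pushes to keep a stable, per-commit tag as well, it would presumably just be an extra tag at push time, something like this (illustrative only; the `omes-client` tag name and the use of `GITHUB_SHA` are assumptions):

```bash
# Additionally tag the client image with the short git hash so each push keeps
# a unique, human-readable tag alongside the moving "omes-client" tag.
docker tag "$ECR_REPO:omes-client" "$ECR_REPO:omes-client-${GITHUB_SHA:0:7}"
docker push "$ECR_REPO:omes-client-${GITHUB_SHA:0:7}"
```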
    cleanup() {
      if [ -n "$WORKER_TASK_ARN" ] && [ -n "$ECS_CLUSTER" ]; then
        echo "Stopping worker task: $WORKER_TASK_ARN"
        aws ecs stop-task --cluster "$ECS_CLUSTER" --task "$WORKER_TASK_ARN" --region "${AWS_REGION:-us-west-2}" --reason "Client exited" || true
      fi
    }
This file name is about Prometheus, but it's also just generally running omes and cleaning up the worker. Probably want a more generic file name.
    @@ -0,0 +1,81 @@
    # CLI image with Prometheus for scenarios that need worker metric scraping
This is mostly duped with the other Dockerfile. Can this one just use that one as a layer, or otherwise dedupe?
Yes, likely.
@picatz would you mind taking a look at this, to double-check I'm not leaking anything?
…rate dockerfile used because existing cli.Dockerfile uses a distroless image)
…heus endpoint to optionally push metrics to provided S3 bucket
… fallback default values
…o/java is master and for core-based languages its main)
… sdkbuild issue (npm ci with no package-lock.json), remove previous npx pnpm change (wrong - that was for the runner...)

Force-pushed from ecaaea1 to 6d67a08
(^ long commit blurb - rebased)
Closing, moved to a private repo.
What was changed

Add `run-scenario-ecs.yml` GHA workflow. Can be used to manually dispatch worker performance tests that run on cloud (AWS Fargate).

Some characteristics:
- `is-experiment` to distinguish from regular benchmark testing (useful for ad-hoc testing/experimenting, i.e. to test local perf-sensitive changes without disrupting signal from nightlies)

Some improvements to be made in the future:
NOTE: Leaving draft, to revert some changes before merging:
- `DURATION` should default to `5h` (shortened for testing)
- `TIMEOUT` should default to `5h30m` (shortened for testing)
- `Checkout Omes` branch (currently `per-version-taskdefs`, will need to be `main`)
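For reference, a manual dispatch of the workflow might look roughly like this (the lowercase input names are assumptions based on the description above, not confirmed against the workflow file):

```bash
# Manually dispatch the ECS scenario workflow with the intended defaults.
gh workflow run run-scenario-ecs.yml \
  -f is-experiment=true \
  -f duration=5h \
  -f timeout=5h30m
```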
Why?

Useful to run worker performance tests.

Numerous trial runs.