-
Notifications
You must be signed in to change notification settings - Fork 282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The agent autoscaling group should never rebalance availability zones #751
Conversation
By default, if the two availability zones in the agent ASG become significantly unbalanced, the ASG will terminate some instances in the larger AZ and start some new ones in the smaller AZ. That's helpful for an ASG serving web requests, but it's not very helpful for our agent-shared workloads - the instance termination can disrupt running jobs. With the new lambda based scaler, each instance is responsible for terminating itself and the AZs become unbalanced very easily. That's not much of a problem though - the larger AZ is likely to reduce in size relatively soon, and subsequent scale-outs will restore the balance (for a while). Sadly there's no way to suspend the AZRebalance process via cloudformation, so I held my nose and implemented it using a custom resource. It's not as ugly as I feared, mainly because it's possible to provide the required lambda function inline. An alternative approach would be to have our buildkite-agent-scaler lambda check the AZRebalance status each time it loops and suspend the process if required. I thought this approach might be good enough for now, and we could try the scaler option down the track if we need to. Some resources I found useful: 1. https://www.alexdebrie.com/posts/cloudformation-custom-resources/ 2. https://gist.github.com/atward/9573b9fbd3bfd6c453158c28356bec05 3. https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_SuspendProcesses.html 4. https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-lambda-function-code-cfnresponsemodule.html
templates/aws-stack.yml
Outdated
- Effect: Allow | ||
Action: | ||
- 'autoscaling:SuspendProcesses' | ||
Resource: '*' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this OK? Is there a way to only give permission to the specific ASG in this stack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think it should be possible!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this:
diff --git a/templates/aws-stack.yml b/templates/aws-stack.yml
index 40490c9..a8de337 100644
--- a/templates/aws-stack.yml
+++ b/templates/aws-stack.yml
@@ -959,7 +959,7 @@ Resources:
- Effect: Allow
Action:
- 'autoscaling:SuspendProcesses'
- Resource: '*'
+ Resource: !GetAtt AgentAutoScaleGroup.Arn
.. but sadly it seems the AWS::AutoScaling::AutoScalingGroup
type doesn't have an Arn property we can get:
$ STACK_NAME=jh-azrebalance7 aws-vault exec bk-sandbox-admin make create-stack
...
An error occurred (ValidationError) when calling the CreateStack operation: Template error: resource AgentAutoScaleGroup does not support attribute type Arn in Fn::GetAtt
make: *** [Makefile:114: create-stack] Error 255
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aws-cloudformation/cloudformation-coverage-roadmap#548
I would also like to see this implemented so we can write IAM policies that limit access to a specific autoscaling group - otherwise, there is no way to target a specific group (as far as I can tell there is no way to get the UUID part of the group ARN and IAM won't take a * there).
.. and then:
Correction - Targetting specific autoscaling groups by their "friendly name" can work if we use wildcards in place of the region and account id (instead of just leaving these empty - like we do with other resources such as S3 objects). To this does work:
!Sub arn:aws:autoscaling:*:*:autoScalingGroup:*:autoScalingGroupName/${LogicalGroupName}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh yeah, nothing but “friendly name”:
The wildcard approach is pretty legit, but you can be specific on some of them using pseudo parameters:
Resource: !Sub arn:${AWS::Partition}:autoscaling:${AWS::Region}:${AWS::AccountId}:autoScalingGroup:*:autoScalingGroupName/${AgentAutoScaleGroup}
Related: #700 |
This is great @yob, pleased at actually how brief the custom resource ends up being. |
Co-authored-by: Paul Annesley <paul@annesley.cc>
Co-authored-by: Paul Annesley <paul@annesley.cc>
8758e9c
to
db1040c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Nice 👍🏼 |
By default, if the two availability zones in the agent ASG become significantly unbalanced, the ASG will terminate some instances in the larger AZ and start some new ones in the smaller AZ.
That's helpful for an ASG serving web requests, but it's not very helpful for our agent-shared workloads - the instance termination can disrupt running jobs. With the new lambda based scaler, each instance is responsible for terminating itself and the AZs become unbalanced very easily. That's not much of a problem though - the larger AZ is likely to reduce in size relatively soon, and subsequent scale-outs will restore the balance (for a while).
Sadly there's no way to suspend the AZRebalance process via cloudformation, so I held my nose and implemented it using a custom resource. It's not as ugly as I feared, mainly because it's possible to provide the required lambda function inline.
An alternative approach would be to have our buildkite-agent-scaler lambda check the AZRebalance status each time it loops and suspend the process if required. I thought this approach might be good enough for now, and we could try the scaler option down the track if we need to.
Some resources I found useful: