The agent autoscaling group should never rebalance availability zones #751
Conversation
By default, if the two availability zones in the agent ASG become significantly unbalanced, the ASG will terminate some instances in the larger AZ and start some new ones in the smaller AZ. That's helpful for an ASG serving web requests, but it's not very helpful for our agent-shared workloads - the instance termination can disrupt running jobs. With the new lambda based scaler, each instance is responsible for terminating itself and the AZs become unbalanced very easily. That's not much of a problem though - the larger AZ is likely to reduce in size relatively soon, and subsequent scale-outs will restore the balance (for a while).

Sadly there's no way to suspend the AZRebalance process via CloudFormation, so I held my nose and implemented it using a custom resource. It's not as ugly as I feared, mainly because it's possible to provide the required lambda function inline.

An alternative approach would be to have our buildkite-agent-scaler lambda check the AZRebalance status each time it loops and suspend the process if required. I thought this approach might be good enough for now, and we could try the scaler option down the track if we need to.

Some resources I found useful:

1. https://www.alexdebrie.com/posts/cloudformation-custom-resources/
2. https://gist.github.com/atward/9573b9fbd3bfd6c453158c28356bec05
3. https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_SuspendProcesses.html
4. https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-lambda-function-code-cfnresponsemodule.html
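For anyone following along, here's a minimal sketch of what an inline-lambda custom resource for this can look like. The AgentAutoScaleGroup reference comes from the template; everything else (resource names, role, runtime version) is illustrative rather than exactly what this PR ships:

Resources:
  # Illustrative names - the real template may differ.
  SuspendAZRebalanceFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Runtime: python3.8
      Timeout: 30
      Role: !GetAtt SuspendAZRebalanceRole.Arn  # assumed execution role, defined elsewhere
      Code:
        ZipFile: |
          import boto3
          import cfnresponse

          def handler(event, context):
              try:
                  # Suspend AZRebalance when the resource is created or updated;
                  # do nothing on Delete so stack teardown isn't blocked.
                  if event['RequestType'] in ('Create', 'Update'):
                      asg = event['ResourceProperties']['AutoScalingGroupName']
                      boto3.client('autoscaling').suspend_processes(
                          AutoScalingGroupName=asg,
                          ScalingProcesses=['AZRebalance'],
                      )
                  cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
              except Exception:
                  cfnresponse.send(event, context, cfnresponse.FAILED, {})

  SuspendAZRebalance:
    Type: Custom::SuspendAZRebalance
    Properties:
      ServiceToken: !GetAtt SuspendAZRebalanceFunction.Arn
      AutoScalingGroupName: !Ref AgentAutoScaleGroup

The function's execution role needs autoscaling:SuspendProcesses, which is what the policy statement discussed below grants.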
templates/aws-stack.yml
Outdated
- Effect: Allow
  Action:
    - 'autoscaling:SuspendProcesses'
  Resource: '*'
is this OK? Is there a way to only give permission to the specific ASG in this stack?
I would think it should be possible!
I tried this:
diff --git a/templates/aws-stack.yml b/templates/aws-stack.yml
index 40490c9..a8de337 100644
--- a/templates/aws-stack.yml
+++ b/templates/aws-stack.yml
@@ -959,7 +959,7 @@ Resources:
- Effect: Allow
Action:
- 'autoscaling:SuspendProcesses'
- Resource: '*'
+ Resource: !GetAtt AgentAutoScaleGroup.Arn
.. but sadly it seems the AWS::AutoScaling::AutoScalingGroup type doesn't have an Arn property we can get:
$ STACK_NAME=jh-azrebalance7 aws-vault exec bk-sandbox-admin make create-stack
...
An error occurred (ValidationError) when calling the CreateStack operation: Template error: resource AgentAutoScaleGroup does not support attribute type Arn in Fn::GetAtt
make: *** [Makefile:114: create-stack] Error 255
aws-cloudformation/cloudformation-coverage-roadmap#548
I would also like to see this implemented so we can write IAM policies that limit access to a specific autoscaling group - otherwise, there is no way to target a specific group (as far as I can tell there is no way to get the UUID part of the group ARN and IAM won't take a * there).
.. and then:
Correction - Targeting specific autoscaling groups by their "friendly name" can work if we use wildcards in place of the region and account id (instead of just leaving these empty - like we do with other resources such as S3 objects). So this does work:
!Sub arn:aws:autoscaling:*:*:autoScalingGroup:*:autoScalingGroupName/${LogicalGroupName}
Huh yeah, nothing but “friendly name”:
The wildcard approach is pretty legit, but you can be specific on some of them using pseudo parameters:
Resource: !Sub arn:${AWS::Partition}:autoscaling:${AWS::Region}:${AWS::AccountId}:autoScalingGroup:*:autoScalingGroupName/${AgentAutoScaleGroup}
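Putting that together with the statement above, the scoped policy could look something like this (a sketch - the wildcard stays only in the UUID segment of the ARN, since there's no way to resolve that from the template):

- Effect: Allow
  Action:
    - 'autoscaling:SuspendProcesses'
  Resource: !Sub arn:${AWS::Partition}:autoscaling:${AWS::Region}:${AWS::AccountId}:autoScalingGroup:*:autoScalingGroupName/${AgentAutoScaleGroup}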
Related: #700
This is great @yob, pleased by how brief the custom resource actually ends up being.
Co-authored-by: Paul Annesley <paul@annesley.cc>
Force-pushed from 8758e9c to db1040c
Looks good to me!
Nice 👍🏼