Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The agent autoscaling group should never rebalance availability zones #751

Merged
merged 5 commits into from
Oct 22, 2020

Conversation

yob
Copy link
Contributor

@yob yob commented Oct 21, 2020

By default, if the two availability zones in the agent ASG become significantly unbalanced, the ASG will terminate some instances in the larger AZ and start some new ones in the smaller AZ.

That's helpful for an ASG serving web requests, but it's not very helpful for our agent-shared workloads - the instance termination can disrupt running jobs. With the new lambda based scaler, each instance is responsible for terminating itself and the AZs become unbalanced very easily. That's not much of a problem though - the larger AZ is likely to reduce in size relatively soon, and subsequent scale-outs will restore the balance (for a while).

Sadly there's no way to suspend the AZRebalance process via cloudformation, so I held my nose and implemented it using a custom resource. It's not as ugly as I feared, mainly because it's possible to provide the required lambda function inline.

An alternative approach would be to have our buildkite-agent-scaler lambda check the AZRebalance status each time it loops and suspend the process if required. I thought this approach might be good enough for now, and we could try the scaler option down the track if we need to.

Some resources I found useful:

  1. https://www.alexdebrie.com/posts/cloudformation-custom-resources/
  2. https://gist.github.com/atward/9573b9fbd3bfd6c453158c28356bec05
  3. https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_SuspendProcesses.html
  4. https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-lambda-function-code-cfnresponsemodule.html

By default, if the two availability zones in the agent ASG become
significantly unbalanced, the ASG will terminate some instances in the
larger AZ and start some new ones in the smaller AZ.

That's helpful for an ASG serving web requests, but it's not very
helpful for our agent-shared workloads - the instance termination can
disrupt running jobs. With the new lambda based scaler, each instance is
responsible for terminating itself and the AZs become unbalanced very
easily. That's not much of a problem though - the larger AZ is likely to
reduce in size relatively soon, and subsequent scale-outs will restore
the balance (for a while).

Sadly there's no way to suspend the AZRebalance process via
cloudformation, so I held my nose and implemented it using a custom
resource. It's not as ugly as I feared, mainly because it's possible to
provide the required lambda function inline.

An alternative approach would be to have our buildkite-agent-scaler
lambda check the AZRebalance status each time it loops and suspend the
process if required. I thought this approach might be good enough for
now, and we could try the scaler option down the track if we need to.

Some resources I found useful:

1. https://www.alexdebrie.com/posts/cloudformation-custom-resources/
2. https://gist.github.com/atward/9573b9fbd3bfd6c453158c28356bec05
3. https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_SuspendProcesses.html
4. https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-lambda-function-code-cfnresponsemodule.html
@yob
Copy link
Contributor Author

yob commented Oct 21, 2020

Here's a screenshot of the ASG in a newly created stack, built with this PR:

Screenshot from 2020-10-22 01-01-08

I've tested stack creation and deletion with the new custom resource, and both worked as expected.

- Effect: Allow
Action:
- 'autoscaling:SuspendProcesses'
Resource: '*'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this OK? Is there a way to only give permission to the specific ASG in this stack?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think it should be possible!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this:

diff --git a/templates/aws-stack.yml b/templates/aws-stack.yml
index 40490c9..a8de337 100644
--- a/templates/aws-stack.yml
+++ b/templates/aws-stack.yml
@@ -959,7 +959,7 @@ Resources:
           - Effect: Allow
             Action:
             - 'autoscaling:SuspendProcesses'
-            Resource: '*'
+            Resource: !GetAtt AgentAutoScaleGroup.Arn

.. but sadly it seems the AWS::AutoScaling::AutoScalingGroup type doesn't have an Arn property we can get:

$ STACK_NAME=jh-azrebalance7 aws-vault exec bk-sandbox-admin make create-stack      
...

An error occurred (ValidationError) when calling the CreateStack operation: Template error: resource AgentAutoScaleGroup does not support attribute type Arn in Fn::GetAtt
make: *** [Makefile:114: create-stack] Error 255

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aws-cloudformation/cloudformation-coverage-roadmap#548

I would also like to see this implemented so we can write IAM policies that limit access to a specific autoscaling group - otherwise, there is no way to target a specific group (as far as I can tell there is no way to get the UUID part of the group ARN and IAM won't take a * there).

.. and then:

Correction - Targetting specific autoscaling groups by their "friendly name" can work if we use wildcards in place of the region and account id (instead of just leaving these empty - like we do with other resources such as S3 objects). To this does work: !Sub arn:aws:autoscaling:*:*:autoScalingGroup:*:autoScalingGroupName/${LogicalGroupName}

Copy link
Member

@pda pda Oct 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh yeah, nothing but “friendly name”:

image

The wildcard approach is pretty legit, but you can be specific on some of them using pseudo parameters:

Resource: !Sub arn:${AWS::Partition}:autoscaling:${AWS::Region}:${AWS::AccountId}:autoScalingGroup:*:autoScalingGroupName/${AgentAutoScaleGroup}

@yob
Copy link
Contributor Author

yob commented Oct 21, 2020

Related: #700

@lox
Copy link
Contributor

lox commented Oct 21, 2020

This is great @yob, pleased at actually how brief the custom resource ends up being.

templates/aws-stack.yml Outdated Show resolved Hide resolved
Co-authored-by: Paul Annesley <paul@annesley.cc>
templates/aws-stack.yml Outdated Show resolved Hide resolved
Co-authored-by: Paul Annesley <paul@annesley.cc>
templates/aws-stack.yml Outdated Show resolved Hide resolved
Copy link
Contributor

@chloeruka chloeruka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@yob yob merged commit cc93588 into master Oct 22, 2020
@yob yob deleted the disable-az-rebalancing branch October 22, 2020 02:49
@pda
Copy link
Member

pda commented Oct 22, 2020

Nice 👍🏼

@yob yob mentioned this pull request May 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants