aws: Graceful handling of EC2 detach errors #10740

Merged
aws: Graceful handling of EC2 detach errors
Sometimes, we observe the following error during a rolling update:

error detaching instance "i-XXXX", node "ip-10-X-X-X.ec2.internal": error detaching instance "i-XXXX": ValidationError: The instance i-XXXX is not part of Auto Scaling group XXXXX

The sequence of events that leads to this problem is the following:

- A new ASG object is being built from the launch template
- Existing instances are being added to it
- An existing instance is being ignored because it's already terminating
W0205 08:01:32.593377     191 aws_cloud.go:791] ignoring instance as it is terminating: i-XXXX in autoscaling group: XXXX
- Due to maxSurge, the rolling update then tries to detach the terminating
  instance from the autoscaling group, and the detach fails.

As such, when an EC2 ASG detach fails, we can simply try to detach
the next node instead of aborting the whole update operation.
hwoarang committed Mar 5, 2021
commit 0a49650c70c1c9ed05a0aa85ec9aea3add823afa
8 changes: 6 additions & 2 deletions pkg/instancegroups/instancegroups.go
@@ -134,11 +134,15 @@ func (c *RollingUpdateCluster) rollingUpdateInstanceGroup(group *cloudinstances.
 	update = prioritizeUpdate(update)

 	if maxSurge > 0 && !c.CloudOnly {
+		skippedNodes := 0
 		for numSurge := 1; numSurge <= maxSurge; numSurge++ {
-			u := update[len(update)-numSurge]
+			u := update[len(update)-numSurge+skippedNodes]
 			if u.Status != cloudinstances.CloudInstanceStatusDetached {
 				if err := c.detachInstance(u); err != nil {
-					return err
+					// If detaching a node fails, we simply proceed to the next one instead of
+					// bubbling up the error.
+					skippedNodes++
+					numSurge--
 				}

 				// If noneReady, wait until after one node is detached and its replacement validates