Description
Steps to reproduce:
1. Set `ScaleInIdlePeriod` to some number greater than 0 (I used 6 to speed up the feedback).
2. Set `TerminateInstanceAfterJob` to false.
3. […] (`terminate-instance-in-auto-scaling-group` fails)

Observed behaviour: After the ScaleInIdlePeriod the EC2 instance is running but the buildkite-agent is in the `SERVICE_STOPPED` state.
Expected behaviour: The EC2 instance is running and the buildkite-agent is in the `SERVICE_STARTED` state.

Analysis
This bug was likely introduced in c3ebaa5
First, it's important to understand how the service is configured in Windows. We configure the default behaviour on exit to be Restart, with a 10s delay.
We also configure the terminate-instance script to run once the service stops.
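For illustration, the configuration described above could be expressed with nssm commands roughly like the following. This is a sketch, not the stack's actual setup: the service name `buildkite-agent` is taken from the log output, the `Exit/Post` event is an assumption about which hook is used, and the script path is a placeholder.

```powershell
# Restart the service whenever the agent process exits (default exit action).
nssm set buildkite-agent AppExit Default Restart

# Wait 10 seconds before restarting (the value is in milliseconds).
nssm set buildkite-agent AppRestartDelay 10000

# Run the terminate-instance script after the agent process exits.
# Assumption: Exit/Post is the hook in use; the path is a placeholder.
nssm set buildkite-agent AppEvents Exit/Post "powershell.exe -File C:\path\to\terminate-instance.ps1"
```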
Here's the sequence of events I've pieced together from the log output, the nssm source code and the Windows documentation:
1. […] `buildkite-agent: Unexpected status SERVICE_START_PENDING in response to STOP control.` but queues the stop command.
2. The `terminate-instance-in-auto-scaling-group` API call fails because the ASG is already at its MinSize.
3. […] `START: An instance of the service is already running.`
Changes
Don't attempt to stop the agent before terminating the instance. The stop is asynchronous, so it doesn't complete before the start command is issued.
Because of the restart delay it's unlikely to start and pick up a job before the ASG can scale it down, so the stop is not necessary.
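A minimal sketch of what the terminate step could look like with the stop removed, assuming a PowerShell script with the AWS CLI on the PATH and a default region configured. The instance ID lookup (via the IMDSv1 metadata endpoint) and the log message are assumptions for illustration, not the stack's actual script.

```powershell
# Hypothetical sketch, not the stack's real script.
# Assumes IMDSv1 is reachable; with IMDSv2 enforced a session token is required.
$InstanceId = Invoke-RestMethod -Uri "http://169.254.169.254/latest/meta-data/instance-id"

# No `nssm stop buildkite-agent` here: the stop is asynchronous and may still
# be pending when the start command is issued.
aws autoscaling terminate-instance-in-auto-scaling-group `
  --instance-id $InstanceId `
  --should-decrement-desired-capacity

if ($LASTEXITCODE -ne 0) {
  # Expected when the ASG is already at MinSize. The service restart
  # (after the configured delay) brings the agent back.
  Write-Output "terminate-instance-in-auto-scaling-group failed; leaving the agent to restart"
}
```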