Ensure unhealthy Windows instances get marked correctly #614
Sometimes Windows instances fail to come up completely but are not marked unhealthy. It turns out the randomly generated password for the buildkite-agent user sometimes doesn't meet the password policy requirements, which causes the `bk-install-elastic-stack.ps1` script to fail. Normally that would trigger the `on_error` trap, which marks the instance unhealthy so that it gets replaced. Unfortunately, a bug in the `on_error` trap prevented it from getting the instance ID. This leaves the instance running indefinitely without a running buildkite-agent service, so jobs can wait indefinitely for an instance because the scaler lambda believes a healthy instance already exists.

This PR fixes the bug in the `on_error` trap so that a failure of the `bk-install-elastic-stack.ps1` script marks the instance unhealthy. It also adds a retry loop for randomizing the buildkite-agent user's password, giving it a few extra chances to get a valid password before failing.

The PR also fixes the CloudFormation template by putting quotes around `%v` for a Windows environment variable.
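
For reviewers who want the shape of the password retry loop without reading the PowerShell, here is a minimal sketch in Python. It is not the code from this PR. The names `meets_policy` and `generate_password`, the attempt limit of 5, and the "three of four character classes" check are all assumptions chosen to approximate a typical Windows complexity policy.

```python
import secrets
import string

# Assumed approximation of the password policy: at least 3 of these 4 classes.
CLASSES = (
    string.ascii_uppercase,
    string.ascii_lowercase,
    string.digits,
    string.punctuation,
)


def meets_policy(password, min_length=8, min_classes=3):
    """Return True if the password is long enough and mixes enough classes."""
    present = sum(any(ch in cls for ch in password) for cls in CLASSES)
    return len(password) >= min_length and present >= min_classes


def generate_password(length=32, max_attempts=5, generate=None):
    """Generate candidates until one passes the policy, or give up.

    `generate` is a hypothetical hook so the loop can be exercised with
    fixed candidates; by default a random candidate is drawn each time.
    """
    alphabet = "".join(CLASSES)
    if generate is None:
        def generate():
            return "".join(secrets.choice(alphabet) for _ in range(length))

    for _ in range(max_attempts):
        candidate = generate()
        if meets_policy(candidate):
            return candidate

    # Raising here is what lets the caller's error handler run, which in
    # the PR's case is the trap that marks the instance unhealthy.
    raise RuntimeError(f"no compliant password after {max_attempts} attempts")
```

The point of bounding the loop is that a persistent failure still surfaces as an error instead of retrying forever, so the instance can be marked unhealthy and replaced.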