-
Notifications
You must be signed in to change notification settings - Fork 179
kola: Bump default instance type to m5.large #1550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kola: Bump default instance type to m5.large #1550
Conversation
dustymabe
commented
Jun 19, 2020
this is a hacky solution but gets us unblocked. |
I was getting tests running fine but it was hanging at some point in teardown, are you not seeing that? |
I just kicked off a new round of tests that haven't completed yet. Going back and looking at my scrollback from earlier in the week it looks like the tests run fine but I do see some messages about getting the console output timing out:
Is that what you were seeing? |
This does appear to be a problem itself, though. As it appears the m5 instance types are much less reliable at giving back console output. After 5 minutes the console output still returns |
7d1623a
to
312f39c
Compare
added a commit that bumps the timeout to 10 minutes for retrieving the console logs. I have another stashed change that would make it wait to terminate an instance until it got some initial console output but I don't know beneficial that would be and it would also mean that our instances would stay up much longer than they currently do ($$$). diff --git a/mantle/platform/machine/aws/machine.go b/mantle/platform/machine/aws/machine.go
index cbab42e4..948a1ba1 100644
--- a/mantle/platform/machine/aws/machine.go
+++ b/mantle/platform/machine/aws/machine.go
@@ -77,7 +77,21 @@ func (am *machine) WaitForReboot(timeout time.Duration, oldBootId string) error
}
func (am *machine) Destroy() {
- origConsole, err := am.cluster.flight.api.GetConsoleOutput(am.ID())
+ // Wait until we get some console output before we terminate. Otherwise we
+ // may terminate and never get any console output for this machine.
+ var origConsole string
+ err := util.WaitUntilReady(5*time.Minute, 10*time.Second, func() (bool, error) {
+ var err error
+ origConsole, err = am.cluster.flight.api.GetConsoleOutput(am.ID())
+ if err != nil {
+ return false, err
+ }
+ if origConsole == "" {
+ plog.Debugf("waiting for pre-terminate console for %v", am.ID())
+ return false, nil
+ }
+ return true, nil
+ })
if err != nil {
plog.Warningf("Error retrieving console log for %v: %v", am.ID(), err)
} |
note that now that I'm properly able to retrieve console output it turns out the m5 instances are also showing the failure: coreos/fedora-coreos-tracker#507 (comment) |
mantle/platform/api/aws/network.go
Outdated
@@ -319,6 +319,11 @@ func (a *API) getSubnetID(vpc string) (string, error) { | |||
} | |||
for _, id := range subIds.Subnets { | |||
if id.SubnetId != nil { | |||
// Skip `us-east-1c` because m5 instances aren't available there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than hard coding a blacklist I'd prefer to see us lookup the type availability of the availability zone and cross reference that versus the requested type.
Only start instance types in availability zones that support that instance type. Otherwise we could get an error like: ``` Unsupported: Your requested instance type (m5.large) is not supported in your requested Availability Zone (us-east-1c). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1e, us-east-1f. ```
m4 is old and shouldn't be the default; this came up as part of coreos/fedora-coreos-tracker#507
The m5.large instance types seem to consistently take longer for their console logs to get updated.
312f39c
to
c715131
Compare
OK looked at this recently in the context of coreos/fedora-coreos-tracker#606 (comment). Moving to New set of commits:
@arithx I think I addressed the concern you had about availabilty zone checking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: arithx, cgwalters, dustymabe The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |