
kola: Bump default instance type to m5.large #1550


Merged
merged 3 commits into coreos:master from dusty-bump-to-m5large on Oct 21, 2020

Conversation

dustymabe
Member

commit 7d1623a8d36d3084e3dd5a0eef3112ae1aed8808
Author: Colin Walters <walters@verbum.org>
Date:   Wed Jun 3 16:47:48 2020 +0000

    kola: Bump default instance type to m5.large
    
    m4 is old and shouldn't be the default; this came up as
    part of https://github.com/coreos/fedora-coreos-tracker/issues/507

commit af4b81290a8ed57cfcecad1d5b6c80e8c1ef598c
Author: Dusty Mabe <dusty@dustymabe.com>
Date:   Fri Jun 19 14:11:46 2020 -0400

    mantle: platform: don't start instances in "us-east-1c"
    
    That availability zone doesn't support the m5.large instance type which
    we want to use now.

@dustymabe
Member Author

This is a hacky solution, but it gets us unblocked.

@cgwalters
Member

cgwalters commented Jun 19, 2020

I was getting tests running fine but it was hanging at some point in teardown, are you not seeing that?

@dustymabe
Member Author

I was getting tests running fine but it was hanging at some point in teardown, are you not seeing that?

I just kicked off a new round of tests that haven't completed yet. Going back through my scrollback from earlier in the week, it looks like the tests run fine, but I do see some messages about retrieving the console output timing out:

--- PASS: coreos.ignition.ssh.key (346.45s)                                                                                                                                                                                                                                                              
2020-06-16T15:05:50Z platform/machine/aws: Error saving console for instance i-0b603cac7d65daca5: retrieving console output of i-0b603cac7d65daca5: time limit exceeded

Is that what you were seeing?

This does appear to be a problem in itself, though: the m5 instance types seem to be much less reliable at returning console output. After 5 minutes, the console output still comes back as "" for a lot of the tests.
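
For context, the EC2 GetConsoleOutput API posts console output asynchronously and returns it base64-encoded, so an empty string just means nothing has been posted yet and the caller has to poll. Below is a minimal standalone sketch of that polling against aws-sdk-go, separate from kola's own helpers; the instance ID and timeout are placeholders.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// waitForConsoleOutput polls GetConsoleOutput until the instance has posted
// some console text or the deadline expires. EC2 posts console output
// asynchronously and returns it base64-encoded, so an empty Output field
// just means nothing is available yet.
func waitForConsoleOutput(client *ec2.EC2, instanceID string, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		out, err := client.GetConsoleOutput(&ec2.GetConsoleOutputInput{
			InstanceId: aws.String(instanceID),
		})
		if err != nil {
			return "", err
		}
		if out.Output != nil && *out.Output != "" {
			decoded, err := base64.StdEncoding.DecodeString(*out.Output)
			if err != nil {
				return "", err
			}
			return string(decoded), nil
		}
		if time.Now().After(deadline) {
			return "", fmt.Errorf("timed out waiting for console output of %s", instanceID)
		}
		time.Sleep(10 * time.Second)
	}
}

func main() {
	sess := session.Must(session.NewSession())
	// Placeholder instance ID; in kola this would be the machine under test.
	text, err := waitForConsoleOutput(ec2.New(sess), "i-0123456789abcdef0", 10*time.Minute)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```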

@dustymabe
Member Author

Added a commit that bumps the timeout for retrieving the console logs to 10 minutes. I have another stashed change that would make it wait to terminate an instance until it got some initial console output, but I don't know how beneficial that would be, and it would also mean that our instances would stay up much longer than they currently do ($$$).

diff --git a/mantle/platform/machine/aws/machine.go b/mantle/platform/machine/aws/machine.go
index cbab42e4..948a1ba1 100644
--- a/mantle/platform/machine/aws/machine.go
+++ b/mantle/platform/machine/aws/machine.go
@@ -77,7 +77,21 @@ func (am *machine) WaitForReboot(timeout time.Duration, oldBootId string) error
 }
 
 func (am *machine) Destroy() {
-       origConsole, err := am.cluster.flight.api.GetConsoleOutput(am.ID())
+       // Wait until we get some console output before we terminate. Otherwise we
+       // may terminate and never get any console output for this machine.
+       var origConsole string
+       err := util.WaitUntilReady(5*time.Minute, 10*time.Second, func() (bool, error) {
+               var err error
+               origConsole, err = am.cluster.flight.api.GetConsoleOutput(am.ID())
+               if err != nil {
+                       return false, err
+               }
+               if origConsole == "" {
+                       plog.Debugf("waiting for pre-terminate console for %v", am.ID())
+                       return false, nil
+               }
+               return true, nil
+       })
        if err != nil {
                plog.Warningf("Error retrieving console log for %v: %v", am.ID(), err)
        }

@dustymabe
Member Author

Note that, now that I'm properly able to retrieve console output, it turns out the m5 instances are also showing the failure: coreos/fedora-coreos-tracker#507 (comment)

@@ -319,6 +319,11 @@ func (a *API) getSubnetID(vpc string) (string, error) {
 	}
 	for _, id := range subIds.Subnets {
 		if id.SubnetId != nil {
+			// Skip `us-east-1c` because m5 instances aren't available there.
@arithx
Contributor

Rather than hard-coding a blacklist, I'd prefer to see us look up the instance-type availability of the availability zone and cross-reference that against the requested type.
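
For illustration, here is a minimal sketch of that suggested lookup, using the EC2 DescribeInstanceTypeOfferings API to list the availability zones in the current region that offer a given instance type. This only illustrates the idea; it is not necessarily how the final commit implements the check.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// zonesSupportingInstanceType returns the set of availability zones in the
// current region that offer the given instance type, discovered via the
// DescribeInstanceTypeOfferings API.
func zonesSupportingInstanceType(client *ec2.EC2, instanceType string) (map[string]bool, error) {
	zones := make(map[string]bool)
	input := &ec2.DescribeInstanceTypeOfferingsInput{
		LocationType: aws.String("availability-zone"),
		Filters: []*ec2.Filter{
			{
				Name:   aws.String("instance-type"),
				Values: []*string{aws.String(instanceType)},
			},
		},
	}
	// Page through the offerings; each result names an AZ ("Location") that
	// supports the requested type.
	err := client.DescribeInstanceTypeOfferingsPages(input,
		func(page *ec2.DescribeInstanceTypeOfferingsOutput, lastPage bool) bool {
			for _, offering := range page.InstanceTypeOfferings {
				if offering.Location != nil {
					zones[*offering.Location] = true
				}
			}
			return true
		})
	return zones, err
}

func main() {
	sess := session.Must(session.NewSession())
	zones, err := zonesSupportingInstanceType(ec2.New(sess), "m5.large")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(zones)
}
```

A subnet's availability zone could then be checked against this set before an instance is launched there, instead of skipping a hard-coded zone.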

dustymabe and others added 3 commits October 21, 2020 14:57
Only start instance types in availability zones that support that
instance type. Otherwise we could get an error like:

```
Unsupported: Your requested instance type (m5.large) is not supported
in your requested Availability Zone (us-east-1c). Please retry your
request by not specifying an Availability Zone or choosing us-east-1a,
us-east-1b, us-east-1d, us-east-1e, us-east-1f.
```
m4 is old and shouldn't be the default; this came up as
part of coreos/fedora-coreos-tracker#507
The m5.large instance types seem to consistently take longer for their
console logs to get updated.
@dustymabe dustymabe force-pushed the dusty-bump-to-m5large branch from 312f39c to c715131 on October 21, 2020 at 18:59
@dustymabe
Member Author

dustymabe commented Oct 21, 2020

OK, I looked at this recently in the context of coreos/fedora-coreos-tracker#606 (comment). Moving to m5.large does get us passing tests, so let's migrate.

New set of commits:

commit c7151313e4df2b925e6549f506556bbd8edf7bf5 (HEAD -> dusty-bump-to-m5large, origin/dusty-bump-to-m5large)
Author: Dusty Mabe <dusty@dustymabe.com>
Date:   Mon Jun 22 15:41:13 2020 -0400

    mantle: bump timeout for aws console
    
    The m5.large instance types seem to consistently take longer for their
    console logs to get updated.

commit 336366d0cf188b0c1d05b7b3cad822d7f0a7d8df
Author: Colin Walters <walters@verbum.org>
Date:   Wed Jun 3 16:47:48 2020 +0000

    kola: Bump default instance type to m5.large
    
    m4 is old and shouldn't be the default; this came up as
    part of https://github.com/coreos/fedora-coreos-tracker/issues/507

commit 33040e9fa2075496947de9e71a9eff8735e27039
Author: Dusty Mabe <dusty@dustymabe.com>
Date:   Fri Jun 19 14:11:46 2020 -0400

    mantle: platform: ec2: add instance-type availability zone check
    
    Only start instance types in availability zones that support that
    instance type. Otherwise we could get an error like:
    
    ```
    Unsupported: Your requested instance type (m5.large) is not supported
    in your requested Availability Zone (us-east-1c). Please retry your
    request by not specifying an Availability Zone or choosing us-east-1a,
    us-east-1b, us-east-1d, us-east-1e, us-east-1f.
    ```

@arithx I think I addressed the concern you had about availability zone checking.

@dustymabe dustymabe marked this pull request as ready for review October 21, 2020 19:10
Contributor

@arithx arithx left a comment


/approve

Member

@cgwalters cgwalters left a comment


/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arithx, cgwalters, dustymabe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [arithx,cgwalters,dustymabe]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 302bdda into coreos:master Oct 21, 2020