
kola: Bump default instance type to m5.large #1550


Merged
merged 3 commits into coreos:master from dusty-bump-to-m5large on Oct 21, 2020

Conversation

dustymabe
Member

commit 7d1623a8d36d3084e3dd5a0eef3112ae1aed8808
Author: Colin Walters <walters@verbum.org>
Date:   Wed Jun 3 16:47:48 2020 +0000

    kola: Bump default instance type to m5.large
    
    m4 is old and shouldn't be the default; this came up as
    part of https://github.com/coreos/fedora-coreos-tracker/issues/507

commit af4b81290a8ed57cfcecad1d5b6c80e8c1ef598c
Author: Dusty Mabe <dusty@dustymabe.com>
Date:   Fri Jun 19 14:11:46 2020 -0400

    mantle: platform: don't start instances in "us-east-1c"
    
    That availability zone doesn't support the m5.large instance type which
    we want to use now.

@dustymabe
Member Author

This is a hacky solution, but it gets us unblocked.

@cgwalters
Member

cgwalters commented Jun 19, 2020

I was getting tests running fine but it was hanging at some point in teardown, are you not seeing that?

@dustymabe
Member Author

I was getting tests running fine but it was hanging at some point in teardown, are you not seeing that?

I just kicked off a new round of tests that haven't completed yet. Going back through my scrollback from earlier in the week, it looks like the tests run fine, but I do see some messages about retrieving the console output timing out:

--- PASS: coreos.ignition.ssh.key (346.45s)                                                                                                                                                                                                                                                              
2020-06-16T15:05:50Z platform/machine/aws: Error saving console for instance i-0b603cac7d65daca5: retrieving console output of i-0b603cac7d65daca5: time limit exceeded

Is that what you were seeing?

This does appear to be a problem in itself, though: the m5 instance types seem to be much less reliable at returning console output. After 5 minutes, the console output still comes back as "" for a lot of the tests.
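
For context, the EC2 GetConsoleOutput API posts console output asynchronously and returns it base64-encoded, so an empty string just means nothing has been posted yet and the caller has to poll. Below is a minimal standalone sketch of that polling against aws-sdk-go, separate from kola's own helpers; the instance ID and timeout are placeholders.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// waitForConsoleOutput polls GetConsoleOutput until the instance has posted
// some console text or the deadline expires. EC2 posts console output
// asynchronously and returns it base64-encoded, so an empty Output field
// just means nothing is available yet.
func waitForConsoleOutput(client *ec2.EC2, instanceID string, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		out, err := client.GetConsoleOutput(&ec2.GetConsoleOutputInput{
			InstanceId: aws.String(instanceID),
		})
		if err != nil {
			return "", err
		}
		if out.Output != nil && *out.Output != "" {
			decoded, err := base64.StdEncoding.DecodeString(*out.Output)
			if err != nil {
				return "", err
			}
			return string(decoded), nil
		}
		if time.Now().After(deadline) {
			return "", fmt.Errorf("timed out waiting for console output of %s", instanceID)
		}
		time.Sleep(10 * time.Second)
	}
}

func main() {
	sess := session.Must(session.NewSession())
	// Placeholder instance ID; in kola this would be the machine under test.
	text, err := waitForConsoleOutput(ec2.New(sess), "i-0123456789abcdef0", 10*time.Minute)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```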

@dustymabe
Member Author

Added a commit that bumps the timeout for retrieving the console logs to 10 minutes. I have another stashed change that would make it wait to terminate an instance until it got some initial console output, but I don't know how beneficial that would be, and it would also mean that our instances would stay up much longer than they currently do ($$$).

diff --git a/mantle/platform/machine/aws/machine.go b/mantle/platform/machine/aws/machine.go
index cbab42e4..948a1ba1 100644
--- a/mantle/platform/machine/aws/machine.go
+++ b/mantle/platform/machine/aws/machine.go
@@ -77,7 +77,21 @@ func (am *machine) WaitForReboot(timeout time.Duration, oldBootId string) error
 }
 
 func (am *machine) Destroy() {
-       origConsole, err := am.cluster.flight.api.GetConsoleOutput(am.ID())
+       // Wait until we get some console output before we terminate. Otherwise we
+       // may terminate and never get any console output for this machine.
+       var origConsole string
+       err := util.WaitUntilReady(5*time.Minute, 10*time.Second, func() (bool, error) {
+               var err error
+               origConsole, err = am.cluster.flight.api.GetConsoleOutput(am.ID())
+               if err != nil {
+                       return false, err
+               }
+               if origConsole == "" {
+                       plog.Debugf("waiting for pre-terminate console for %v", am.ID())
+                       return false, nil
+               }
+               return true, nil
+       })
        if err != nil {
                plog.Warningf("Error retrieving console log for %v: %v", am.ID(), err)
        }

@dustymabe
Member Author

Note that, now that I'm properly able to retrieve console output, it turns out the m5 instances are also showing the failure: coreos/fedora-coreos-tracker#507 (comment)

@@ -319,6 +319,11 @@ func (a *API) getSubnetID(vpc string) (string, error) {
 	}
 	for _, id := range subIds.Subnets {
 		if id.SubnetId != nil {
+			// Skip `us-east-1c` because m5 instances aren't available there.
@arithx
Contributor

Rather than hard-coding a blacklist, I'd prefer to see us look up the instance-type availability of the availability zone and cross-reference that against the requested type.
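
For illustration, here is a minimal sketch of that suggested lookup, using the EC2 DescribeInstanceTypeOfferings API to list the availability zones in the current region that offer a given instance type. This only illustrates the idea; it is not necessarily how the final commit implements the check.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// zonesSupportingInstanceType returns the set of availability zones in the
// current region that offer the given instance type, discovered via the
// DescribeInstanceTypeOfferings API.
func zonesSupportingInstanceType(client *ec2.EC2, instanceType string) (map[string]bool, error) {
	zones := make(map[string]bool)
	input := &ec2.DescribeInstanceTypeOfferingsInput{
		LocationType: aws.String("availability-zone"),
		Filters: []*ec2.Filter{
			{
				Name:   aws.String("instance-type"),
				Values: []*string{aws.String(instanceType)},
			},
		},
	}
	// Page through the offerings; each result names an AZ ("Location") that
	// supports the requested type.
	err := client.DescribeInstanceTypeOfferingsPages(input,
		func(page *ec2.DescribeInstanceTypeOfferingsOutput, lastPage bool) bool {
			for _, offering := range page.InstanceTypeOfferings {
				if offering.Location != nil {
					zones[*offering.Location] = true
				}
			}
			return true
		})
	return zones, err
}

func main() {
	sess := session.Must(session.NewSession())
	zones, err := zonesSupportingInstanceType(ec2.New(sess), "m5.large")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(zones)
}
```

A subnet's availability zone could then be checked against this set before an instance is launched there, instead of skipping a hard-coded zone.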

dustymabe and others added 3 commits October 21, 2020 14:57
Only start instance types in availability zones that support that
instance type. Otherwise we could get an error like:

```
Unsupported: Your requested instance type (m5.large) is not supported
in your requested Availability Zone (us-east-1c). Please retry your
request by not specifying an Availability Zone or choosing us-east-1a,
us-east-1b, us-east-1d, us-east-1e, us-east-1f.
```
m4 is old and shouldn't be the default; this came up as
part of coreos/fedora-coreos-tracker#507
The m5.large instance types seem to consistently take longer for their
console logs to get updated.
@dustymabe dustymabe force-pushed the dusty-bump-to-m5large branch from 312f39c to c715131 on October 21, 2020 at 18:59
@dustymabe
Member Author

dustymabe commented Oct 21, 2020

OK, I looked at this recently in the context of coreos/fedora-coreos-tracker#606 (comment). Moving to m5.large does get us passing tests, so let's migrate.

New set of commits:

commit c7151313e4df2b925e6549f506556bbd8edf7bf5 (HEAD -> dusty-bump-to-m5large, origin/dusty-bump-to-m5large)
Author: Dusty Mabe <dusty@dustymabe.com>
Date:   Mon Jun 22 15:41:13 2020 -0400

    mantle: bump timeout for aws console
    
    The m5.large instance types seem to consistently take longer for their
    console logs to get updated.

commit 336366d0cf188b0c1d05b7b3cad822d7f0a7d8df
Author: Colin Walters <walters@verbum.org>
Date:   Wed Jun 3 16:47:48 2020 +0000

    kola: Bump default instance type to m5.large
    
    m4 is old and shouldn't be the default; this came up as
    part of https://github.com/coreos/fedora-coreos-tracker/issues/507

commit 33040e9fa2075496947de9e71a9eff8735e27039
Author: Dusty Mabe <dusty@dustymabe.com>
Date:   Fri Jun 19 14:11:46 2020 -0400

    mantle: platform: ec2: add instance-type availability zone check
    
    Only start instance types in availability zones that support that
    instance type. Otherwise we could get an error like:
    
    ```
    Unsupported: Your requested instance type (m5.large) is not supported
    in your requested Availability Zone (us-east-1c). Please retry your
    request by not specifying an Availability Zone or choosing us-east-1a,
    us-east-1b, us-east-1d, us-east-1e, us-east-1f.
    ```

@arithx I think I addressed the concern you had about availability zone checking.

@dustymabe dustymabe marked this pull request as ready for review October 21, 2020 19:10
Contributor

@arithx arithx left a comment


/approve

Member

@cgwalters cgwalters left a comment


/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arithx, cgwalters, dustymabe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [arithx,cgwalters,dustymabe]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 302bdda into coreos:master Oct 21, 2020