
kola: Bump default instance type to m5.large #1505

Closed
wants to merge 1 commit

Conversation

cgwalters
Member

m4 is old and shouldn't be the default; this came up as
part of coreos/fedora-coreos-tracker#507

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters


@cgwalters
Member Author

CI here doesn't cover this, but I did

walters@toolbox /s/w/b/fcos> kola run -p aws --aws-region us-east-1 --aws-ami ami-00848c06968a080dd --aws-profile openshift-dev basic
=== RUN   basic
=== RUN   basic/ReadOnly
=== RUN   basic/Useradd
=== RUN   basic/MachineID
=== RUN   basic/PortSSH
=== RUN   basic/DbusPerms
=== RUN   basic/NetworkScripts
=== RUN   basic/ServicesActive

locally and ...hm, kola seems to have hung at the end. We really need more verbose logging of what's going on at the IaaS level. Retrying.

@cgwalters
Member Author

@bgilbert do you know how to get more logging here offhand? I think this is something related to the flight teardown on aws.

@bgilbert
Contributor

bgilbert commented Jun 3, 2020

@cgwalters Not offhand.

@arithx
Contributor

arithx commented Jun 4, 2020

> @bgilbert do you know how to get more logging here offhand? I think this is something related to the flight teardown on aws.

We could add some additional logging statements inside of the platform/api code (and maybe look into whether there are verbose settings for the AWS library that we can pass through as well), but it would be difficult to determine which lines belong to which tests (the platform doesn't have knowledge of the individual test suite at the moment). One potential solution would be to pass loggers around to the individual API-level functions, but that'd require a fair bit of rework (and wouldn't work for the non-mantle API logging pieces).
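
For reference, a rough sketch of the "verbose settings for the AWS library" idea using aws-sdk-go's built-in debug logging; the helper name is made up and this isn't existing mantle code:

```go
package awsdebug

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// newVerboseEC2 builds an EC2 client that logs every HTTP request and
// response body through the SDK's own debug logging. Illustrative only;
// mantle would still need to plumb a setting like this through its AWS
// API wrapper and decide how to attribute the output to individual tests.
func newVerboseEC2(region string) (*ec2.EC2, error) {
	sess, err := session.NewSession(&aws.Config{
		Region:   aws.String(region),
		LogLevel: aws.LogLevel(aws.LogDebugWithHTTPBody),
	})
	if err != nil {
		return nil, err
	}
	return ec2.New(sess), nil
}
```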

@dustymabe
Member

I'm seeing this:

$ kola --build 32.20200615.dev.3 --output-dir tmp/kola run -p aws --aws-ami ami-095febd95fbe5b0b0 --aws-region us-east-1 -b fcos -j 5 --no-test-exit-error podman.base --blacklist-test fcos.internet --blacklist-test podman.workflow --aws-type m5.large
=== RUN   podman.base
--- FAIL: podman.base (1.07s)
        harness.go:842: Cluster failed starting machines: error running instances: Unsupported: Your requested instance type (m5.large) is not supported in your requested Availability Zone (us-east-1c). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1e, us-east-1f.
        status code: 400, request id: 9433aa03-7a05-49e0-80ca-ee219cafee94
FAIL, output in tmp/kola

Which is kind of weird. I don't see anywhere where we are setting it to us-east-1c.

@dustymabe
Member

I switched to us-east-2 and it passes:

[coreos-assembler]$ kola --build 32.20200615.dev.3 --output-dir tmp/kola run -p aws --aws-ami ami-0c1d26ba6e8183e0f --aws-region us-east-2 -b fcos -j 5 --no-test-exit-error podman.base --blacklist-test fcos.internet --blacklist-test podman.workflow --aws-type m5.large
=== RUN   podman.base
=== RUN   podman.base/info
=== RUN   podman.base/resources
2020-06-15T22:22:44Z platform/machine/aws: Error saving console for instance i-03dc573628cfbd697: retrieving console output of i-03dc573628cfbd697: time limit exceeded
--- PASS: podman.base (385.65s)
    --- PASS: podman.base/info (8.54s)
    --- PASS: podman.base/resources (13.18s)
            cluster.go:141: Getting image source signatures
            cluster.go:141: Copying blob sha256:64a13bc5735d834fa4c48be79a76dd097a4fa4de25a97dc5ed46ac5970da4462
            cluster.go:141: Copying config sha256:49f46bdee2ae71ba0ba12f8ff1ab6d444fb8ae97ed5e9332d1d008278d2beb58
            cluster.go:141: Writing manifest to image destination
            cluster.go:141: Storing signatures
            cluster.go:141: Your kernel does not support Block I/O weight or the cgroup is not mounted. Weight discarded.
PASS, output in tmp/kola

@arithx
Contributor

arithx commented Jun 15, 2020

> Which is kind of weird. I don't see anywhere where we are setting it to us-east-1c.

Unless otherwise specified it will randomly choose availability zones for the given region. Seems like not all availability zones in us-east-1 have m5.large available in them.

@dustymabe
Member

> Which is kind of weird. I don't see anywhere where we are setting it to us-east-1c.

> Unless otherwise specified it will randomly choose availability zones for the given region. Seems like not all availability zones in us-east-1 have m5.large available in them.

Our code specifies the subnet to use which implies an availability zone. It turns out the code that picks which subnet to use consistently picks the same one (in my case for the fedora community AWS account it is us-east-1c). I was able to hack it up to force it to a different one for now, though not sure what we should do in the long run.

@cgwalters
Member Author

us-east-1 is the first AWS region and has a lot of older hardware.

@dustymabe
Member

dustymabe commented Jun 16, 2020

> Which is kind of weird. I don't see anywhere where we are setting it to us-east-1c.

> Unless otherwise specified it will randomly choose availability zones for the given region. Seems like not all availability zones in us-east-1 have m5.large available in them.

> Our code specifies the subnet to use which implies an availability zone. It turns out the code that picks which subnet to use consistently picks the same one (in my case for the fedora community AWS account it is us-east-1c). I was able to hack it up to force it to a different one for now, though not sure what we should do in the long run.

This is a long way of me saying that our code is making it so that the availability zone isn't randomly chosen across invocations on the same account. It might be random for different accounts, but if you stay in the same account and same region it looks to me (from my experience today) like you get the same availability zone every time.

@arithx
Contributor

arithx commented Jun 16, 2020

> Which is kind of weird. I don't see anywhere where we are setting it to us-east-1c.

> Unless otherwise specified it will randomly choose availability zones for the given region. Seems like not all availability zones in us-east-1 have m5.large available in them.

> Our code specifies the subnet to use which implies an availability zone. It turns out the code that picks which subnet to use consistently picks the same one (in my case for the fedora community AWS account it is us-east-1c). I was able to hack it up to force it to a different one for now, though not sure what we should do in the long run.

> This is a long way of me saying that our code is making it so that the availability zone isn't randomly chosen across invocations on the same account. It might be random for different accounts, but if you stay in the same account and same region it looks to me (from my experience today) like you get the same availability zone every time.

Ah yeah; at a high level it looks like we might just be able to randomly shuffle the list, but I'd have to test that assumption to make sure. We should already be creating subnets in every availability zone (that was available at the time of creation) in createSubnets.
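
A minimal sketch of that shuffle, assuming the discovered subnets are already in a slice; the function name and where it would hook into mantle's subnet selection are assumptions, not the actual code:

```go
package aws

import (
	"math/rand"
	"time"

	"github.com/aws/aws-sdk-go/service/ec2"
)

func init() {
	// Seed once per process; otherwise math/rand yields the same order on
	// every kola invocation, which is exactly the stickiness described above.
	rand.Seed(time.Now().UnixNano())
}

// shuffleSubnets returns a randomly ordered copy of the discovered subnets so
// that instance launches don't keep landing in the same availability zone.
func shuffleSubnets(subnets []*ec2.Subnet) []*ec2.Subnet {
	shuffled := make([]*ec2.Subnet, len(subnets))
	copy(shuffled, subnets)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return shuffled
}
```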

That does still leave us in a weird spot, though, with some availability zones having differing hardware. I'm not sure I have a good solution for that one; I guess we could look up the availability zone that was chosen and query the available resources (there's a DescribeInstanceTypeOfferings API call we can use [I can't link it because you can only view their EC2 API as raw text on GitHub]).
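
Untested, but a query against that API could look roughly like this with aws-sdk-go (pagination and error context elided; the helper name is made up):

```go
package aws

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// zonesOffering asks EC2 which availability zones in the client's region
// actually offer the given instance type (e.g. "m5.large").
func zonesOffering(client *ec2.EC2, instanceType string) ([]string, error) {
	out, err := client.DescribeInstanceTypeOfferings(&ec2.DescribeInstanceTypeOfferingsInput{
		LocationType: aws.String("availability-zone"),
		Filters: []*ec2.Filter{{
			Name:   aws.String("instance-type"),
			Values: []*string{aws.String(instanceType)},
		}},
	})
	if err != nil {
		return nil, err
	}
	var zones []string
	for _, offering := range out.InstanceTypeOfferings {
		zones = append(zones, aws.StringValue(offering.Location))
	}
	return zones, nil
}
```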

@dustymabe
Member

My hacky solution to the problem that at least gets us unblocked: #1550

cgwalters closed this Jun 19, 2020
@cgwalters
Member Author

Closing in favor of that
