
kola: Bump default instance type to m5.large #1505

Closed
wants to merge 1 commit

Conversation

cgwalters
Member

m4 is old and shouldn't be the default; this came up as
part of coreos/fedora-coreos-tracker#507

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters


@cgwalters
Member Author

CI here doesn't cover this, but I did

walters@toolbox /s/w/b/fcos> kola run -p aws --aws-region us-east-1 --aws-ami ami-00848c06968a080dd --aws-profile openshift-dev basic
=== RUN   basic
=== RUN   basic/ReadOnly
=== RUN   basic/Useradd
=== RUN   basic/MachineID
=== RUN   basic/PortSSH
=== RUN   basic/DbusPerms
=== RUN   basic/NetworkScripts
=== RUN   basic/ServicesActive

locally and ...hm, kola seems to have hung at the end. We really need more verbose logging of what's going on at the IaaS level. Retrying.

@cgwalters
Member Author

@bgilbert do you know how to get more logging here offhand? I think this is something related to the flight teardown on aws.

@bgilbert
Contributor

bgilbert commented Jun 3, 2020

@cgwalters Not offhand.

@arithx
Contributor

arithx commented Jun 4, 2020

> @bgilbert do you know how to get more logging here offhand? I think this is something related to the flight teardown on aws.

We could add some additional logging statements inside of the platform/api code (and maybe look into whether there are verbose settings for the AWS library that we can pass through as well), but it would be difficult to determine which lines belong to which tests (the platform doesn't have knowledge of the individual test suite at the moment). One potential solution would be to pass loggers around to the individual API-level functions, but that'd require a fair bit of rework (and wouldn't work for the non-mantle API logging pieces).
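
For reference, a rough sketch of the "verbose settings for the AWS library" idea using aws-sdk-go's built-in debug logging; the helper name is made up and this isn't existing mantle code:

```go
package awsdebug

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// newVerboseEC2 builds an EC2 client that logs every HTTP request and
// response body through the SDK's own debug logging. Illustrative only;
// mantle would still need to plumb a setting like this through its AWS
// API wrapper and decide how to attribute the output to individual tests.
func newVerboseEC2(region string) (*ec2.EC2, error) {
	sess, err := session.NewSession(&aws.Config{
		Region:   aws.String(region),
		LogLevel: aws.LogLevel(aws.LogDebugWithHTTPBody),
	})
	if err != nil {
		return nil, err
	}
	return ec2.New(sess), nil
}
```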

@dustymabe
Member

I'm seeing this:

$ kola --build 32.20200615.dev.3 --output-dir tmp/kola run -p aws --aws-ami ami-095febd95fbe5b0b0 --aws-region us-east-1 -b fcos -j 5 --no-test-exit-error podman.base --blacklist-test fcos.internet --blacklist-test podman.workflow --aws-type m5.large
=== RUN   podman.base
--- FAIL: podman.base (1.07s)
        harness.go:842: Cluster failed starting machines: error running instances: Unsupported: Your requested instance type (m5.large) is not supported in your requested Availability Zone (us-east-1c). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1e, us-east-1f.
        status code: 400, request id: 9433aa03-7a05-49e0-80ca-ee219cafee94
FAIL, output in tmp/kola

Which is kind of weird. I don't see anywhere where we are setting it to us-east-1c.

@dustymabe
Member

I switched to us-east-2 and it passes:

[coreos-assembler]$ kola --build 32.20200615.dev.3 --output-dir tmp/kola run -p aws --aws-ami ami-0c1d26ba6e8183e0f --aws-region us-east-2 -b fcos -j 5 --no-test-exit-error podman.base --blacklist-test fcos.internet --blacklist-test podman.workflow --aws-type m5.large
=== RUN   podman.base
=== RUN   podman.base/info
=== RUN   podman.base/resources
2020-06-15T22:22:44Z platform/machine/aws: Error saving console for instance i-03dc573628cfbd697: retrieving console output of i-03dc573628cfbd697: time limit exceeded
--- PASS: podman.base (385.65s)
    --- PASS: podman.base/info (8.54s)
    --- PASS: podman.base/resources (13.18s)
            cluster.go:141: Getting image source signatures
            cluster.go:141: Copying blob sha256:64a13bc5735d834fa4c48be79a76dd097a4fa4de25a97dc5ed46ac5970da4462
            cluster.go:141: Copying config sha256:49f46bdee2ae71ba0ba12f8ff1ab6d444fb8ae97ed5e9332d1d008278d2beb58
            cluster.go:141: Writing manifest to image destination
            cluster.go:141: Storing signatures
            cluster.go:141: Your kernel does not support Block I/O weight or the cgroup is not mounted. Weight discarded.
PASS, output in tmp/kola

@arithx
Contributor

arithx commented Jun 15, 2020

> Which is kind of weird. I don't see anywhere where we are setting it to us-east-1c.

Unless otherwise specified it will randomly choose availability zones for the given region. Seems like not all availability zones in us-east-1 have m5.large available in them.

@dustymabe
Member

> Which is kind of weird. I don't see anywhere where we are setting it to us-east-1c.

> Unless otherwise specified it will randomly choose availability zones for the given region. Seems like not all availability zones in us-east-1 have m5.large available in them.

Our code specifies the subnet to use which implies an availability zone. It turns out the code that picks which subnet to use consistently picks the same one (in my case for the fedora community AWS account it is us-east-1c). I was able to hack it up to force it to a different one for now, though not sure what we should do in the long run.

@cgwalters
Member Author

us-east-1 is the first AWS region and has a lot of older hardware.

@dustymabe
Member

dustymabe commented Jun 16, 2020

> Which is kind of weird. I don't see anywhere where we are setting it to us-east-1c.

> Unless otherwise specified it will randomly choose availability zones for the given region. Seems like not all availability zones in us-east-1 have m5.large available in them.

> Our code specifies the subnet to use which implies an availability zone. It turns out the code that picks which subnet to use consistently picks the same one (in my case for the fedora community AWS account it is us-east-1c). I was able to hack it up to force it to a different one for now, though not sure what we should do in the long run.

This is a long way of me saying that our code is making it so that the availability zone isn't randomly chosen across invocations on the same account. It might be random for different accounts, but if you stay in the same account and same region it looks to me (from my experience today) like you get the same availability zone every time.

@arithx
Contributor

arithx commented Jun 16, 2020

> Which is kind of weird. I don't see anywhere where we are setting it to us-east-1c.

> Unless otherwise specified it will randomly choose availability zones for the given region. Seems like not all availability zones in us-east-1 have m5.large available in them.

> Our code specifies the subnet to use which implies an availability zone. It turns out the code that picks which subnet to use consistently picks the same one (in my case for the fedora community AWS account it is us-east-1c). I was able to hack it up to force it to a different one for now, though not sure what we should do in the long run.

> This is a long way of me saying that our code is making it so that the availability zone isn't randomly chosen across invocations on the same account. It might be random for different accounts, but if you stay in the same account and same region it looks to me (from my experience today) like you get the same availability zone every time.

Ah yeah; at a high level it looks like we might just be able to randomly shuffle the list, but I'd have to test that assumption to make sure. We should already be creating subnets in every availability zone (that was available at the time of creation) in createSubnets.
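
A minimal sketch of that shuffle, assuming the discovered subnets are already in a slice; the function name and where it would hook into mantle's subnet selection are assumptions, not the actual code:

```go
package aws

import (
	"math/rand"
	"time"

	"github.com/aws/aws-sdk-go/service/ec2"
)

func init() {
	// Seed once per process; otherwise math/rand yields the same order on
	// every kola invocation, which is exactly the stickiness described above.
	rand.Seed(time.Now().UnixNano())
}

// shuffleSubnets returns a randomly ordered copy of the discovered subnets so
// that instance launches don't keep landing in the same availability zone.
func shuffleSubnets(subnets []*ec2.Subnet) []*ec2.Subnet {
	shuffled := make([]*ec2.Subnet, len(subnets))
	copy(shuffled, subnets)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return shuffled
}
```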

That does still leave us in a weird spot, though, with some availability zones having differing hardware. I'm not sure I have a good solution for that one; I guess we could look up the availability zone that was chosen and query the available resources (there's a DescribeInstanceTypeOfferings API call we can use [I can't link it because you can only view their EC2 API as raw text on GitHub]).
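
Untested, but a query against that API could look roughly like this with aws-sdk-go (pagination and error context elided; the helper name is made up):

```go
package aws

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// zonesOffering asks EC2 which availability zones in the client's region
// actually offer the given instance type (e.g. "m5.large").
func zonesOffering(client *ec2.EC2, instanceType string) ([]string, error) {
	out, err := client.DescribeInstanceTypeOfferings(&ec2.DescribeInstanceTypeOfferingsInput{
		LocationType: aws.String("availability-zone"),
		Filters: []*ec2.Filter{{
			Name:   aws.String("instance-type"),
			Values: []*string{aws.String(instanceType)},
		}},
	})
	if err != nil {
		return nil, err
	}
	var zones []string
	for _, offering := range out.InstanceTypeOfferings {
		zones = append(zones, aws.StringValue(offering.Location))
	}
	return zones, nil
}
```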

@dustymabe
Member

My hacky solution to the problem that at least gets us unblocked: #1550

cgwalters closed this Jun 19, 2020
@cgwalters
Member Author

Closing in favor of that
