clusterctl upgrade Timed out waiting for all Machines to exist #11209

Closed
cahillsf opened this issue Sep 20, 2024 · 17 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@cahillsf (Member)

Which jobs are flaking?

periodic-cluster-api-e2e-main

Which tests are flaking?

When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers

Specifically, looking at this pattern:

  • 36 failures: Timed out waiting for all Machines to exist

    • Component: unknown
    • Branches:
      • main
      • release-1.8
      • release-1.7

Since when has it been flaking?

For quite some time.

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main

Reason for failure (if possible)

TL;DR: there appears to be an issue in the Docker (CAPD) controller when creating the worker machines, with many logs like:
dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state"
The failures follow different patterns.

I think increasing the `clusterctl-upgrade/wait-worker-nodes` e2e interval is a good first step: some of the examples below show the DockerMachine creation retries eventually resolving, but only after the test has already timed out.
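
For reference, a rough sketch of what that bump could look like in the e2e config, assuming the usual ["<timeout>", "<poll interval>"] format of the intervals section; the file path and the concrete values below are illustrative placeholders, not the repo's current settings:

# e2e config (illustrative path and values only)
intervals:
  # first value is the overall timeout, second is the polling interval
  clusterctl-upgrade/wait-worker-nodes: ["20m", "10s"]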

The errors don't provide much info, but since the issue seems to stem from the Docker container runtime's RunContainer call:

err = containerRuntime.RunContainer(ctx, runOptions, nil)

we could explore passing in an output Writer as the third parameter to get output directly from the container.
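
A minimal sketch of that idea, assuming the third parameter is an io.Writer that receives the container's output; the interface and function names below are stand-ins for illustration, not CAPD's actual types:

package docker

import (
    "bytes"
    "context"
    "fmt"
    "io"
)

// containerRunner loosely mirrors the RunContainer call above; the real
// CAPD container runtime interface may differ, this is only a sketch.
type containerRunner interface {
    RunContainer(ctx context.Context, options any, output io.Writer) error
}

// runWithCapturedOutput passes a buffer instead of nil so that, when
// provisioning fails, the container runtime's output ends up in the
// returned error instead of being discarded.
func runWithCapturedOutput(ctx context.Context, rt containerRunner, options any) error {
    var out bytes.Buffer
    if err := rt.RunContainer(ctx, options, &out); err != nil {
        return fmt.Errorf("failed to run container: %w (output: %s)", err, out.String())
    }
    return nil
}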

Example failure 1 (expected 2, found 1)

(When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers) prow link

CAPD controller logs show:

I0831 03:46:04.834929 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-q53g9g" "docker-cluster"="clusterctl-upgrade-workload-q53g9g" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-q53g9g-md-0-lfnm6"} "machine"="clusterctl-upgrade-workload-q53g9g-md-0-5fb876f4bc-cq8k6"

In this case the CP machine is fully initialized, but the MD worker machine status shows:

status:
  conditions:
  - lastTransitionTime: "2024-08-31T03:46:07Z"
    message: 0 of 2 completed
    reason: ContainerProvisioningFailed
    severity: Warning
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-08-31T03:46:07Z"
    message: Re-provisioning
    reason: ContainerProvisioningFailed
    severity: Warning
    status: "False"
    type: ContainerProvisioned

Example failure 2 (expected 2, found 1)

(When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers ) prow link

CAPD controller logs show:

I0831 04:16:20.583537 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-md-0-sr89b"} "machine"="clusterctl-upgrade-workload-merjuc-md-0-856f7cbb7c-fjfdl"

In this failed test, we also see several failed creations of the control plane machine:
I0831 04:12:46.050065 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"

Most have the same error; the other CP machine creation failure looks like this (exit status 125 means the docker run command itself failed, i.e. the error came from Docker rather than from the container's process):
I0831 04:13:25.895170 1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: command \"docker run --detach --tty --privileged --security-opt seccomp=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro --hostname clusterctl-upgrade-workload-merjuc-control-plane-kw497 --network kind --name clusterctl-upgrade-workload-merjuc-control-plane-kw497 --label io.x-k8s.kind.cluster=clusterctl-upgrade-workload-merjuc --label io.x-k8s.kind.role=control-plane --expose 44139 --volume=/var/run/docker.sock:/var/run/docker.sock --publish=127.0.0.1:44139:6443/TCP kindest/node:v1.22.17@sha256:9af784f45a584f6b28bce2af84c494d947a05bd709151466489008f80a9ce9d5\" failed with error: exit status 125, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"

These appear to resolve, and the provider ID gets set on the CP node:
I0831 04:15:55.173110 1 machine.go:337] controllers/DockerMachine/DockerMachine-controller "msg"="Setting Kubernetes node providerID" "cluster"="clusterctl-upgrade-workload-merjuc" "docker-cluster"="clusterctl-upgrade-workload-merjuc" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-merjuc-control-plane-76jpw"} "machine"="clusterctl-upgrade-workload-merjuc-control-plane-kw497"

In this case, the MD DockerMachine is still Waiting for BootstrapData.

The CP DockerMachine transitioned to Ready only 12 seconds before the test timed out:

  conditions:
  - lastTransitionTime: "2024-08-31T04:15:56Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-08-31T04:15:56Z"
    status: "True"
    type: BootstrapExecSucceeded

Example failure 3 (expected 2, found 0)

[It] When testing clusterctl upgrades (v0.3=>v1.5=>current) Should create a management cluster and then upgrade all the providers : prow link

In this case, it seems that the MD is waiting for the control plane (https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1829446304858640384/artifacts/clusters/clusterctl-upgrade-management-hywi70/resources/clusterctl-upgrade/DockerMachine/clusterctl-upgrade-workload-66mpos-md-0-ltqf6.yaml):

status:
  conditions:
  - lastTransitionTime: "2024-08-30T10:01:44Z"
    message: 0 of 2 completed
    reason: WaitingForControlPlaneAvailable
    severity: Info
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-08-30T10:01:44Z"
    reason: WaitingForControlPlaneAvailable
    severity: Info
    status: "False"
    type: ContainerProvisioned

It looks like all of the "failed to create worker DockerMachine" errors in the CAPD controller logs in this case relate to the control plane: https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1829446304858640384/artifacts/clusters/clusterctl-upgrade-management-hywi70/logs/capd-system/capd-controller-manager/capd-controller-manager-fb4b578f9-rzpq9/manager.log:

I0830 10:04:39.240063       1 dockermachine_controller.go:220] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: timed out waiting for the condition, cleaning up so we can re-provision from a clean state" "cluster"="clusterctl-upgrade-workload-66mpos" "docker-cluster"="clusterctl-upgrade-workload-66mpos" "docker-machine"={"Namespace":"clusterctl-upgrade","Name":"clusterctl-upgrade-workload-66mpos-control-plane-jbbxg"} "machine"="clusterctl-upgrade-workload-66mpos-control-plane-8kvwp" 
It does seem that the CP machine eventually comes up, but it is still bootstrapping when the test times out at 10:06:52.444:
status:
  conditions:
  - lastTransitionTime: "2024-08-30T10:06:52Z"
    message: 1 of 2 completed
    reason: Bootstrapping
    severity: Info
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-08-30T10:06:52Z"
    reason: Bootstrapping
    severity: Info
    status: "False"
    type: BootstrapExecSucceeded
  - lastTransitionTime: "2024-08-30T10:06:52Z"
    status: "True"
    type: ContainerProvisioned

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 20, 2024
@sbueringer (Member) commented Sep 20, 2024

Am I understanding this correctly that "cleaning up so we can re-provision from a clean state" is coming from Docker?

To be honest, I'm not sure if we should just increase the timeout and then accept that we can only create Machines after multiple tries. This seems like just ignoring the root cause.

I would prefer if we could try to correlate this #11133 (comment) with the time when the number of flakes increased. If there is a correlation I would like to try to revert the concurrency change first.

I suspect that this is the root cause of many of our issues. If that's correct, then it would be a lot better to fix that root cause than to try to get flakes green by retrying more often (or giving more time for retries).

Basically we already know that we encounter this issue frequently:

Failed to create control group inotify object: Too many open files

I think this can fail our tests in a lot of places.
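
For context, this is the failure mode kind documents under its "too many open files" known issue; the usual mitigation there is raising the host's inotify limits (whether that is feasible on the Prow build clusters is a separate question):

# values suggested in kind's known-issues documentation
sysctl fs.inotify.max_user_watches=524288
sysctl fs.inotify.max_user_instances=512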

@cahillsf (Member, Author)

Am I understanding this correctly that "cleaning up so we can re-provision from a clean state" is coming from Docker?

Yes, from what I can tell.

To be honest, I'm not sure if we should just increase the timeout and then accept that we can only create Machines after multiple tries. This seems like just trying ignoring the root cause.

I would prefer if we could try to correlate this #11133 (comment) with the time when the number of flakes increased. If there is a correlation I would like to try to revert the concurrency change first.

Yeah, sure, that sounds reasonable to me; I'll review again and try to pinpoint when this started happening (closed the timeout increase PR).

@sbueringer (Member)

If we don't have clear data I would be definitely also fine with rolling back the concurrency change and then checking if that significantly improves the flake situation.

@cahillsf (Member, Author)

If we don't have clear data I would be definitely also fine with rolling back the concurrency change and then checking if that significantly improves the flake situation.

This particular message appears to predate the concurrency increase. It seems like it was present but quite rare up until 2024-07-31, when it began to occur more frequently: https://storage.googleapis.com/k8s-triage/index.html?date=2024-08-07&text=Timed%20out%20waiting%20for%20all%20Machines%20to%20exist&job=.*periodic-cluster-api-e2e.*&test=.*clusterctl%20upgrades

@sbueringer (Member)

So it happened before, but way more often since we increased concurrency?

@cahillsf (Member, Author)

So it happened before, but way more often since we increased concurrency?

The trend seems to support it, but maybe not directly?

Going back to 2024-08-19, before the concurrency change:
2 clusters of 8 failures out of 137618 builds from 8/4/2024, 8:00:04 PM to 8/19/2024, 3:39:37 PM. (0.005% frequency)

Moving to 2024-09-12 (a week after the concurrency change was backported to release-1.8):
2 clusters of 46 failures (7 in last day) out of 141326 builds from 8/28/2024, 8:00:25 PM to 9/12/2024, 4:09:32 PM. (0.03% frequency)

A bit further out (the whole lookback window now encapsulates tests after the concurrency change), 2024-09-20:
2 clusters of 71 failures (8 in last day) out of 221929 builds from 9/5/2024, 8:00:29 PM to 9/20/2024, 3:59:38 PM. (0.03% frequency)

And today, 2024-09-24, we have:
2 clusters of 116 failures (13 in last day) out of 192878 builds from 9/9/2024, 8:00:27 PM to 9/24/2024, 8:01:45 AM. (0.06% frequency)

@sbueringer (Member)

I'm not sure if there were also some infra migrations that might make the whole thing a bit fuzzy.

What do you think about reverting concurrency on main & release-1.8 for a week and then check if that gives us clear data?

@cahillsf (Member, Author)

What do you think about reverting concurrency on main & release-1.8 for a week and then check if that gives us clear data?

sounds good to me

@sbueringer (Member)

You'll open the PRs, I approve? :)

@sbueringer (Member)

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 26, 2024
@sbueringer sbueringer added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Sep 26, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Sep 26, 2024
@sbueringer (Member) commented Sep 27, 2024

@cahillsf So far the results seem to be pretty clear:

[screenshot: k8s-triage flake results]

https://storage.googleapis.com/k8s-triage/index.html?job=.*periodic-cluster-api-e2e

(even slightly better when release-1.6, which we didn't touch, is excluded)

@cahillsf (Member, Author)

Yes, they do indeed! Is there some other follow-up work we can pursue to reintroduce the greater concurrency without these negative impacts, or are we kind of "stuck" with 3 nodes?

I guess before introducing more variables, it might be worth it to "refresh" our understanding of the flakes that exist now that we have a cleaner signal, to see if some of them might be actual bugs versus side effects of resource constraints in the test infra. What do you think is the best next step?

@sbueringer (Member) commented Sep 27, 2024

I would keep the concurrency as is for now. Let's see what flakes remain and then try to fix them.

I'm also not sure if we can do much to increase the concurrency (maybe we can slightly increase it without getting new flakes but not much). It seems like we're simply hitting limits of the infrastructure when we run too much in parallel (IIRC we're not even close to hitting CPU limits though).

In general, we really need pretty stable CI the next 1-2 months until code freeze. We have a few critical things upcoming that could lead to some flakes that might not be easy to spot (so we need good visibility for new flakes :)).

@cahillsf (Member, Author)

@sbueringer Taking another look at this message specifically, it seems we have had 0 occurrences of it since the concurrency was dropped back down: https://storage.googleapis.com/k8s-triage/index.html?date=2024-09-30&text=Timed%20out%20waiting%20for%20all%20Machines%20to%20exist&job=.*periodic-cluster-api-e2e.*&test=.*clusterctl%20upgrades

[screenshot taken 2024-09-30 of the k8s-triage results]

Not sure if you want to give it some more time, or if we can just close this issue?

@sbueringer (Member)

Fine for me to close

@cahillsf (Member, Author) commented Oct 1, 2024

Fine for me to close

cool -- thanks for all your help on this 🙇

/close

@k8s-ci-robot (Contributor)

@cahillsf: Closing this issue.

In response to this:

Fine for me to close

cool -- thanks for all your help on this 🙇

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
