
flakes in clusterctl upgrade tests #11133

Open
cahillsf opened this issue Sep 3, 2024 · 9 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@cahillsf
Member

cahillsf commented Sep 3, 2024

summarized by @chrischdi 🙇


According to the aggregated failures of the last two weeks, we still have some flakiness in our clusterctl upgrade tests.

  • 36 failures: Timed out waiting for all Machines to exist (split off into: clusterctl upgrade Timed out waiting for all Machines to exist #11209)

  • 16 Failures: Failed to create kind cluster

    • Component: e2e setup
    • Branches:
      • main
      • release-1.7
  • 14 Failures: Internal error occurred: failed calling webhook [...] connect: connection refused

    • Component: CAPD
    • Branches:
      • main
      • release-1.8
  • 7 Failures: x509: certificate signed by unknown authority

    • Component: unknown
    • Branches:
      • main
      • release-1.8
      • release-1.7
  • 5 Failures: Timed out waiting for Machine Deployment clusterctl-upgrade/clusterctl-upgrade-workload-... to have 2 replicas

    • Component: unknown
    • Branches:
      • release-1.8
      • main
  • 2 Failures: Timed out waiting for Cluster clusterctl-upgrade/clusterctl-upgrade-workload-... to provision

    • Component: unknown
    • Branches:
      • release-1.8
      • main

Link to check whether messages changed or we have new flakes in clusterctl upgrade tests: here

/kind flake

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 3, 2024
@tormath1

tormath1 commented Sep 4, 2024

I'll have a look at the "Failed to create kind cluster" issue, as I already noticed something similar on my own Kind setup and I think it's not isolated: kubernetes-sigs/kind#3554 - I guess it's something to fix upstream.

EDIT: It seems to be an issue with inodes:

$ kind create cluster --retain --name=cluster3
Creating cluster "cluster3" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✗ Preparing nodes 📦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
$ podman logs -f 7eb0838e6bb2
...
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 12 (bookworm)!

Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
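
For local reproduction, a quick check of the inotify limits can tell whether this is the exhaustion case described on kind's known-issues page; the values below are the ones suggested there, so treat this as a sketch of the local workaround rather than a confirmed fix for the CI flake:

# check the current limits (distro defaults are often too low for several kind nodes)
$ sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
# raise them to the values suggested in kind's known-issues docs
$ sudo sysctl fs.inotify.max_user_watches=524288
$ sudo sysctl fs.inotify.max_user_instances=512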

@chrischdi
Member

Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files

That sounds very suspicious with regard to
https://main.cluster-api.sigs.k8s.io/user/troubleshooting.html?highlight=sysctl#cluster-api-with-docker----too-many-open-files

Maybe a good start here would be to collect data about the actually used values :-)
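
A minimal sketch of what collecting that data could look like on an affected node (the inotify-instance count via /proc fd symlinks is a generic Linux trick, not something specific to our jobs):

# configured limits
$ cat /proc/sys/fs/inotify/max_user_instances /proc/sys/fs/inotify/max_user_watches
# rough count of inotify instances currently in use across all processes
$ sudo find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l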

@BenTheElder
Member

"cluster": "eks-prow-build-cluster",

I don't know if we're running https://github.com/kubernetes/k8s.io/blob/3f2c06a3c547765e21dce65d0adcb1144a93b518/infra/aws/terraform/prow-build-cluster/resources/kube-system/tune-sysctls_daemonset.yaml#L4 there or not

Also perhaps something else on the cluster is using a lot of them.

@ameukam
Member

ameukam commented Sep 4, 2024

I confirm the daemonset runs on the EKS cluster.
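
For future triage, a quick way to double-check this and what it applies (assuming the DaemonSet keeps the name tune-sysctls from the linked manifest; adjust if it differs):

# assumes the DaemonSet name matches the linked manifest
$ kubectl -n kube-system get ds tune-sysctls -o wide
# spot-check one of its pods; what it logs depends on the script in the manifest
$ kubectl -n kube-system logs ds/tune-sysctls --tail=20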

@fabriziopandini fabriziopandini added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 5, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Sep 5, 2024
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Sep 5, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 5, 2024
@fabriziopandini fabriziopandini added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 5, 2024
@tormath1

tormath1 commented Sep 5, 2024

Thanks folks for confirming that the daemonset is correctly setting the sysctl parameters - so the error might be elsewhere. I noticed something else while reading the logs [1] of a failing test:

$ cat journal.log | grep -i "Journal started"
Aug 30 06:35:27 clusterctl-upgrade-management-fba3o1-control-plane systemd-journald[95]: Journal started
$ cat journal.log | grep -i "multi-user"
Aug 30 06:35:51 clusterctl-upgrade-management-fba3o1-control-plane systemd[1]: Reached target multi-user.target - Multi-User System.

While on a non-failing setup:

root@kind-control-plane:/# journalctl | grep -i "multi-user"
Sep 05 12:16:31 kind-control-plane systemd[1]: Reached target multi-user.target - Multi-User System.
root@kind-control-plane:/# journalctl | grep -i "Journal started"
Sep 05 12:16:31 kind-control-plane systemd-journald[98]: Journal started

We can see that the multi-user.target [2] is reached at the same time the journal starts logging. On the failing test, there is already a 24-second difference. I'm wondering whether, randomly (under heavy load), we exceed the 30-second timeout [3] for reaching the multi-user.target, hence the failure. (A small sketch of this check follows the footnotes.)

Footnotes

  1. https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1829400754293575680/artifacts/clusters/clusterctl-upgrade-management-fba3o1/logs-kind/clusterctl-upgrade-management-fba3o1-control-plane/journal.log

  2. https://github.com/kubernetes-sigs/kind/blob/52394ea8a92eed848d086318e983697f4a5afa93/pkg/cluster/internal/providers/common/cgroups.go#L44

  3. https://github.com/kubernetes-sigs/kind/blob/52394ea8a92eed848d086318e983697f4a5afa93/pkg/cluster/internal/providers/docker/provision.go#L414
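
A small sketch of the timestamp comparison above, for triaging a retained journal.log from a failed run (same grep patterns as above; GNU date assumed; the 30s threshold comes from the kind provision timeout in footnote 3):

# print the delta in seconds between "Journal started" and reaching multi-user.target
$ start=$(grep -i "Journal started" journal.log | head -1 | awk '{print $1" "$2" "$3}')
$ target=$(grep -i "Reached target multi-user.target" journal.log | head -1 | awk '{print $1" "$2" "$3}')
$ echo $(( $(date -d "$target" +%s) - $(date -d "$start" +%s) ))   # anything over 30 would blow kind's wait timeout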

@BenTheElder
Member

It's possible? This part shouldn't really take long though...

I suspect that would be a noisy-neighbor problem on the EKS cluster (I/O?).

Doesn't explain the inotify-exhaustion-like failures.

@sbueringer
Copy link
Member

We recently increased concurrency in our tests. With that we were able to reduce Job durations from 2h to 1h.

We thought it would be a nice way to save us time and the community money.

Maybe we have to roll that back.

@tormath1
Copy link

We recently increased concurrency in our tests. With that we were able to reduce Job durations from 2h to 1h.

We thought it would be a nice way to save us time and the community money.

Maybe we have to roll that back.

Do you remember when this change was applied? Those Kind failures seem to have started around the end of August.

@sbueringer
Member

main: #11067
release-1.8: #11144
