NodeNotReady test flakes on Release-1.3 test jobs #9379

Closed

nawazkh opened this issue Sep 6, 2023 · 15 comments
Assignees
Labels
kind/flake: Categorizes issue or PR as related to a flaky test.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@nawazkh
Member

nawazkh commented Sep 6, 2023

Which jobs are flaking?

Which tests are flaking?

Not really applicable since the test never gets triggered.

Since when has it been flaking?

The exact start date is unknown, but this flake appears to have been evident since around Sept 1st.

  • Around Sept 1st, we migrated all nodes in the EKS Prow build cluster to a public subnet, so all nodes now have public IP addresses instead of routing all traffic via a NAT Gateway.

Refer to: kubernetes/org#4433 (comment)
Refer to: https://kubernetes.slack.com/archives/C8TSNPY4T/p1694020825316969

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.3#capi-e2e-release-1-3

Reason for failure (if possible)

The test never gets triggered. Scrolling to the bottom of the pod info shows "Node not ready".
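
(Not part of the original report.) As a minimal sketch of how one might confirm the NotReady symptom from the build-cluster side, the following Go snippet uses client-go to flag any node whose Ready condition is not True. The default kubeconfig path and access to the affected Prow build cluster are assumptions.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (assumption: it points at the build cluster to inspect).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List all nodes and report any whose Ready condition is not True.
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
				fmt.Printf("node %s not ready: %s (%s)\n", node.Name, cond.Reason, cond.Message)
			}
		}
	}
}
```

The same check can of course be done with kubectl; this is only meant to illustrate what "Node not ready" corresponds to in a node's status conditions.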

Anything else we need to know?

This might be related to the ongoing #sig-k8s-infra thread, "Nodes are randomly freezing and failing" 🧵

Label(s) to be applied

/kind flake

@k8s-ci-robot k8s-ci-robot added the kind/flake and needs-triage labels on Sep 6, 2023
@sbueringer
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Sep 7, 2023
@sbueringer
Member

Would be good to link the Slack thread directly :)

@sbueringer
Member

Did we talk to sig-k8s-infra and ask them if our problems could be related to their problem?

@nawazkh
Member Author

nawazkh commented Sep 7, 2023

Would be good to link the Slack thread directly :)

Absolutely! I wasn't sure if it was ok to link a Slack discussion.

@nawazkh
Member Author

nawazkh commented Sep 7, 2023

Did we talk to sig-k8s-infra and ask them if our problems could be related to their problem?

Thanks for bringing this up. I was not sure if we wanted to triage this on our end first, but I posted a message on their thread to bring it to their attention: https://kubernetes.slack.com/archives/CCK68P2Q2/p1694108045460259?thread_ts=1693476605.123389&cid=CCK68P2Q2

@sbueringer
Member

sbueringer commented Sep 7, 2023

Good point, sorry, I forgot what I said yesterday :). But I guess we're probably already at the point where we can't do much.

@sbueringer
Member

Any news? (I was on PTO the last few weeks)

@nawazkh
Member Author

nawazkh commented Oct 25, 2023

I am following up with #sig-k8s-infra over here: https://kubernetes.slack.com/archives/CCK68P2Q2/p1698254656470629?thread_ts=1694458280.112599&cid=CCK68P2Q2

Thanks for the reminder!

@sbueringer
Member

@ameukam mentioned during KubeCon that they worked around the issue by using a different OS for the nodes of the Prow cluster (IIRC). @nawazkh Any new occurrences of this issue?

@sbueringer
Member

@chrischdi
Member

chrischdi commented Dec 18, 2023

FYI, we are still getting jobs failing with "Pod got deleted unexpectedly" from time to time.

Example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1736768378699255808

@sbueringer sbueringer self-assigned this Dec 19, 2023
@sbueringer
Member

@chrischdi can we open a new issue and close this one? The title is highly misleading.

@chrischdi
Member

Created follow-up issue #9901

/close

Because release-1.3 jobs are gone.

@k8s-ci-robot
Contributor

@chrischdi: Closing this issue.

In response to this:

Created follow-up issue #9901

/close

Because release-1.3 jobs are gone.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
