Prowjobs fail with Pod got deleted unexpectedly on community aws infrastructure #9901

Closed
chrischdi opened this issue Dec 19, 2023 · 9 comments
Labels
area/e2e-testing Issues or PRs related to e2e testing kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@chrischdi
Member

Which jobs are failing?

Several periodics fail over time.

Failures between 12th December and 19th December 2023 (7 days):

  • periodic-cluster-api-e2e-dualstack-and-ipv6-main: 5 failures in the last 7 days (runs every 2h)
  • capi-e2e-main: 7 failures in the last 7 days (runs every 2h)
  • capi-e2e-mink8s-main: 14 failures in the last 7 days (runs every 2h)
  • capi-e2e-main-1-24-1-25: 1 failure in the last 7 days (runs every 24h)
  • capi-e2e-main-1-26-1-27: 2 failures in the last 7 days (runs every 24h)

So for the affected jobs, this issue alone causes a failure rate of ~10.9% (= 29/(3*7*12 + 2*7)): 29 failures out of 266 scheduled runs.
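
For anyone who wants to re-check the number, here is a minimal sketch of the arithmetic. It assumes every scheduled run actually happened in the 7-day window (12 runs per day for the 2h jobs, 1 per day for the 24h jobs). The variable and function names are only illustrative and not taken from any Prow tooling.

```python
# Failure counts and schedules as listed above: (failures, run interval in hours).
jobs = {
    "periodic-cluster-api-e2e-dualstack-and-ipv6-main": (5, 2),
    "capi-e2e-main": (7, 2),
    "capi-e2e-mink8s-main": (14, 2),
    "capi-e2e-main-1-24-1-25": (1, 24),
    "capi-e2e-main-1-26-1-27": (2, 24),
}

DAYS = 7  # observation window


def expected_runs(interval_hours: int, days: int = DAYS) -> int:
    """Number of scheduled runs in the window (assumes no skipped runs)."""
    return days * 24 // interval_hours


total_failures = sum(failures for failures, _ in jobs.values())
total_runs = sum(expected_runs(interval) for _, interval in jobs.values())
failure_rate = total_failures / total_runs

print(f"{total_failures}/{total_runs} = {failure_rate:.1%}")  # 29/266 = 10.9%
```

If some scheduled runs were skipped or aborted for other reasons, the real denominator would be smaller and the rate slightly higher.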

Which tests are failing?

No test or artifacts get reported for the affected job.

Since when has it been failing?

It started when we moved jobs to the community-owned AWS infrastructure.

The exact start date is unknown, but this flake appears to have been occurring since at least September 1st, 2023.

Testgrid link

No response

Reason for failure (if possible)

The test job only says:

Job execution failed: Pod got deleted unexpectedly

Example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1736768378699255808

Anything else we need to know?

xref original issue where we started to move v1.3 jobs to the community owned AWS infrastructure:

We migrated all nodes in the EKS Prow build cluster to a public subnet, so all nodes have public IP addresses instead of routing all traffic via a NAT Gateway.

Refer to: kubernetes/org#4433 (comment)
Refer to: kubernetes.slack.com/archives/C8TSNPY4T/p1694020825316969

This might be related to the ongoing thread in #sig-k8s-infra titled "Nodes are randomly freezing and failing 🧵"

Label(s) to be applied

/kind failing-test
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 19, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chrischdi chrischdi changed the title Prowjobs fail with Pod got deleted unexpectedly Prowjobs fail with Pod got deleted unexpectedly on community aws infrastructure Dec 19, 2023
@sbueringer sbueringer added the area/e2e-testing Issues or PRs related to e2e testing label Dec 19, 2023
@sbueringer
Member

@ameukam I think you mentioned that this was investigated again. Any news? :)

@ameukam
Member

ameukam commented Dec 20, 2023

@ameukam I think you mentioned that this was investigated again. Any news? :)

We are still looking into this with the EKS Support team. It will take some time to get to the root cause.

@sbueringer
Member

sbueringer commented Dec 20, 2023

Thank you!

Any upstream issue / Slack discussions we can follow?

@adilGhaffarDev
Contributor

@ameukam any update on this failure? Also, as @sbueringer asked, is there an upstream issue that we can follow?

@sbueringer
Member

Does this still happen? (I just keep hearing that it's fixed :))

@adilGhaffarDev
Contributor

Does this still happen? (I just keep hearing that it's fixed :))

Yes, it is fixed. I don't see it on testgrid or on the triage board anymore. We can close this issue.

@fabriziopandini
Member

/close
As per comment above

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close
As per comment above



6 participants