Prowjobs fail with `Pod got deleted unexpectedly` on community aws infrastructure #9901

chrischdi · 2023-12-19T11:21:28Z

Which jobs are failing?

Several periodics fail over time.

Failures between 12th December and 19th December 2023 (7 days):

periodic-cluster-api-e2e-dualstack-and-ipv6-main: 5 failures in the last 7 days (runs every 2h)
capi-e2e-main: 7 failures in the last 7 days (runs every 2h)
capi-e2e-mink8s-main: 14 failures in the last 7 days (runs every 2h)
capi-e2e-main-1-24-1-25: 1 failure in the last 7 days (runs every 24h)
capi-e2e-main-1-26-1-27: 2 failures in the last 7 days (runs every 24h)

So for the affected jobs this issue causes a failure rate of ~10.9% (= `29/(3*7*12 + 2*7)`, i.e. 29 failures out of 266 runs).
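For anyone who wants to re-check the failure rate, here is a minimal sketch of the arithmetic. The failure counts and run intervals are taken from the list above; the variable names are only illustrative, and the run counts assume every scheduled run actually started (12 runs per day for the 2h jobs, 1 per day for the 24h jobs).

```python
# Failures per job over the 7-day window, as listed in the issue.
failures = {
    "periodic-cluster-api-e2e-dualstack-and-ipv6-main": 5,
    "capi-e2e-main": 7,
    "capi-e2e-mink8s-main": 14,
    "capi-e2e-main-1-24-1-25": 1,
    "capi-e2e-main-1-26-1-27": 2,
}

# Scheduled runs per day (assumption: no run was skipped).
runs_per_day = {
    "periodic-cluster-api-e2e-dualstack-and-ipv6-main": 12,  # every 2h
    "capi-e2e-main": 12,                                     # every 2h
    "capi-e2e-mink8s-main": 12,                              # every 2h
    "capi-e2e-main-1-24-1-25": 1,                            # every 24h
    "capi-e2e-main-1-26-1-27": 1,                            # every 24h
}

DAYS = 7

total_failures = sum(failures.values())
total_runs = sum(per_day * DAYS for per_day in runs_per_day.values())
rate = total_failures / total_runs

print(f"{total_failures}/{total_runs} = {rate:.1%}")
```

Note that the rate is an average over all five jobs; capi-e2e-mink8s-main alone accounts for 14 of the 29 failures.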

Which tests are failing?

No tests or artifacts are reported for the affected jobs.

Since when has it been failing?

Started when moving jobs to the community owned AWS infrastructure.

The exact start date is unknown, but this flake seems to be evident since Sept 1st 2023.

Testgrid link

No response

Reason for failure (if possible)

The test job only says:

Job execution failed: Pod got deleted unexpectedly

Example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1736768378699255808

Anything else we need to know?

xref original issue where we started to move v1.3 jobs to the community owned AWS infrastructure:

NodeNotReady test flakes on Release-1.3 test jobs #9379

We migrated all nodes in the EKS Prow build cluster to a public subnet, so all nodes have public IP addresses instead of routing all traffic via a NAT Gateway.

Refer to: kubernetes/org#4433 (comment)
Refer to: kubernetes.slack.com/archives/C8TSNPY4T/p1694020825316969

This might be related to the ongoing thread in #sig-k8s-infra titled "Nodes are randomly freezing and failing" 🧵

Label(s) to be applied

/kind failing-test
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

k8s-ci-robot · 2023-12-19T11:21:36Z

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sbueringer · 2023-12-19T11:37:48Z

@ameukam I think you mentioned that this was investigated again. Any news? :)

ameukam · 2023-12-20T08:50:59Z

> @ameukam I think you mentioned that this was investigated again. Any news? :)

We are still looking into this with the EKS Support team. It will take some time to find the root cause.

sbueringer · 2023-12-20T13:21:19Z

Thank you!

Any upstream issue / Slack discussions we can follow?

adilGhaffarDev · 2024-01-16T07:45:31Z

@ameukam any update on this failure? Also, as @sbueringer asked, is there an upstream issue that we can follow?

sbueringer · 2024-03-28T17:59:43Z

Does this still happen? (I just keep hearing that it's fixed :))

adilGhaffarDev · 2024-03-28T18:22:04Z

> Does this still happen? (I just keep hearing that it's fixed :))

Yes, it is fixed. I don't see it on testgrid or on the triage board. We can close this issue.

fabriziopandini · 2024-03-29T19:44:39Z

/close
As per comment above

k8s-ci-robot · 2024-03-29T19:44:44Z

@fabriziopandini: Closing this issue.

In response to this:

> /close
> As per comment above

k8s-ci-robot added the kind/failing-test and needs-triage labels Dec 19, 2023

chrischdi mentioned this issue Dec 19, 2023

NodeNotReady test flakes on Release-1.3 test jobs #9379

Closed

chrischdi changed the title ~~Prowjobs fail with Pod got deleted unexpectedly~~ Prowjobs fail with Pod got deleted unexpectedly on community aws infrastructure Dec 19, 2023

sbueringer added the area/e2e-testing label Dec 19, 2023

ameukam mentioned this issue Jan 22, 2024

AWS: Pod got deleted unexpectedly kubernetes/k8s.io#6303

Closed

k8s-ci-robot closed this as completed Mar 29, 2024
