AWS: Pod got deleted unexpectedly #6303
Comments
A support ticket has been created with AWS to investigate this issue |
We have a discussion on Slack about the potential root cause: https://kubernetes.slack.com/archives/CCK68P2Q2/p1705919163947889 |
We received a response from AWS support and this is the most important bit:
They recommended the following, which matches what we discussed with @tzneal yesterday:
Out of the 7 instances that I provided to AWS support, 6 were removed by the AZRebalance feature and one was removed by cluster-autoscaler. We'll:
|
The proposed mitigation has been rolled out to the production cluster. I propose leaving this issue open for 7 days to monitor if the issue is gone. We can use Prow's Deck for monitoring: https://prow.k8s.io/?state=error&cluster=eks-prow-build-cluster |
thanks a ton @xmudrii |
@xmudrii I see a test that has been stuck indefinitely since yesterday, which was one of the failure modes observed in the past issue here. Link to the test that got triggered yesterday: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kops/16296/presubmit-kops-aws-scale-amazonvpc-using-cl2/1752113814377074688 Do you have any idea? |
@hakuna-matatah The mentioned job got OOMKilled:
(c554cf25-37b0-4445-acb0-d09669adc3ea is coming from https://prow.k8s.io/prowjob?prowjob=c554cf25-37b0-4445-acb0-d09669adc3ea) I don't know why this didn't get reported back to Prow though. I recommend increasing memory requests and limits for this job. |
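For anyone who wants to check for this condition themselves, here is a minimal sketch of spotting an OOMKilled container in a pod object. It assumes the pod has been fetched as JSON (for example with `kubectl get pod <name> -o json`) and relies on the standard Kubernetes pod status fields `containerStatuses[].state.terminated.reason` and `lastState`. The helper name `oom_killed_containers` and the sample data are hypothetical and are not taken from the actual job output.

```python
from typing import Any, Dict, List


def oom_killed_containers(pod: Dict[str, Any]) -> List[str]:
    """Return names of containers whose current or last state is
    'terminated' with reason 'OOMKilled'.

    `pod` is a dict shaped like the JSON form of a Kubernetes Pod object.
    """
    names: List[str] = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        # A container that was OOMKilled and restarted reports the kill
        # under lastState; one that stayed dead reports it under state.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(cs.get("name", "<unknown>"))
                break
    return names


# Hypothetical sample data for illustration only; the container names and
# exit code are assumptions, not the real status of the job discussed here.
sample_pod = {
    "metadata": {"name": "c554cf25-37b0-4445-acb0-d09669adc3ea"},
    "status": {
        "containerStatuses": [
            {
                "name": "test",
                "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {"name": "sidecar", "state": {"running": {}}},
        ]
    },
}

killed = oom_killed_containers(sample_pod)
```

If the returned list is non-empty, raising the memory requests and limits for the listed containers is the usual first step.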
@xmudrii Thanks for getting back. I can increase the memory limits as a quick fix for now, but it's weird that the failure wasn't reported back to Prow in this case. Do you want me to open an issue for this particular case, or do we want to track it here? Just as an FYI - this particular |
This is an issue with Prow, while this ticket tracks instability of a build cluster. I recommend raising an issue in the k/test-infra repo about this. |
This issue has been fixed since we applied the mitigation on 2024-01-24. Given that the cluster has been stable for more than two weeks, I think we can close this issue. Thank y'all for your patience while we figured this out! ❤️ |
@xmudrii: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
We have reports of prowjobs deleted on EKS cluster eks-prow-build-cluster with "Pod got deleted unexpectedly" on community aws infrastructure: kubernetes-sigs/cluster-api#9901
/kind bug
/area infra/aws
/priority important-soon
/milestone v1.30