Cannot restart Kubelet when there is no network connection #2567

Closed · vpineda1996 opened this issue Nov 9, 2022 · 4 comments · Fixed by #2587

Labels: area/kubernetes (K8s including EKS, EKS-A, and including VMW), status/needs-triage (Pending triage or re-evaluation)

@vpineda1996 commented Nov 9, 2022

Image I'm using:
bottlerocket-aws-k8s-1.21-x86_64-v1.8.0-a6233c22

What I expected to happen:
When I reboot my machine while there is no network connection, I expect kubelet to come back online, or at least to see the process fail at runtime. If there is no network connection, kubelet should come up using the locally cached pause container image.

What actually happened:
The container in which kubelet runs fails to start because BR instructs containerd to fetch the image from ECR rather than use the cached copy.

Nov 08 20:18:19 ip-10-0-3-98.us-west-2.compute.internal host-ctr[1102632]: time="2022-11-08T20:18:19Z" level=info msg="pulling with Amazon ECR Resolver" ref="ecr.aws/arn:aws:ecr:us-west-2:193646904820:repository/eks/eks-distro/kubernetes/pause:v1.21.14-eks-1-21-19"
Nov 08 20:19:49 ip-10-0-3-98.us-west-2.compute.internal systemd[1]: kubelet.service: start-pre operation timed out. Terminating.
Nov 08 20:19:49 ip-10-0-3-98.us-west-2.compute.internal systemd[1]: kubelet.service: Control process exited, code=killed, status=15/TERM
Nov 08 20:19:49 ip-10-0-3-98.us-west-2.compute.internal systemd[1]: kubelet.service: Failed with result 'timeout'.
Nov 08 20:19:49 ip-10-0-3-98.us-west-2.compute.internal systemd[1]: Failed to start Kubelet.

How to reproduce the problem:

  1. Create a CPI or worker node with BR in AWS.
  2. Start an SSM session.
  3. Remove network access to the machine by revoking all egress security group rules (one way to script this is sketched after these steps). The SSM session should continue to work at this point.
  4. Restart kubelet.
  5. kubelet is never initialized.
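
For step 3, a minimal sketch using the AWS SDK for Go v2, assuming the node's security group still carries its default allow-all egress rule; the group ID below is a placeholder:

```go
// Sketch: revoke the default allow-all egress rule from the node's security
// group so the instance loses outbound network access (step 3 above).
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Revoke the default "all protocols to 0.0.0.0/0" egress rule.
	// sg-0123456789abcdef0 is a placeholder; substitute the real group ID.
	_, err = client.RevokeSecurityGroupEgress(ctx, &ec2.RevokeSecurityGroupEgressInput{
		GroupId: aws.String("sg-0123456789abcdef0"),
		IpPermissions: []types.IpPermission{{
			IpProtocol: aws.String("-1"),
			IpRanges:   []types.IpRange{{CidrIp: aws.String("0.0.0.0/0")}},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```
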
@jpculp (Member) commented Nov 11, 2022

Hi @vpineda1996, can you expand a bit on your use case? Network access is required to pull ECR credentials, but also pulling a fresh container on reboot resets you back to an unmodified state (excluding the files under the persistent storage locations).

@bcressey (Contributor) commented

@jpculp the attempt to pull the pause container via host-ctr doesn't complete before systemd gives up, which means that kubelet never gets started:

# Pull the pause container image before starting `kubelet` so `containerd/cri` wouldn't have to

There should be a better way to deal with this in the detached network case, where there's already a cached copy of the image on disk. Especially for the pause container, reusing the local copy if it exists should be good enough.
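
A minimal sketch of that local-first behavior, written against containerd's Go client (the library host-ctr is built on). This is illustrative only, not the change that actually landed in #2587; the socket path, namespace, and image ref are placeholder assumptions, and host-ctr's real ECR resolver is omitted:

```go
// Sketch: prefer the cached image in containerd's local store and only hit
// the registry when the image is missing, so a network outage does not block
// kubelet startup when the pause image is already on disk.
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/errdefs"
	"github.com/containerd/containerd/namespaces"
)

// pullIfMissing returns the image from the local content store when it is
// already cached, and falls back to a registry pull otherwise.
func pullIfMissing(ctx context.Context, client *containerd.Client, ref string) (containerd.Image, error) {
	img, err := client.GetImage(ctx, ref)
	if err == nil {
		return img, nil // cached copy exists; no network needed
	}
	if !errdefs.IsNotFound(err) {
		return nil, err
	}
	// Plain pull for illustration; host-ctr would use its ECR resolver here.
	return client.Pull(ctx, ref, containerd.WithPullUnpack)
}

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")
	if _, err := pullIfMissing(ctx, client, "registry.k8s.io/pause:3.8"); err != nil {
		log.Fatal(err)
	}
}
```
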

@vpineda1996 (Author) commented

Hey @jpculp, I think @bcressey has an idea of what I want to achieve. I might have phrased my requirements incorrectly, making it sound like I wanted to keep using the same instantiation of the pause container after kubelet is restarted. That is in fact NOT what I meant to say.

I want to reuse the cached pause image that was pulled and stored on the host. That means that after kubelet gets restarted, a fresh container must still be created, but instead of trying to pull the image every single time, BR should be smart enough to use the "local" image if it's present.

@vpineda1996 (Author) commented

I submitted a similar fix for the EKS AMI. awslabs/amazon-eks-ami#1090
