
Talos is not respecting kubelet node shutdown timers when overridden in kubelet config. #7138

Closed
salkin opened this issue Apr 26, 2023 · 4 comments · Fixed by #7147

@salkin
Contributor

salkin commented Apr 26, 2023

Bug Report

Description

When custom timers are specified for kubelet node shutdown, Talos proceeds to terminate the kubelet immediately instead of giving it the configured amount of time.

Using this kubelet configuration:

    kubelet:
        image: {{registry}}/siderolabs/kubelet:v1.24.10 # The `image` field is an optional reference to an alternative kubelet image.
        # The `extraArgs` field is used to provide additional flags to the kubelet.
        # The `extraMounts` field is used to add additional mounts to the kubelet container.
        extraConfig:
          shutdownGracePeriod: "600s"
          shutdownGracePeriodCriticalPods: "100s"
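
For reference (not part of the original report): per the Kubernetes graceful node shutdown documentation, shutdownGracePeriodCriticalPods is carved out of shutdownGracePeriod, so with the values above regular pods get 500s to terminate and critical pods get the final 100s. A minimal Go sketch of that arithmetic:

    // Minimal sketch (not kubelet source) of how the two settings above
    // divide the shutdown window, per the Kubernetes graceful node
    // shutdown documentation.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        shutdownGracePeriod := 600 * time.Second             // shutdownGracePeriod: "600s"
        shutdownGracePeriodCriticalPods := 100 * time.Second // shutdownGracePeriodCriticalPods: "100s"

        // Critical pods get the tail of the window; regular pods get the rest.
        regularPods := shutdownGracePeriod - shutdownGracePeriodCriticalPods

        fmt.Printf("regular pods: %s, critical pods: %s\n",
            regularPods, shutdownGracePeriodCriticalPods) // regular pods: 8m20s, critical pods: 1m40s
    }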

Logs

Talos proceeds immediately to terminate the kubelet, even though a grace period should be given to it:

[  594.840902] [talos] reboot via API received. actor id: 1f53bb52-9f43-4fd0-811d-6b3668e6e191
[  594.953649] [talos] reboot sequence: 12 phase(s)
[  595.021476] [talos] phase cleanup (1/12): 1 tasks(s)
[  595.093348] [talos] task stopAllPods (1/1): starting
[  595.165554] [talos] task stopAllPods (1/1): waiting for kubelet lifecycle finalizers
[  595.270355] [talos] removing shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "eth5", "ip": "192.168.253.100"}
[  595.488855] [talos] removed address 192.168.253.100/32 from "eth5" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
[  595.672733] [talos] node IP skipped, please use .machine.kubelet.nodeIP to provide explicit subnet for the node IP {"component": "controller-runtime", "controller": "k8s.NodeIPController", "address": "192.168.253.101"}
[  595.932535] [talos] task stopAllPods (1/1): shutting down kubelet gracefully
[  596.028710] [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 57765, container kubelet)
[  596.484450] [talos] service[kubelet](Finished): Service finished successfully
[  596.606625] [talos] skipping pod monitoring/grafana-587b5655f4-wrlnw, state SANDBOX_NOTREADY
[  596.719247] [talos] skipping pod rook-ceph/rook-ceph-crashcollector-18-556-n-1-5f4d58d48-s7c82, state SANDBOX_NOTREADY

Environment

  • Talos version: v1.2.9
  • Kubernetes version: v1.24.10
  • Platform: Bare-metal
@smira
Member

smira commented Apr 26, 2023

I'm pretty sure it works (it gets passed to the kubelet); I wonder if it's something in the kubelet which doesn't accept such values.

Talos itself ignores these values. If the kubelet itself got the values (it reports them in the log on startup), there's nothing we can do about it on the Talos side.

@smira
Member

smira commented Apr 26, 2023

Ok, I found it: it looks like on the Talos side the max supported inhibit delay (that's the mechanism behind graceful shutdown) is hardcoded to 60s. So anything above 60s won't work, as the kubelet will deny it.

Not sure what exactly the systemd default is, but it seems to be around 30s.
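
For context (a sketch not from the thread): the kubelet's graceful node shutdown manager takes a logind inhibitor lock and reads the InhibitDelayMaxUSec property over D-Bus; on Talos that property is served by Talos' logind emulation rather than systemd. If the configured shutdownGracePeriod exceeds that maximum and cannot be raised, the kubelet refuses to enable graceful shutdown. Roughly, assuming the github.com/godbus/dbus/v5 client:

    // Sketch of the inhibit-delay check the kubelet performs against
    // logind (or Talos' logind emulation) over D-Bus. This is an
    // illustration, not the actual kubelet source.
    package main

    import (
        "fmt"
        "time"

        "github.com/godbus/dbus/v5"
    )

    func main() {
        conn, err := dbus.SystemBus()
        if err != nil {
            panic(err)
        }

        obj := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")
        prop, err := obj.GetProperty("org.freedesktop.login1.Manager.InhibitDelayMaxUSec")
        if err != nil {
            panic(err)
        }

        // The property is a uint64 count of microseconds.
        maxDelay := time.Duration(prop.Value().(uint64)) * time.Microsecond

        shutdownGracePeriod := 600 * time.Second // from the kubelet extraConfig above
        if shutdownGracePeriod > maxDelay {
            // Roughly the situation in this issue: Talos capped the delay
            // at 60s, so a 600s grace period could not take effect.
            fmt.Printf("grace period %s exceeds logind InhibitDelayMaxUSec %s\n",
                shutdownGracePeriod, maxDelay)
            return
        }
        fmt.Printf("graceful shutdown OK, inhibit delay max: %s\n", maxDelay)
    }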

@salkin
Contributor Author

salkin commented Apr 27, 2023

Thanks @smira for clarifying. It's fine by me to close the issue, if you do not want to add validation checks?

@smira
Member

smira commented Apr 27, 2023

I thought we could bump the default on the Talos side. I don't think there's anything wrong with making the max higher.

@smira smira self-assigned this Apr 27, 2023
smira added a commit to smira/talos that referenced this issue Apr 27, 2023
Fixes siderolabs#7138

This brings max shutdown period to 20 min that kubelet would accept.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
smira added a commit to smira/talos that referenced this issue Apr 27, 2023
Fixes siderolabs#7138

This brings max shutdown period to 20 min that kubelet would accept.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
smira added a commit to smira/talos that referenced this issue Apr 27, 2023
Fixes siderolabs#7138

This brings max shutdown period to 20 min that kubelet would accept.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
(cherry picked from commit 344746a)
salkin added a commit to nokia/talos that referenced this issue May 4, 2023
ensure to wait as long as possibly given to kubelet shutdown timers. Related to fix of siderolabs#7138

Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
smira pushed a commit to smira/talos that referenced this issue May 4, 2023
Ensure to wait as long as possibly given to kubelet shutdown timers.
Related to fix of siderolabs#7138

Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
smira pushed a commit to smira/talos that referenced this issue May 8, 2023
Ensure to wait as long as possibly given to kubelet shutdown timers.
Related to fix of siderolabs#7138

Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
(cherry picked from commit 339986d)
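
The follow-up commits above make Talos wait out the kubelet's graceful shutdown window instead of proceeding at once. A simplified illustration of that idea (not the actual Talos code; the kubeletIsRunning helper is hypothetical):

    // Simplified illustration of waiting for the kubelet to finish its
    // graceful shutdown, bounded by the configured grace period.
    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // kubeletIsRunning is a hypothetical stand-in for however the
    // supervisor checks whether the kubelet process is still alive.
    func kubeletIsRunning() bool { return false }

    func waitForKubeletShutdown(ctx context.Context, gracePeriod time.Duration) error {
        // Bound the wait by the configured shutdownGracePeriod so a node
        // reboot is never blocked forever.
        ctx, cancel := context.WithTimeout(ctx, gracePeriod)
        defer cancel()

        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()

        for {
            select {
            case <-ctx.Done():
                return fmt.Errorf("kubelet did not stop within %s", gracePeriod)
            case <-ticker.C:
                if !kubeletIsRunning() {
                    return nil // kubelet finished evicting pods and exited
                }
            }
        }
    }

    func main() {
        if err := waitForKubeletShutdown(context.Background(), 600*time.Second); err != nil {
            fmt.Println(err)
        }
    }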
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 14, 2024