
Talos is not respecting kubelet node shutdown timers when overridden in kubelet config. #7138

Closed
salkin opened this issue Apr 26, 2023 · 4 comments · Fixed by #7147

@salkin
Contributor

salkin commented Apr 26, 2023

Bug Report

Description

When custom timers are specified for kubelet node shutdown, Talos proceeds to terminate the kubelet immediately instead of giving it the configured amount of time.

Using this kubelet configuration:

    kubelet:
        image: {{registry}}/siderolabs/kubelet:v1.24.10 # The `image` field is an optional reference to an alternative kubelet image.
        # The `extraArgs` field is used to provide additional flags to the kubelet.
        # The `extraMounts` field is used to add additional mounts to the kubelet container.
        extraConfig:
          shutdownGracePeriod: "600s"
          shutdownGracePeriodCriticalPods: "100s"
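
For reference (not part of the original report): per the Kubernetes graceful node shutdown documentation, shutdownGracePeriodCriticalPods is carved out of shutdownGracePeriod, so with the values above regular pods get 500s to terminate and critical pods get the final 100s. A minimal Go sketch of that arithmetic:

    // Minimal sketch (not kubelet source) of how the two settings above
    // divide the shutdown window, per the Kubernetes graceful node
    // shutdown documentation.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        shutdownGracePeriod := 600 * time.Second             // shutdownGracePeriod: "600s"
        shutdownGracePeriodCriticalPods := 100 * time.Second // shutdownGracePeriodCriticalPods: "100s"

        // Critical pods get the tail of the window; regular pods get the rest.
        regularPods := shutdownGracePeriod - shutdownGracePeriodCriticalPods

        fmt.Printf("regular pods: %s, critical pods: %s\n",
            regularPods, shutdownGracePeriodCriticalPods) // regular pods: 8m20s, critical pods: 1m40s
    }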

Logs

Talos proceeds immediately to terminate the kubelet, even though a grace period should be given to it:

[  594.840902] [talos] reboot via API received. actor id: 1f53bb52-9f43-4fd0-811d-6b3668e6e191
[  594.953649] [talos] reboot sequence: 12 phase(s)
[  595.021476] [talos] phase cleanup (1/12): 1 tasks(s)
[  595.093348] [talos] task stopAllPods (1/1): starting
[  595.165554] [talos] task stopAllPods (1/1): waiting for kubelet lifecycle finalizers
[  595.270355] [talos] removing shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "eth5", "ip": "192.168.253.100"}
[  595.488855] [talos] removed address 192.168.253.100/32 from "eth5" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
[  595.672733] [talos] node IP skipped, please use .machine.kubelet.nodeIP to provide explicit subnet for the node IP {"component": "controller-runtime", "controller": "k8s.NodeIPController", "address": "192.168.253.101"}
[  595.932535] [talos] task stopAllPods (1/1): shutting down kubelet gracefully
[  596.028710] [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 57765, container kubelet)
[  596.484450] [talos] service[kubelet](Finished): Service finished successfully
[  596.606625] [talos] skipping pod monitoring/grafana-587b5655f4-wrlnw, state SANDBOX_NOTREADY
[  596.719247] [talos] skipping pod rook-ceph/rook-ceph-crashcollector-18-556-n-1-5f4d58d48-s7c82, state SANDBOX_NOTREADY

Environment

  • Talos version: v1.2.9
  • Kubernetes version: v1.24.10
  • Platform: Bare-metal
@smira
Member

smira commented Apr 26, 2023

I'm pretty sure it works (it gets passed to the kubelet); I wonder if it's something in the kubelet which doesn't accept such values.

Talos itself ignores these values. If the kubelet itself got the values (it reports them in the log on startup), there's nothing we can do about it on the Talos side.

@smira
Member

smira commented Apr 26, 2023

Ok, I found it: it looks like on the Talos side the max supported inhibit delay (that's the mechanism behind graceful shutdown) is hardcoded to 60s. So anything above 60s won't work, as the kubelet will deny it.

Not sure what exactly the systemd default is, but it seems to be around 30s.
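
For context (a sketch not from the thread): the kubelet's graceful node shutdown manager takes a logind inhibitor lock and reads the InhibitDelayMaxUSec property over D-Bus; on Talos that property is served by Talos' logind emulation rather than systemd. If the configured shutdownGracePeriod exceeds that maximum and cannot be raised, the kubelet refuses to enable graceful shutdown. Roughly, assuming the github.com/godbus/dbus/v5 client:

    // Sketch of the inhibit-delay check the kubelet performs against
    // logind (or Talos' logind emulation) over D-Bus. This is an
    // illustration, not the actual kubelet source.
    package main

    import (
        "fmt"
        "time"

        "github.com/godbus/dbus/v5"
    )

    func main() {
        conn, err := dbus.SystemBus()
        if err != nil {
            panic(err)
        }

        obj := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")
        prop, err := obj.GetProperty("org.freedesktop.login1.Manager.InhibitDelayMaxUSec")
        if err != nil {
            panic(err)
        }

        // The property is a uint64 count of microseconds.
        maxDelay := time.Duration(prop.Value().(uint64)) * time.Microsecond

        shutdownGracePeriod := 600 * time.Second // from the kubelet extraConfig above
        if shutdownGracePeriod > maxDelay {
            // Roughly the situation in this issue: Talos capped the delay
            // at 60s, so a 600s grace period could not take effect.
            fmt.Printf("grace period %s exceeds logind InhibitDelayMaxUSec %s\n",
                shutdownGracePeriod, maxDelay)
            return
        }
        fmt.Printf("graceful shutdown OK, inhibit delay max: %s\n", maxDelay)
    }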

@salkin
Contributor Author

salkin commented Apr 27, 2023

Thanks @smira for clarifying. It's fine by me to close the issue, if you do not want to add validation checks?

@smira
Member

smira commented Apr 27, 2023

I thought we could bump the default on the Talos side. I don't think there's anything wrong with making the max higher.

@smira smira self-assigned this Apr 27, 2023
smira added a commit to smira/talos that referenced this issue Apr 27, 2023
Fixes siderolabs#7138

This brings max shutdown period to 20 min that kubelet would accept.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
smira added a commit to smira/talos that referenced this issue Apr 27, 2023
Fixes siderolabs#7138

This brings max shutdown period to 20 min that kubelet would accept.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
smira added a commit to smira/talos that referenced this issue Apr 27, 2023
Fixes siderolabs#7138

This brings max shutdown period to 20 min that kubelet would accept.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
(cherry picked from commit 344746a)
salkin added a commit to nokia/talos that referenced this issue May 4, 2023
ensure to wait as long as possibly given to kubelet shutdown timers. Related to fix of siderolabs#7138

Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
smira pushed a commit to smira/talos that referenced this issue May 4, 2023
Ensure to wait as long as possibly given to kubelet shutdown timers.
Related to fix of siderolabs#7138

Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
smira pushed a commit to smira/talos that referenced this issue May 8, 2023
Ensure to wait as long as possibly given to kubelet shutdown timers.
Related to fix of siderolabs#7138

Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
(cherry picked from commit 339986d)
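
The follow-up commits above make Talos wait out the kubelet's graceful shutdown window instead of proceeding at once. A simplified illustration of that idea (not the actual Talos code; the kubeletIsRunning helper is hypothetical):

    // Simplified illustration of waiting for the kubelet to finish its
    // graceful shutdown, bounded by the configured grace period.
    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // kubeletIsRunning is a hypothetical stand-in for however the
    // supervisor checks whether the kubelet process is still alive.
    func kubeletIsRunning() bool { return false }

    func waitForKubeletShutdown(ctx context.Context, gracePeriod time.Duration) error {
        // Bound the wait by the configured shutdownGracePeriod so a node
        // reboot is never blocked forever.
        ctx, cancel := context.WithTimeout(ctx, gracePeriod)
        defer cancel()

        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()

        for {
            select {
            case <-ctx.Done():
                return fmt.Errorf("kubelet did not stop within %s", gracePeriod)
            case <-ticker.C:
                if !kubeletIsRunning() {
                    return nil // kubelet finished evicting pods and exited
                }
            }
        }
    }

    func main() {
        if err := waitForKubeletShutdown(context.Background(), 600*time.Second); err != nil {
            fmt.Println(err)
        }
    }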
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 14, 2024