- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Kubelet should be aware of node shutdown and trigger graceful shutdown of pods during a machine shutdown.
Users and cluster administrators expect that pods will adhere to the expected pod lifecycle, including pod termination. Currently, when a node shuts down, pods do not follow the expected pod termination lifecycle and are not terminated gracefully, which can cause issues for some workloads. This KEP aims to address this problem by making the kubelet aware of the underlying node shutdown. Kubelet will propagate this signal to pods, ensuring they can shut down as gracefully as possible.
- Make kubelet aware of underlying node shutdown event and trigger pod termination with sufficient grace period to shutdown properly
- Handle node shutdown in cloud-provider agnostic way
- Introduce a minimal shutdown delay in order to shut down the node as soon as possible (but not sooner)
- Focus on handling shutdown on systemd based machines
- Let users modify or change the existing pod lifecycle or introduce new inter-pod dependencies / shutdown ordering
- Support every Linux init and ACPI event handling mechanism (focus on the widely used logind from systemd)
- Provide a guarantee to handle all cases of graceful node shutdown; for example, an abrupt shutdown or sudden power cable pull cannot result in a graceful shutdown
- As a cluster administrator, I can configure the nodes in my cluster to allocate X seconds for my pods to terminate gracefully during a node shutdown
- As a developer, I can expect that my pods will terminate gracefully during node shutdowns
In the context of this KEP, shutdown refers to a shutdown of the underlying machine. On most Linux distros, shutdown can be initiated via a variety of methods, for example:
- Running `shutdown -h now`
- Running `shutdown -h +30` (schedules a delayed shutdown in 30 minutes)
- Running `systemctl poweroff`
- Physically pressing the power button on the machine
- If a machine is a VM, the underlying hypervisor can press the “virtual” power button
- For a cloud instance, stopping the instance via the Cloud API, e.g. via `gcloud compute instances stop`. Depending on the cloud provider, this may result in a virtual power button press by the underlying hypervisor.
Note: The use of `shutdown -h now` is dependent on the systemd version. This is explored in GitHub issue #124039.
Some of these cases will involve the machine receiving an ACPI event to change the power state. The machine can go from G0 (working state) to G2 (Soft Off) and finally to G3 (Off); see more info on ACPI states.
On Linux, prior to shutdown, a system daemon will usually listen for these events and perform a series of actions before userspace calls the `reboot(2)` system call with `LINUX_REBOOT_CMD_POWER_OFF` or `LINUX_REBOOT_CMD_HALT` to actually shut down the machine.
Historically, ACPI events were often handled by the acpid daemon, which uses a variety of mechanisms to watch ACPI events (e.g. reading `/proc/acpi/event` or `/dev/input/eventX` to react to power button presses). However, in most modern Linux distros today, systemd-logind has taken over as the main component reacting to ACPI events and initiating shutdown of the machine. On a system with systemd-logind, for example, pressing the power button will result in the systemd target `poweroff` being run (see `HandlePowerKey`), which will terminate all the systemd services running on the machine and eventually shut it down. However, in the context of Kubernetes, systemd is not aware of the pods and containers running on the machine, and systemd will simply kill them as regular Linux processes.
systemd-logind provides the ability for applications to delay shutdown and perform a series of actions before the shutdown completes, through a mechanism called "Inhibitor Locks". Applications can request to delay shutdown by taking an inhibitor lock, sending messages to logind over dbus. For delay-based locks, applications can request up to `InhibitDelayMaxSec` (a setting configured in `logind.conf`); such locks allow applications to receive sleep and shutdown events and block the shutdown from proceeding for up to the `InhibitDelayMaxSec` period in order to execute some critical work prior to shutdown/sleep. Inhibitor Locks were introduced in systemd 183 (released in 2012).
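
To make the mechanism concrete, the following is a minimal sketch (using the `github.com/godbus/dbus/v5` package referenced later in this KEP) of how an application can take a delay inhibitor lock from logind over dbus. The `acquireShutdownInhibitor` helper is purely illustrative and is not the kubelet's actual implementation.

```go
// Illustrative sketch only (not the kubelet implementation): take a logind
// "delay" inhibitor lock for shutdown over dbus. The lock is held for as long
// as the returned file descriptor stays open, up to InhibitDelayMaxSec.
package main

import (
	"fmt"
	"syscall"

	"github.com/godbus/dbus/v5"
)

func acquireShutdownInhibitor() (release func(), err error) {
	conn, err := dbus.SystemBus()
	if err != nil {
		return nil, err
	}
	logind := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")

	var fd dbus.UnixFD
	err = logind.Call("org.freedesktop.login1.Manager.Inhibit", 0,
		"shutdown", // what: the operation to delay
		"kubelet",  // who: a human-readable name of the requester
		"Kubelet needs time to terminate pods gracefully", // why
		"delay", // mode: delay (not block) the shutdown
	).Store(&fd)
	if err != nil {
		return nil, fmt.Errorf("failed to take inhibitor lock: %w", err)
	}
	// Closing the descriptor releases the lock and lets the shutdown proceed.
	return func() { syscall.Close(int(fd)) }, nil
}

func main() {
	release, err := acquireShutdownInhibitor()
	if err != nil {
		fmt.Println("could not acquire inhibitor lock:", err)
		return
	}
	defer release()
	fmt.Println("delay inhibitor lock held")
}
```

Closing the returned file descriptor releases the lock, which is what an application would do once its pre-shutdown work is complete.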
We believe that making use of systemd is a reasonable approach considering almost all new popular Linux distros are systemd based (RHEL, Google COS, Ubuntu, CentOS, Debian, Fedora, Flatcar Linux, see widespread adoption) and systemd 183 (released in 2012) features support for inhibitors.
Thanks to @giuseppe for helping with getting systemd inhibitors working!
Introduce a new Kubelet Config setting, `kubeletConfig.ShutdownGracePeriod`, defaulting to 0 seconds. Upon kubelet startup:

- if the setting is greater than 0 seconds:
  - kubelet will check with dbus the current `InhibitDelayMaxSec` to check if `kubeletConfig.ShutdownGracePeriod` <= `InhibitDelayMaxSec`.
  - if `kubeletConfig.ShutdownGracePeriod` > `InhibitDelayMaxSec`:
    - Kubelet will attempt to update the `InhibitDelayMaxSec` setting by writing a config file to `/etc/systemd/logind.conf.d/kubelet.conf` and sending a SIGHUP to logind to update the config setting, to ensure that the `ShutdownGracePeriod` from kubelet config is equal to `InhibitDelayMaxSec`.
After updating the `InhibitDelayMaxSec` on the node if needed, Kubelet will query dbus for the final value of `InhibitDelayMaxSec` set on the node and treat min(`InhibitDelayMaxSec`, `kubeletConfig.ShutdownGracePeriod`) as the allocatable shutdown grace period, which will be referred to in this KEP as `ShutdownGracePeriod`.
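
As an illustration of the dbus query described above, logind exposes the effective maximum delay as the `InhibitDelayMaxUSec` property (in microseconds) of `org.freedesktop.login1.Manager`. The sketch below reads it with godbus; the helper name and error handling are illustrative rather than kubelet code.

```go
// Illustrative sketch only: query logind's effective InhibitDelayMaxSec value
// over dbus, exposed as the InhibitDelayMaxUSec property (microseconds).
package main

import (
	"fmt"
	"time"

	"github.com/godbus/dbus/v5"
)

func currentInhibitDelayMax() (time.Duration, error) {
	conn, err := dbus.SystemBus()
	if err != nil {
		return 0, err
	}
	logind := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")
	prop, err := logind.GetProperty("org.freedesktop.login1.Manager.InhibitDelayMaxUSec")
	if err != nil {
		return 0, err
	}
	usec, ok := prop.Value().(uint64)
	if !ok {
		return 0, fmt.Errorf("unexpected type %T for InhibitDelayMaxUSec", prop.Value())
	}
	return time.Duration(usec) * time.Microsecond, nil
}

func main() {
	delay, err := currentInhibitDelayMax()
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	// e.g. compare against kubeletConfig.ShutdownGracePeriod and take the min.
	fmt.Println("InhibitDelayMaxSec is currently", delay)
}
```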
Kubelet will register with dbus a delay-type systemd inhibitor lock for `ShutdownGracePeriod` for the shutdown event. Kubelet will also register for the `PrepareForShutdown` signal, which will be emitted prior to the shutdown. Upon receiving the signal, Kubelet will have an additional `ShutdownGracePeriod` of time before the node actually initiates the shutdown.
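
The sketch below shows, in an illustrative (non-kubelet) form, how a process can subscribe to logind's `PrepareForShutdown` signal with godbus and react when the shutdown actually starts.

```go
// Illustrative sketch only: subscribe to logind's PrepareForShutdown signal.
// The signal carries a single boolean: true when a shutdown is starting,
// false when a queued shutdown was cancelled.
package main

import (
	"fmt"

	"github.com/godbus/dbus/v5"
)

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}
	// Ask the bus to route PrepareForShutdown signals from logind to us.
	if err := conn.AddMatchSignal(
		dbus.WithMatchInterface("org.freedesktop.login1.Manager"),
		dbus.WithMatchMember("PrepareForShutdown"),
	); err != nil {
		panic(err)
	}

	signals := make(chan *dbus.Signal, 1)
	conn.Signal(signals)

	for sig := range signals {
		if len(sig.Body) == 0 {
			continue
		}
		if shuttingDown, ok := sig.Body[0].(bool); ok && shuttingDown {
			fmt.Println("shutdown starting: begin graceful pod termination")
			// ... terminate pods within ShutdownGracePeriod, then release the
			// inhibitor lock so the shutdown can proceed ...
		}
	}
}
```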
Handling the shutdown
Upon a shutdown occurring, Kubelet will gracefully terminate all the pods running on the node and update the Ready condition of the node to false with a message `Node Shutting Down`, thereby ensuring new workloads will not get scheduled to the node.
Since some of the pods running on the node are often critical for the workloads running on the node (e.g. a logging daemonset, kube-proxy, kube-dns, etc.), we choose to split the pods running on the node into two categories: "critical system pods" and regular pods. Critical system pods should be terminated last because, for example, if the logging pod is terminated first, logs from the other workloads will not be captured. Critical system pods are identified as those that are in the `system-cluster-critical` or `system-node-critical` priority classes.
Upon shutdown Kubelet will:

- Update the Node's `Ready` condition to `false`, with the reason `Node is shutting down`
- Gracefully terminate all non-critical system pods with a `gracePeriodOverride` computed as `min(podSpec.terminationGracePeriodSeconds, ShutdownGracePeriod-ShutdownGracePeriodCriticalPods)` (see the illustrative sketch below)
- Gracefully terminate all critical system pods with a `gracePeriodOverride` of `ShutdownGracePeriodCriticalPods` seconds
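
The following sketch only illustrates the grace period split described in the list above; all names and the hard-coded values are hypothetical and are not kubelet APIs.

```go
// Illustrative only: how the per-pod gracePeriodOverride described above could
// be computed. All names and values here are hypothetical, not kubelet APIs.
package main

import "fmt"

const (
	shutdownGracePeriod             int64 = 30 // kubeletConfig.ShutdownGracePeriod (seconds)
	shutdownGracePeriodCriticalPods int64 = 10 // kubeletConfig.ShutdownGracePeriodCriticalPods (seconds)
)

// gracePeriodOverride returns the grace period (in seconds) a pod is given
// during a node shutdown.
func gracePeriodOverride(terminationGracePeriodSeconds int64, critical bool) int64 {
	if critical {
		// Critical system pods (system-cluster-critical / system-node-critical
		// priority classes) are terminated last with the reserved tail of the budget.
		return shutdownGracePeriodCriticalPods
	}
	budget := shutdownGracePeriod - shutdownGracePeriodCriticalPods
	if terminationGracePeriodSeconds < budget {
		return terminationGracePeriodSeconds
	}
	return budget
}

func main() {
	fmt.Println(gracePeriodOverride(60, false)) // user pod requesting 60s -> 20
	fmt.Println(gracePeriodOverride(5, false))  // user pod requesting 5s  -> 5
	fmt.Println(gracePeriodOverride(60, true))  // critical system pod     -> 10
}
```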
Kubelet will use the same existing `killPod` function to perform the termination of pods, using `gracePeriodOverride` to set the appropriate grace period. During the termination process, normal pod termination processes will apply, e.g. preStop hooks will be called, SIGTERM will be delivered to containers, etc.
To ensure `gracePeriodOverride` is respected, GitHub issue #92432 should also be addressed so that the grace period override is respected for preStop hooks.
POC: I’ve prototyped an initial POC
here of the proposed
implementation on the shutdown
branch.
- Kubelet does not receive the shutdown event or is unable to create the inhibitor lock
  - Mitigation: Kubelet does not provide graceful shutdown to pods (same as today’s existing behavior). For the alpha stage, to track shutdown behavior and whether it was successful, we plan to add a debugging log statement just prior to kubelet's shutdown process completing, so it's possible to verify whether kubelet shut down the node gracefully.
- Kubelet is unable to update `InhibitDelayMaxSec` in logind to match that of `kubeletConfig.ShutdownGracePeriod`
  - If there are multiple logind configuration file overrides in `/etc/systemd/logind.conf.d/`, logind will use the config file with the lexicographically latest name. As a result, in rare cases the kubelet’s `InhibitDelayMaxSec` conf file override may be overwritten by another config file (possibly placed by another service on the machine).
  - Mitigation: Kubelet will use the current value of `InhibitDelayMaxSec` from logind as the shutdown period, which may be less than `kubeletConfig.ShutdownGracePeriod`.
- OS / Distro does not use systemd or systemd version < 183
  - Mitigation: Kubelet will not provide graceful shutdown to pods (same as today’s existing behavior).
The design proposes adding a new KubeletConfig field, `ShutdownGracePeriod`, used to specify the total time period kubelet should delay shutdown by and thus the total time allocated to the graceful termination process.

In addition to `ShutdownGracePeriod`, another KubeletConfig field will be added, `ShutdownGracePeriodCriticalPods`. During the shutdown, the `ShutdownGracePeriod - ShutdownGracePeriodCriticalPods` duration will be the grace period for non-critical pods like user workloads, while the remaining `ShutdownGracePeriodCriticalPods` will be the grace period for critical pods like node logging daemonsets.
type KubeletConfiguration struct {
...
ShutdownGracePeriod metav1.Duration
ShutdownGracePeriodCriticalPods metav1.Duration
}
Communication with systemd over dbus (creating the inhibitor lock, receiving the `PrepareForShutdown` callback, etc.) will make use of the `github.com/godbus/dbus/v5` package, which is already included in `vendor/`.
Termination of pods will make use of the existing `killPod` function from the kubelet package, specifying the appropriate `gracePeriodOverride` as necessary.
- Unit tests for kubelet handling of the shutdown event
- New E2E tests to validate node graceful shutdown (note the limitation that K8s E2E tests currently only run on GCE).
- Shutdown grace period unspecified, feature is not active
- Pod’s ExecStop and SIGTERM handlers are given `gracePeriodSeconds` for the case when `gracePeriodSeconds` <= `kubeletConfig.ShutdownGracePeriod`
- Pod’s ExecStop and SIGTERM handlers are given `kubeletConfig.ShutdownGracePeriod` for the case when `gracePeriodSeconds` > `kubeletConfig.ShutdownGracePeriod`
- Implemented the feature for Linux (systemd) only
- Unit tests
- Unit tests will mock out system components (i.e. systemd, inhibitors) for alpha
- Investigate how e2e tests can be implemented (e.g. may need to create fake shutdown event)
- Addresses feedback from alpha testers
- Sufficient E2E and unit testing
- Addresses feedback from beta
- Sufficient number of users using the feature
- Confident that no further API / kubelet config option changes are needed
- Close on any remaining open issues & bugs
n/a
n/a
This section must be completed when targeting alpha to a release.
- How can this feature be enabled / disabled in a live cluster?
  - Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: `GracefulNodeShutdown`
    - Components depending on the feature gate: kubelet
  - Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control plane?
      - no
    - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled).
      - yes (will require restart of kubelet)
- Does enabling the feature change any default behavior? Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here.
  - The main behavior change is that during a node shutdown, pods running on the node will be terminated gracefully.
- Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? Also set `disable-supported` to `true` or `false` in `kep.yaml`. Describe the consequences on existing workloads (e.g., if this is a runtime feature, can it break the existing applications?).
  - Yes, the feature can be disabled by either disabling the feature gate or setting `kubeletConfig.ShutdownGracePeriod` to 0 seconds.
- What happens if we reenable the feature if it was previously rolled back?
  - Kubelet will attempt to perform graceful termination of pods during a node shutdown.
- Are there any tests for feature enablement/disablement? The e2e framework does not currently support enabling or disabling feature gates. However, unit tests in each component dealing with managing data, created with and without the feature, are necessary. At the very least, think about conversion tests if API types are being modified.
  - n/a
This section must be completed when targeting beta graduation to a release.
- How can a rollout fail? Can it impact already running workloads? Try to be as paranoid as possible - e.g., what if some components will restart mid-rollout?
This feature should not impact rollouts.
- What specific metrics should inform a rollback?
N/A.
- Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Describe manual testing that was done and the outcomes. Longer term, we may want to require automated upgrade/rollback tests, but we are missing a bunch of machinery and tooling and can't do that now.
The feature is part of kubelet config so updating kubelet config should enable/disable the feature; upgrade/downgrade is N/A.
- Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? Even if applying deprecation policies, they may still surprise some users.
No.
This section must be completed when targeting beta graduation to a release.
- How can an operator determine if the feature is in use by workloads? Ideally, this should be a metric. Operations against the Kubernetes API (e.g., checking if there are objects with field X set) may be a last resort. Avoid logs or events for this purpose.
Check if the feature gate and kubelet config settings are enabled on a node.
- What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  - Metrics
    - Metric name:
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - Other (treat as last resort)
    - Details:

N/A
- What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
At a high level, this usually will be in the form of "high percentile of SLI
per day <= X". It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors <= 1%
- 99% percentile over day of absolute value from (job creation time minus expected job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
N/A.
- Are there any missing metrics that would be useful to have to improve observability of this feature? Describe the metrics themselves and the reasons why they weren't added (e.g., cost, implementation difficulties, etc.).
N/A.
This section must be completed when targeting beta graduation to a release.
- Does this feature depend on any specific services running in the cluster? Think about both cluster-level services (e.g. metrics-server) as well as node-level agents (e.g. specific version of CRI). Focus on external or optional services that are needed. For example, if this feature depends on a cloud provider API, or upon an external software-defined storage or network control plane.
  For each of these, fill in the following—thinking about running existing user workloads and creating new ones, as well as about cluster-level services (e.g. DNS):
  - [Dependency name]
    - Usage description:
      - Impact of its outage on the feature:
      - Impact of its degraded performance or high-error rates on the feature:

No, this feature doesn't depend on any specific services running in the cluster. It only depends on systemd running on the node itself.
For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.
- Will enabling / using this feature result in any new API calls?
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller) focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources (e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state, heartbeats, leader election, etc.)
No.
- Will enabling / using this feature result in introducing new API types?
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
No.
- Will enabling / using this feature result in any new calls to the cloud provider?
No.
- Will enabling / using this feature result in increasing size or count of
the existing API objects?
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
No.
- Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? Think about adding additional work or introducing new steps in between (e.g. need to do X to start a container), etc. Please describe the details.
No.
- Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? Things to keep in mind include: additional in-memory state, additional non-trivial computations, excessive access to disks (including increased log volume), significant amount of data sent and/or received over network, etc. Think through this both in small and large cases, again with respect to the supported limits.
No.
The Troubleshooting section currently serves the Playbook
role. We may consider
splitting it into a dedicated Playbook
document (potentially with some monitoring
details). For now, we leave it here.
This section must be completed when targeting beta graduation to a release.
- How does this feature react if the API server and/or etcd is unavailable?
The feature does not depend on the API server / etcd.
- What are other known failure modes? For each of them, fill in the following information by copying the below template:
  - [Failure mode brief description]
    - Detection: How can it be detected via metrics? Stated another way: how can an operator troubleshoot without logging into a master or worker node?
    - Mitigations: What can be done to stop the bleeding, especially for already running user workloads?
    - Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? Not required until feature graduated to beta.
    - Testing: Are there any tests for failure mode? If not, describe why.
-
What steps should be taken if SLOs are not being met to determine the problem?
N/A.
- 2020-05-26 - Original GH issue #91472 filed
- 2020-10-02 - Initial KEP approved
- 2020-11-12 - Initial Alpha implementation merged for k8s 1.20
- 2020-11-20 - Docs merged
- Use systemd cgroup driver to set `TimeoutStopSec=` on the scopes underlying containers
  - Set `TimeoutStopSec=` for the container scopes using the value set in the pod for the termination grace period. The problem with this approach is that systemd doesn’t understand the preStop hooks.
- Use systemd cgroup driver to set `Before=kubelet.service` on the scopes underlying containers
  - Set `Before=kubelet.service` and the container runtime service for the container scopes. Systemd would then stop the containers after the kubelet, giving the kubelet a chance to stop the containers itself. This depends upon using the systemd cgroups driver and is coupled to systemd.
- Use systemd cgroup driver to set the controller property on the scope to delegate control to kubelet
  - Set the Controller dbus property for the container scopes and set `After=kubelet.service` for the containers. Systemd would then signal the kubelet over dbus to delegate the container scope termination. This requires more work in the kubelet and is also coupled to systemd and the systemd cgroup driver.
- Don’t handle node shutdown events at all, and have users drain nodes before
shutting them down.
- This is not always possible, for example if the shutdown is controlled by some external system (e.g. Preemptible VMs).
- Avoid relying on systemd and logind and directly hook into ACPI events on the
node.
- Unfortunately, this can create conflicts because only one systemd daemon should be monitoring ACPI events. Additionally, if the system is using systemd but kubelet did not integrate with it, systemd by default would terminate kubelet and other processes during a shutdown event.
- Provide more configuration options on how to split time during shutdown (e.g. split between critical pods and user workloads). Need more feedback from the community here.