diff --git a/docs/usage/shoot_high_availability_best_practices.md b/docs/usage/shoot_high_availability_best_practices.md
index 5d63bcefb4d..b39b41191f4 100644
--- a/docs/usage/shoot_high_availability_best_practices.md
+++ b/docs/usage/shoot_high_availability_best_practices.md
@@ -296,7 +296,6 @@ spec:
       featureGates:
         MinDomainsInPodTopologySpread: true
     kubeControllerManager:
-      nodeMonitorPeriod: 10s
       nodeMonitorGracePeriod: 40s
       horizontalPodAutoscaler:
         syncPeriod: 15s
@@ -374,7 +373,7 @@ Please note, these settings replace `spec.kubernetes.kubeControllerManager.podEv
 
 Required to be enabled for `minDomains` to work with PTSCs (beta since Kubernetes `v1.25`, but off by default). See [above](#pod-topology-spread-constraints) and the [docs](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#topologyspreadconstraints-field). This tells the scheduler, how many topology domains to expect (=zones in the context of this document).
 
-#### On `spec.kubernetes.kubeControllerManager.nodeMonitorPeriod` and `nodeMonitorGracePeriod`
+#### On `spec.kubernetes.kubeControllerManager.nodeMonitorGracePeriod`
 
 This is another very interesting [kube-controller-manager setting](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager) that can help you speed up or slow down how fast a node shall be considered `Unknown` (node status unknown, a.k.a unreachable) when the `kubelet` is not updating its status anymore (see [node status conditions](https://kubernetes.io/docs/concepts/architecture/nodes/#condition)), which effects eviction (see `spec.kubernetes.kubeAPIServer.defaultUnreachableTolerationSeconds` and `defaultNotReadyTolerationSeconds` above). The shorter the time window, the faster Kubernetes will act, but the higher the chance of flapping behavior and pod trashing, so you may want to balance that out according to your needs, otherwise stick to the default which is a reasonable compromise.
 
@@ -390,9 +389,13 @@ This configures vertical pod autoscaling in Gardener-managed clusters. See [abov
 
 This configures node auto-scaling in Gardener-managed clusters. See [above](#worker-pools) and the [docs](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md) for the detailed fields, especially about [expanders](https://github.com/gardener/autoscaler/blob/machine-controller-manager-provider/cluster-autoscaler/FAQ.md#what-are-expanders), which may become life-saving in case of a zone outage when a resource crunch is setting in and everybody rushes to get machines in the healthy zones.
 
-In case of a zone outage, it may be interesting to understand how the cluster autoscaler will put a worker pool in one zone into "back-off". Unfortunately, the official cluster autoscaler documentation does not explain these details, but you can find hints in the [source code](https://github.com/kubernetes/autoscaler/blob/b94f340af58eb063df9ebfcd65835f9a499a69a2/cluster-autoscaler/config/autoscaling_options.go#L214-L219):
+In case of a zone outage, it is critical to understand how the cluster autoscaler will put a worker pool in one zone into "back-off" and what the consequences for your workload will be. Unfortunately, the official cluster autoscaler documentation does not explain these details, but you can find hints in the [source code](https://github.com/kubernetes/autoscaler/blob/b94f340af58eb063df9ebfcd65835f9a499a69a2/cluster-autoscaler/config/autoscaling_options.go#L214-L219):
 
-If a node fails to come up, the node group (worker pool in that zone) will go into "back-off", at first 5m, then [exponentially longer](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/utils/backoff/exponential_backoff.go#L77-L82) until the maximum of 30m is reached. The "back-off" is reset after 3 hours. This in turn means, that nodes must be first considered `Unknown`, which happens when `spec.kubernetes.kubeControllerManager.nodeMonitorPeriod.nodeMonitorGracePeriod` lapses. Then they must either remain in this state until `spec.provider.workers.machineControllerManager.machineHealthTimeout` lapses for them to be recreated, which will fail in the unhealthy zone, or `spec.kubernetes.kubeAPIServer.defaultUnreachableTolerationSeconds` lapses for the pods to be evicted (usually faster than node replacements, depending on your configuration), which will trigger the cluster autoscaler to create more capacity, but very likely in the same zone as it tries to balance its node groups at first, which will also fail in the unhealthy zone. It will be considered failed only when `maxNodeProvisionTime` lapses (usually close to `spec.provider.workers.machineControllerManager.machineCreationTimeout`) and only then put the node group into "back-off" and not retry for 5m at first and then exponentially longer. It's critical to keep that in mind and accommodate for it. If you have already capacity up and running, the reaction time is usually much faster with leases (whatever you set) or endpoints (`spec.kubernetes.kubeControllerManager.nodeMonitorPeriod.nodeMonitorGracePeriod`), but if you depend on new/fresh capacity, the above should inform you how long you will have to wait for it.
+If a node fails to come up, the node group (worker pool in that zone) will go into "back-off", at first 5m, then [exponentially longer](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/utils/backoff/exponential_backoff.go#L77-L82) until the maximum of 30m is reached. The "back-off" is reset after 3 hours. This in turn means that nodes must first be considered `Unknown`, which happens when `spec.kubernetes.kubeControllerManager.nodeMonitorGracePeriod` lapses (e.g. at the beginning of a zone outage). Then they must either remain in this state until `spec.provider.workers.machineControllerManager.machineHealthTimeout` lapses for them to be recreated, which will fail in the unhealthy zone, or `spec.kubernetes.kubeAPIServer.defaultUnreachableTolerationSeconds` lapses for the pods to be evicted (usually faster than node replacements, depending on your configuration), which will trigger the cluster autoscaler to create more capacity, but very likely in the same zone as it tries to balance its node groups at first, which will fail in the unhealthy zone. The provisioning attempt will be considered failed only when `maxNodeProvisionTime` lapses (usually close to `spec.provider.workers.machineControllerManager.machineCreationTimeout`); only then will the cluster autoscaler put the node group into "back-off" and not retry for 5m at first (and then exponentially longer). Only then can you expect new node capacity to be brought up somewhere else.
+
+During ongoing node provisioning (before a node group goes into "back-off"), the cluster autoscaler may have "virtually scheduled" pending pods onto those new upcoming nodes and will not reevaluate these pods unless the node provisioning fails (which it will during a zone outage, but the cluster autoscaler cannot know that and will therefore reevaluate its decision only after it has given up on the new nodes).
+
+It's critical to keep that in mind and account for it. If you already have capacity up and running, the reaction time is usually much faster with leases (whatever you set) or endpoints (`spec.kubernetes.kubeControllerManager.nodeMonitorGracePeriod`), but if you depend on new/fresh capacity, the above should inform you how long you will have to wait for it and for how long pods might be pending (because capacity is generally missing and pending pods may have been "virtually scheduled" to new nodes that won't come up until the node group eventually goes into "back-off" and nodes in the healthy zones come up).
 
 #### On `spec.provider.workers.minimum`, `maximum`, `maxSurge`, `maxUnavailable`, `zones`, and `machineControllerManager`
 
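A minimal Shoot manifest sketch, assuming purely illustrative values and a hypothetical worker pool name, that collects the timing-related fields referenced in the changed text (`nodeMonitorGracePeriod`, the default tolerations, the machine-controller-manager timeouts, and the cluster autoscaler's `maxNodeProvisionTime`):

```yaml
# Sketch only: illustrative values, not recommendations; tune to your workload.
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    kubeAPIServer:
      defaultNotReadyTolerationSeconds: 300     # how long pods tolerate a not-ready node before eviction
      defaultUnreachableTolerationSeconds: 300  # how long pods tolerate an unreachable (`Unknown`) node before eviction
    kubeControllerManager:
      nodeMonitorGracePeriod: 40s               # time until a silent kubelet turns its node `Unknown`
    clusterAutoscaler:
      maxNodeProvisionTime: 20m                 # time until the cluster autoscaler gives up on an upcoming node
  provider:
    workers:
    - name: example                             # hypothetical worker pool name
      machineControllerManager:
        machineCreationTimeout: 20m             # time until machine creation is considered failed
        machineHealthTimeout: 10m               # time until an unhealthy/unknown machine is replaced
```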
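To make the waiting time concrete, here is a rough, illustrative timeline based on the sketch values above (not a guaranteed sequence), assuming a zone outage starts at t=0: the affected nodes turn `Unknown` after `nodeMonitorGracePeriod` (~40s); their pods are evicted after a further `defaultUnreachableTolerationSeconds` (~5m) and become pending; the cluster autoscaler then tries to add capacity, very likely in the impacted zone first, and gives up only after `maxNodeProvisionTime` (~20m); only then does the node group go into "back-off" and capacity gets requested in the healthy zones, so fresh replacement capacity can easily be 25-30 minutes (or more, with retries) away.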