@@ -102,9 +102,9 @@ checklist items _must_ be updated for the enhancement to be released.
102102
103103Items marked with (R) are required *prior to targeting to a milestone / release*.
104104
105- - [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [ kubernetes/enhancements] (not the initial KEP PR)
106- - [ ] (R) KEP approvers have approved the KEP status as ` implementable `
107- - [ ] (R) Design details are appropriately documented
105+ - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
106+ - [x] (R) KEP approvers have approved the KEP status as `implementable`
107+ - [x] (R) Design details are appropriately documented
108108- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
109109 - [ ] e2e Tests for all Beta API Operations (endpoints)
110110 - [ ] (R) Ensure GA e2e tests meet requirements for [ Conformance Tests] ( https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md )
@@ -283,7 +283,7 @@ when drafting this test plan.
283283[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
284284-->
285285
286- [ ] I/we understand the owners of the involved components may require updates to
286+ [x] I/we understand the owners of the involved components may require updates to
287287existing tests to make this code solid enough prior to committing the changes necessary
288288to implement this enhancement.
289289
@@ -335,7 +335,7 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
335335https://storage.googleapis.com/k8s-triage/index.html
336336-->
337337
338- - < test >: < link to test coverage >
338+ N/A, the feature is tested using unit tests and e2e tests.
339339
340340##### e2e tests
341341
@@ -491,7 +491,8 @@ well as the [existing list] of feature gates.
491491
492492- [x] Feature gate (also fill in values in ` kep.yaml ` )
493493 - Feature gate name: HPAConfigurableTolerance
494- - Components depending on the feature gate: ` kube-controller-manager `
494+ - Components depending on the feature gate: ` kube-controller-manager ` and
495+ ` kube-apiserver ` .
495496
496497###### Does enabling the feature change any default behavior?
497498
@@ -517,7 +518,8 @@ NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
517518
518519The feature can be disabled by restarting the ` kube-controller-manager ` with the feature gate set to ` false ` .
519520
520- Any ` tolerance ` values set on existing HPAs will be ignored by the ` kube-controller-manager ` when the feature gate is off.
521+ Any ` tolerance ` values set on existing HPAs will be ignored by the
522+ ` kube-controller-manager ` and ` kube-apiserver ` when the feature gate is off.
521523
522524###### What happens if we reenable the feature if it was previously rolled back?
523525
@@ -538,6 +540,9 @@ You can take a look at one potential example of such test in:
538540https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
539541-->
540542
543+ We will add a unit test verifying that HPAs with and without the new fields are
544+ properly validated, both when the feature gate is enabled and when it is disabled.
545+
541546### Rollout, Upgrade and Rollback Planning
542547
543548<!--
@@ -594,6 +599,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
594599logs or events for this purpose.
595600-->
596601
602+ The presence of the new ` tolerance ` HPA field indicates that the feature is
603+ used.
604+
597605###### How can someone using this feature know that it is working for their instance?
598606
599607<!--
@@ -605,13 +613,18 @@ and operation of this feature.
605613Recall that end users cannot usually observe component logs or access metrics.
606614-->
607615
608- - [ ] Events
609- - Event Reason:
610- - [ ] API .status
611- - Condition name:
612- - Other field:
613- - [ ] Other (treat as last resort)
614- - Details:
616+ - [X] Events
617+ - Event Reason: ` SuccessfulRescale `
618+
619+ The tolerance is applied to the ratio between the _current_ and _desired_ metric
620+ values. Users can get both values using
621+ [`kubectl describe`](https://github.com/kubernetes/kubernetes/blob/1b7a0591871772fbbc0fda430b3b73bc24c0e738/staging/src/k8s.io/kubectl/pkg/describe/describe.go#L4109)
622+ and use them to verify that scaling events are triggered when their ratio is out
623+ of tolerance.
624+
625+ We will update the controller-manager logs to help users understand the behavior
626+ of the autoscaler. The data added to the logs will include the tolerance used
627+ for each scaling decision.
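
As a rough illustration of the check a user can perform by hand with the two values reported by `kubectl describe`, the sketch below compares the current/desired ratio against a tolerance. The function name `shouldScale` and the plain `float64` inputs are assumptions made for this example; it is not the controller's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// shouldScale is a hypothetical helper. It reports whether the ratio of the
// current metric value to the desired one deviates from 1.0 by more than
// the given tolerance, i.e. whether the ratio is "out of tolerance".
func shouldScale(current, desired, tolerance float64) bool {
	ratio := current / desired
	return math.Abs(ratio-1.0) > tolerance
}

func main() {
	// Ratio 54/50 = 1.08: inside a 10% tolerance, outside a 5% tolerance.
	fmt.Println(shouldScale(54, 50, 0.10))
	fmt.Println(shouldScale(54, 50, 0.05))
	// Ratio 40/50 = 0.80: outside a 10% tolerance.
	fmt.Println(shouldScale(40, 50, 0.10))
}
```

With the same metric values, lowering the tolerance from 10% to 5% turns a no-op into a scaling decision, which is the effect a `SuccessfulRescale` event would let users confirm.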
615628
616629###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
617630
@@ -630,18 +643,21 @@ These goals will help you determine what you need to measure (SLIs) in the next
630643question.
631644-->
632645
646+ N/A.
647+
633648###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
634649
635650<!--
636651Pick one or more of these and delete the rest.
637652-->
638653
639- - [ ] Metrics
640- - Metric name:
641- - [ Optional] Aggregation method:
642- - Components exposing the metric:
643- - [ ] Other (treat as last resort)
644- - Details:
654+ This KEP is not expected to have any impact on SLIs/SLOs, as it doesn't introduce
655+ new HPA behavior but merely allows users to easily change the value of a
656+ parameter that's otherwise difficult to update.
657+
658+ Standard HPA metrics (e.g.
659+ ` horizontal_pod_autoscaler_controller_metric_computation_duration_seconds ` ) can
660+ be used to verify the HPA controller health.
645661
646662###### Are there any missing metrics that would be useful to have to improve observability of this feature?
647663
@@ -650,6 +666,12 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
650666implementation difficulties, etc.).
651667-->
652668
669+ Users may want to see a signal that autoscaling isn't happening because of the
670+ tolerance, but this is not directly related to this KEP (this problem already
671+ exists today with the hard-coded 10% tolerance), and taking this KEP as an
672+ opportunity to improve the situation is difficult (see
673+ [this thread](https://github.com/kubernetes/enhancements/pull/4954#discussion_r1857098884)).
674+
653675### Dependencies
654676
655677<!--
@@ -775,6 +797,8 @@ Are there any tests that were run/should be run to understand performance charac
775797and validate the declared limits?
776798-->
777799
800+ No.
801+
778802### Troubleshooting
779803
780804<!--
@@ -820,6 +844,8 @@ Major milestones might include:
820844- when the KEP was retired or superseded
821845-->
822846
847+ 2025-01-21: KEP PR merged.
848+
823849## Drawbacks
824850
825851<!--