
HPA metric got stuck at a random value and not scaling down after reaching max replica count #597

Open
@Naveen-oops

Description


What happened?

I use two custom metrics, A and B, in my HPA. A is a gauge-based metric called SLA Metric, while B is a count-based metric that tracks failed requests with HTTP status code 502 or 503 from Istio. Both metrics are scraped by Prometheus.

To use custom metrics in HPA, we're employing the Kube Metrics Adapter (link). When the application load increases, the value of the SLA Metric also increases, and the pods scale up until they reach the maximum replica count, as expected.
However, the problem arises when the load dissipates: the pods never scale down. Even though the SLA Metric's value in Prometheus is below the target, the HPA description still displays a stale metric value, which can be above or below the target.
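
For context, the two metrics are wired into the HPA as Object metrics via kube-metrics-adapter annotations (the Prometheus queries are visible in the describe output below). The metrics section of the spec looks roughly like the following sketch; it is illustrative rather than our exact manifest, with the names and target values taken from the describe output further down:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
  # Prometheus query annotations omitted here; they are visible in the describe output below
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-pod
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Object                  # metric A: gauge-based SLA metric
      object:
        metric:
          name: avg-sla-breach
        describedObject:
          apiVersion: v1
          kind: Pod
          name: my-pod
        target:
          type: Value
          value: 500m
    - type: Object                  # metric B: failed-request rate from Istio
      object:
        metric:
          name: istio-requests-total
        describedObject:
          apiVersion: v1
          kind: Pod
          name: my-pod
        target:
          type: Value
          value: 200m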

One possible reason for this is that metric B, which relies on Istio requests, shows up as unknown: since there have been no failed requests with a 502 or 503 status code, the Prometheus query returns no data.
We noticed this behavior after upgrading Kubernetes from 1.21 to 1.24, changing the HPA API version from autoscaling/v2beta2 to autoscaling/v2, and upgrading kube-metrics-adapter from v0.1.16 to v0.1.19.
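
This can be confirmed by running the numerator of metric B's query directly against the Prometheus HTTP API; when there are no 502/503 requests it matches no series and the result is an empty vector. A sketch, assuming Prometheus is reachable at a placeholder address:

# "prometheus:9090" is a placeholder; substitute the real Prometheus endpoint
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(istio_requests_total{response_code=~"502|503",destination_service="my-pod.namespace.svc.cluster.local"}[1m]))'
# With no matching series the response is roughly:
# {"status":"success","data":{"resultType":"vector","result":[]}}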

kubectl describe hpa my-hpa

Name:                                                          my-hpa
Namespace:                                                     namespace
Labels:                                                        app.kubernetes.io/managed-by=Helm
Annotations:                                                   meta.helm.sh/release-name: my-pod
                                                               meta.helm.sh/release-namespace: default
                                                               metric-config.object.avg-sla-breach.prometheus/query:
                                                                 avg(
                                                                  avg_over_time(
                                                                     is_sla_breach{
                                                                       app="my-pod",
                                                                       canary="false"
                                                                     }[10m]
                                                                  )
                                                                 )
                                                               metric-config.object.istio-requests-total.prometheus/per-replica: true
                                                               metric-config.object.istio-requests-total.prometheus/query:
                                                                 sum(
                                                                   rate(
                                                                     istio_requests_total{
                                                                       response_code=~"502|503",
                                                                       destination_service="my-pod.namespace.svc.cluster.local"
                                                                     }[1m]
                                                                   )
                                                                 ) /
                                                                 count(
                                                                   count(
                                                                     container_memory_usage_bytes{
                                                                       namespace="namespace",
                                                                       pod=~"my-pod.*"
                                                                     }
                                                                   ) by (pod)
                                                                 )
CreationTimestamp:                                             Wed, 12 Jul 2023 17:52:21 +0530
Reference:                                                     Deployment/my-pod
Metrics:                                                       ( current / target )
  "istio-requests-total" on Pod/my-pod (target value):    <unknown> / 200m
  "avg-sla-breach" on Pod/my-pod (target value):  833m / 500m
Min replicas:                                                  1
Max replicas:                                                  3
Deployment pods:                                               3 current / 3 desired
Conditions:
  Type            Status  Reason                 Message
  ----            ------  ------                 -------
  AbleToScale     True    SucceededGetScale      the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetObjectMetric  the HPA was unable to compute the replica count: unable to get metric istio-requests-total: Pod on namespace my-pod/unable to fetch metrics from custom metrics API: the server could not find the metric istio-requests-total for pods my-pod
  ScalingLimited  True    TooManyReplicas        the desired replica count is more than the maximum replica count
Events:
  Type     Reason                 Age                       From                       Message
  ----     ------                 ----                      ----                       -------
  Warning  FailedGetObjectMetric  2m14s (x140768 over 25d)  horizontal-pod-autoscaler  unable to get metric istio-requests-total: Pod on namespace my-pod/unable to fetch metrics from custom metrics API: the server could not find the metric istio-requests-total for pods my-pod

To troubleshoot this further, we checked the metric value using:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/my-namespace/pods/my-pod/avg-sla-breach"

Output:

{"kind":"MetricValueList","apiVersion":"custom.metrics.k8s.io/v1beta1","metadata":{"selfLink":"/apis/custom.metrics.k8s.io/v1beta1/namespaces/my-namespace/pods/my-pod/avg-sla-breach"},"items":[{"describedObject":{"kind":"Pod","namespace":"my-namespace","name":"my-pod","apiVersion":"v1"},"metricName":"avg-sla-breach","timestamp":"2023-08-07T08:14:35Z","value":"0","selector":null}]}

Although the metric value returned by the custom metrics API is zero, the HPA description still displays the stale value.
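
The failing metric can be queried the same way; presumably this is the request the HPA controller keeps retrying, since the error matches the FailedGetObjectMetric condition in the describe output:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/my-namespace/pods/my-pod/istio-requests-total"
# Expected to fail with a not-found error along the lines of
# "the server could not find the metric istio-requests-total for pods my-pod"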

Workaround :

The HPA behaves as expected when the second metric, B, is completely removed or when its query is modified to return 0 when it would otherwise match no series.
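
For reference, the "return 0" variant simply appends a zero vector to the failed-request rate so that the query always yields a value, even when no 502/503 series exist. This is a sketch of the idea, not necessarily the exact annotation we shipped:

metric-config.object.istio-requests-total.prometheus/query: |
  (
    sum(
      rate(
        istio_requests_total{
          response_code=~"502|503",
          destination_service="my-pod.namespace.svc.cluster.local"
        }[1m]
      )
    )
    or on() vector(0)
  ) /
  count(
    count(
      container_memory_usage_bytes{
        namespace="namespace",
        pod=~"my-pod.*"
      }
    ) by (pod)
  )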

What did you expect to happen?

The HPA should scale down properly based on one of the metrics, even when the other metric's value is unavailable.

How can we reproduce it (as minimally and precisely as possible)?

  • Set up the Kube Metrics Adapter (link).
  • Create a custom-metric-based HPA that uses two metrics, one of which has an undefined value.
  • Increase the load (i.e., the value of the other metric) so that the HPA kicks in and scales the pods up to the max replica count.
  • Reduce the load (i.e., the metric value); the reported metric value will remain stuck at a stale, seemingly random value (the commands after this list can be used to observe it).
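
To observe the stuck value during reproduction, the HPA status can be polled directly; in our case the reported current value for metric A stays at the stale value described above (my-hpa is the example name used earlier):

# Watch the replica count and reported metric values over time
kubectl get hpa my-hpa --watch

# Inspect the full autoscaling/v2 status, including status.currentMetrics and the conditions
kubectl get hpa.v2.autoscaling my-hpa -o yaml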

Anything else we need to know?

Has anyone faced similar issues with the HPA, or has the HPA behavior for multiple metrics changed recently, especially for scale-down events? Can anyone from the community look into the issue and provide some clarity?

Kubernetes version

v1.24.7

Cloud provider

EKS

OS version

Alpine Linux
