Metric names become incorrect if any of ScaledObject's triggers is unavailable #2592

Closed
rwkarg opened this issue Feb 3, 2022 · 9 comments · Fixed by #2593

@rwkarg
Contributor

rwkarg commented Feb 3, 2022

Report

After some amount of time of successful operation, the HPA reports that some of the metrics it is looking for are not available. This is because the name of the metric in the HPA has been changed to something that appears to be incorrect. Restarting the operator and metrics-apiserver will temporarily correct the metric names in the HPA.

Initial report from Slack: https://kubernetes.slack.com/archives/C01JGDP8MB8/p1643049613000900

I first observed this with 2.5.0 and it is still occurring with 2.6.0. Prior to 2.5.0 we were running 2.2.0 so I don't know about the versions in between.

Initial list of metrics in the HPA when it's working (note that the sn- prefixes are unique and strictly increasing):

  externalMetricNames:
  - s0-rabbitmq-Europa-OnlinePlayerStatus-processUpdateIssueOnlineStatusRequest
  - s1-rabbitmq-Europa-OnlinePlayerStatus-processUpdateIssueOnlineStatusRequest
  - s2-rabbitmq-Europa-OnlinePlayerStatus-processSyncRequest
  - s3-rabbitmq-Europa-OnlinePlayerStatus-processSyncRequest
  - s4-rabbitmq-Europa-OnlinePlayerStatus-processIssueResolved
  - s5-rabbitmq-Europa-OnlinePlayerStatus-processIssueResolved
  - s6-rabbitmq-Europa-OnlinePlayerStatus-processIssueCreated
  - s7-rabbitmq-Europa-OnlinePlayerStatus-processIssueCreated

That same HPA later stops scaling appropriately (it scales out to max instances) and reports the following metric names. Note the duplicate s7- prefixes and the absence of an s1- prefix. The s7-rabbitmq-Europa-OnlinePlayerStatus-processUpdateIssueOnlineStatusRequest metric is reported as being unavailable, which makes sense since it is actually s1-rabbit... on the metrics server:

  externalMetricNames:
  - s0-rabbitmq-Europa-OnlinePlayerStatus-processUpdateIssueOnlineStatusRequest
  - s7-rabbitmq-Europa-OnlinePlayerStatus-processUpdateIssueOnlineStatusRequest
  - s2-rabbitmq-Europa-OnlinePlayerStatus-processSyncRequest
  - s3-rabbitmq-Europa-OnlinePlayerStatus-processSyncRequest
  - s4-rabbitmq-Europa-OnlinePlayerStatus-processIssueResolved
  - s5-rabbitmq-Europa-OnlinePlayerStatus-processIssueResolved
  - s6-rabbitmq-Europa-OnlinePlayerStatus-processIssueCreated
  - s7-rabbitmq-Europa-OnlinePlayerStatus-processIssueCreated
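
A quick way to spot this state across many HPAs is to parse the sn- prefixes and look for gaps and repeats. The following is only a sketch: `indexOf` and `checkPrefixes` are hypothetical helper names, not part of KEDA, and it assumes every name follows the `s<index>-<rest>` pattern shown above.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// indexOf extracts N from a name of the form "sN-<rest>".
// It returns false when the name does not carry such a prefix.
func indexOf(name string) (int, bool) {
	if !strings.HasPrefix(name, "s") {
		return 0, false
	}
	head, _, found := strings.Cut(name[1:], "-")
	if !found {
		return 0, false
	}
	n, err := strconv.Atoi(head)
	if err != nil || n < 0 {
		return 0, false
	}
	return n, true
}

// checkPrefixes reports which indices in 0..len(names)-1 never appear
// and which indices appear more than once.
func checkPrefixes(names []string) (missing, duplicated []int) {
	seen := make(map[int]int)
	for _, n := range names {
		if i, ok := indexOf(n); ok {
			seen[i]++
		}
	}
	for i := range names {
		if seen[i] == 0 {
			missing = append(missing, i)
		}
	}
	for i, count := range seen {
		if count > 1 {
			duplicated = append(duplicated, i)
		}
	}
	sort.Ints(duplicated) // map iteration order is random, so sort for stable output
	return missing, duplicated
}

func main() {
	// The broken list from the HPA above.
	names := []string{
		"s0-rabbitmq-Europa-OnlinePlayerStatus-processUpdateIssueOnlineStatusRequest",
		"s7-rabbitmq-Europa-OnlinePlayerStatus-processUpdateIssueOnlineStatusRequest",
		"s2-rabbitmq-Europa-OnlinePlayerStatus-processSyncRequest",
		"s3-rabbitmq-Europa-OnlinePlayerStatus-processSyncRequest",
		"s4-rabbitmq-Europa-OnlinePlayerStatus-processIssueResolved",
		"s5-rabbitmq-Europa-OnlinePlayerStatus-processIssueResolved",
		"s6-rabbitmq-Europa-OnlinePlayerStatus-processIssueCreated",
		"s7-rabbitmq-Europa-OnlinePlayerStatus-processIssueCreated",
	}
	missing, duplicated := checkPrefixes(names)
	fmt.Println(missing, duplicated) // [1] [7]
}
```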

Expected Behavior

The metric names written to the HPA should match the metrics being sent to the metrics server.

Actual Behavior

The metric names deviate from what is being sent to the metrics server.

Steps to Reproduce the Problem

  1. Install KEDA 2.5.0 or 2.6.0
  2. Deploy ScaledObject with RabbitMQ triggers (8 triggers in the above example)
  3. Wait...
  4. Sometimes the metric names will change as noted above

I'm running KEDA on 12 identical clusters right now and most of them don't run into this issue. For those that do, it can take anywhere from a few hours to a few days before this shows up.

Logs from KEDA operator

example

KEDA Version

2.6.0

Kubernetes Version

1.21

Platform

Google Cloud

Scaler Details

RabbitMQ

Anything else?

scalerIndex comes from this for loop, which supplies the number used in the sn- prefix of the metric names:

for scalerIndex, t := range withTriggers.Spec.Triggers {

Given that, I'm not clear on how, in the above example, there is no s1- prefix and there are duplicate s7- prefixes.
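
To make the puzzle concrete, here is a minimal sketch of index-based naming. It assumes the name is simply the trigger's position plus the scaler's base metric name; `metricNameWithIndex` and `namesFor` are hypothetical names used for illustration, not KEDA's actual functions. Built this way, in one pass, the names cannot skip or repeat an index.

```go
package main

import "fmt"

// metricNameWithIndex is a hypothetical helper: it prefixes a scaler's base
// metric name with the position of its trigger in the ScaledObject.
func metricNameWithIndex(scalerIndex int, base string) string {
	return fmt.Sprintf("s%d-%s", scalerIndex, base)
}

// namesFor builds one metric name per trigger, in trigger order.
func namesFor(baseNames []string) []string {
	names := make([]string, 0, len(baseNames))
	for scalerIndex, base := range baseNames {
		names = append(names, metricNameWithIndex(scalerIndex, base))
	}
	return names
}

func main() {
	// Two triggers on the same queue still get distinct names via the index.
	for _, n := range namesFor([]string{"rabbitmq-queue-a", "rabbitmq-queue-a", "rabbitmq-queue-b"}) {
		fmt.Println(n)
	}
	// s0-rabbitmq-queue-a
	// s1-rabbitmq-queue-a
	// s2-rabbitmq-queue-b
}
```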

Initially it was thought that this might be related to #2407, which was implemented in 2.6.0, but after upgrading to 2.6.0 this is still occurring.

@rwkarg added the bug (Something isn't working) label on Feb 3, 2022
@zroubalik
Member

@rwkarg do you see those incorrect metric names in metrics server? What metric names do you see in the ScaledObject.Status?

@rwkarg
Contributor Author

rwkarg commented Feb 4, 2022

The ScaledObject has the same incorrect list of metric names (duplicate s7- in the above example).

The metrics server has the expected metrics (single instance of s0- through s7-)

@rwkarg
Contributor Author

rwkarg commented Feb 4, 2022

Also noted that when this happens, the duplicate is always the last index.
For example, the above case has 8 metrics (s0- through s7-) and the duplicated, wrong metric name will always start with s7-, though the index that is missing (what the metric name should actually be) appears to be random (s1- is missing above, but it could be any one of those).
Similarly, if there are only 4 metrics, then s3- is always the duplicated prefix.
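
One pattern in Go that produces exactly this symptom (always the last index, with the right base name) is a deferred builder that shares the loop's index variable and is called after the loop has finished. The sketch below is a hypothesis for illustration only; this thread does not confirm that this is the cause or what #2593 changed. `buildFactories` and `buildFactoriesFixed` are hypothetical names, and the index is deliberately declared outside the loop so the behaviour is the same on every Go version.

```go
package main

import "fmt"

// buildFactories returns one deferred name builder per trigger.
// The index variable is shared by every closure, which mimics a captured
// loop variable in Go before 1.22.
func buildFactories(baseNames []string) []func() string {
	factories := make([]func() string, 0, len(baseNames))
	var i int
	for i = range baseNames {
		base := baseNames[i] // copied per iteration, so the base name stays correct
		factories = append(factories, func() string {
			return fmt.Sprintf("s%d-%s", i, base) // reads the shared index at call time
		})
	}
	return factories
}

// buildFactoriesFixed copies the index per iteration before capturing it.
func buildFactoriesFixed(baseNames []string) []func() string {
	factories := make([]func() string, 0, len(baseNames))
	var i int
	for i = range baseNames {
		idx, base := i, baseNames[i]
		factories = append(factories, func() string {
			return fmt.Sprintf("s%d-%s", idx, base)
		})
	}
	return factories
}

func main() {
	bases := []string{"queue-a", "queue-b", "queue-c", "queue-d"}

	// Names computed eagerly inside the loop are correct.
	names := make([]string, len(bases))
	for i, b := range bases {
		names[i] = fmt.Sprintf("s%d-%s", i, b)
	}

	// Later, a single trigger is rebuilt through its stored factory.
	names[1] = buildFactories(bases)[1]()
	fmt.Println(names) // [s0-queue-a s3-queue-b s2-queue-c s3-queue-d]

	names[1] = buildFactoriesFixed(bases)[1]()
	fmt.Println(names) // [s0-queue-a s1-queue-b s2-queue-c s3-queue-d]
}
```

If something like this is happening, it would also fit the observation that the names are correct right after a restart and only go wrong later, since only a rebuilt trigger picks up the stale index.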

@rwkarg
Contributor Author

rwkarg commented Feb 4, 2022

Maybe found something?
ERROR controller.scaledobject Failed to check HPA for possible update {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "global-processor", "namespace": "europa-us", "error": "metricName s9-rabbitmq-global-created-queue defined multiple times in ScaledObject global-processor, please refer the documentation how to define metricName manually

Maybe this is because there is both a message and a rate based trigger on the same queue, but -rate was removed from the rate based metric name (it's just rabbitmq-<queue_name> for both scaling types), so they would be the same metric name other than the sn- prefix.

I don't understand exactly how there's a duplicate with the sn- prefix, but that is an error that is preventing the HPA from being updated according to the logs.
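
For what it's worth, a uniqueness check of this kind only fails when two full names, prefix included, are identical. The sketch below uses a hypothetical `firstDuplicate` function and error text loosely modelled on the log line above; it is not KEDA's code. It shows that two triggers on the same queue pass with correct indices and fail only if one of them ends up with the other's index.

```go
package main

import "fmt"

// firstDuplicate returns an error naming the first metric name that occurs
// twice, or nil when every name is unique.
func firstDuplicate(names []string) error {
	seen := make(map[string]struct{}, len(names))
	for _, n := range names {
		if _, dup := seen[n]; dup {
			return fmt.Errorf("metricName %s defined multiple times", n)
		}
		seen[n] = struct{}{}
	}
	return nil
}

func main() {
	// Message and rate triggers on the same queue, with correct indices.
	ok := []string{"s8-rabbitmq-global-created-queue", "s9-rabbitmq-global-created-queue"}
	fmt.Println(firstDuplicate(ok)) // <nil>

	// The same pair after the first trigger picked up the wrong index.
	bad := []string{"s9-rabbitmq-global-created-queue", "s9-rabbitmq-global-created-queue"}
	fmt.Println(firstDuplicate(bad)) // metricName s9-rabbitmq-global-created-queue defined multiple times
}
```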

@zroubalik
Member

> Maybe found something?
> ERROR controller.scaledobject Failed to check HPA for possible update {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "global-processor", "namespace": "europa-us", "error": "metricName s9-rabbitmq-global-created-queue defined multiple times in ScaledObject global-processor, please refer the documentation how to define metricName manually
>
> Maybe this is because there is both a message and rate based trigger on the same queue, but -rate was removed from the rate based metric name (it's just rabbitmq-<queue_name> for both scaling types so they would be the same metric name, other than the sn- prefix.
>
> I don't understand exactly how there's a duplicate with the sn- prefix, but that is an error that is preventing the HPA from being updated according to the logs.

I think the sn- prefix should take care of that; IMHO the message vs. rate based trigger is not the culprit here.

By chance, do you see any other errors? One of the rabbit instances might have had some temporary issues, leaving that scaler unable to connect to it?
I am trying to figure out where the problem could be. One thing that comes to mind is that it could be a problem in resolving the scaler: when KEDA tries to refresh that particular scaler, there might be some mismatch between the cache and metricName generation. Some wrong indexing or a similarly nasty bug.

Anything special about your ScaledObjects config?

@zroubalik
Member

zroubalik commented Feb 4, 2022

I have probably found the bug: #2593, and we will most likely release 2.6.1 next week with this fix.

@rwkarg it would be great if you can check that fix on one of your setups, of course if there's a possibility to do so.
Images with the fix included are hosted here: quay.io/zroubalik/keda:indexFix, quay.io/zroubalik/keda-metrics-apiserver:indexFix

Thanks!

@rwkarg
Contributor Author

rwkarg commented Feb 4, 2022

Testing out indexFix and things are looking good so far. Will continue to monitor.

@rwkarg
Contributor Author

rwkarg commented Feb 7, 2022

Fix is looking good.

All HPAs across all clusters still have the correct metric names after the weekend.

@zroubalik
Member

@rwkarg excellent, thanks for the message!

@zroubalik changed the title from "Metric names in HPA become incorrect and workload scales to max" to "Metric names in become incorrect if any of the ScaledObject's trigger is unavailable" on Feb 7, 2022
@zroubalik changed the title from "Metric names in become incorrect if any of the ScaledObject's trigger is unavailable" to "Metric names in become incorrect if any of ScaledObject's trigger is unavailable" on Feb 7, 2022
@zroubalik changed the title from "Metric names in become incorrect if any of ScaledObject's trigger is unavailable" to "Metric names in become incorrect if any of ScaledObject's triggers is unavailable" on Feb 7, 2022
@zroubalik changed the title from "Metric names in become incorrect if any of ScaledObject's triggers is unavailable" to "Metric names become incorrect if any of ScaledObject's triggers is unavailable" on Feb 7, 2022