Colocating metrics provider along with the operator causes HPA delays if not configured properly #5624
Comments
Willing to participate in the discussion and to contribute to code/documentation.
Hello @bharathguvvala, I think the problem should be addressed by improving our docs about this and how to solve it. I'd rather address it in the docs because updating the CR is part of KEDA's operation (not only the fallback) and it's part of the CRD, so users can build checking systems on top of it. To give some context about that change: the underlying controller added rate limit support with quite restrictive values (5 and 10 IIRC), and we increased those values for the most common user profile (< 100 ScaledObjects) to avoid bursting the API server by mistake. If you are willing to help with the documentation, it'd be awesome! Real user experiences are the best teachers for other folks.
@JorTurFer For ScaledObjects that have not been configured with the fallback option, isn't it better to skip updating the fallback health status on every metrics read call, which would avoid redundant calls to the K8s API server? Instead, this condition could be updated from the controller during ScaledObject reconciliations. That would also mean the fallback health stats are only updated for ScaledObjects where the fallback is enabled, while for the rest they are defaulted at ScaledObject reconciliation time.
Regarding the documentation, I'll raise a PR in the next couple of days which provides guidance on setting up and configuring KEDA on large clusters -- ones with a high number of deployments.
Your point is interesting; it's true that updating the fallback all the time when the feature is disabled doesn't make sense. Checking the code, I have noticed that we can be updating the value even though there isn't any change. Checking the fallback logic, we are calling Lines 115 to 129 in f2d86a8
I think that we can improve that logic to reduce the calls to the API server. @zroubalik @dttung2905 @wozniakjan WDYT? |
That is a pretty good optimization, especially given there is already Line 116 in f2d86a8, and the check could be as simple as !reflect.DeepEqual()
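To make that concrete, here is a rough sketch of what such a change-detection guard could look like, assuming KEDA's v1alpha1 types and a controller-runtime client; the helper name is illustrative, not the actual KEDA code:

```go
package fallback

import (
	"context"
	"reflect"

	kedav1alpha1 "github.com/kedacore/keda/v2/apis/keda/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setHealthIfChanged writes the health entry back only when it differs from
// what is already stored in the ScaledObject status, skipping the API server
// round trip for no-op updates. (Sketch only, assuming the v1alpha1 API shape.)
func setHealthIfChanged(ctx context.Context, c client.Client, so *kedav1alpha1.ScaledObject, metricName string, health kedav1alpha1.HealthStatus) error {
	if current, ok := so.Status.Health[metricName]; ok && reflect.DeepEqual(current, health) {
		// Nothing changed since the last reading; avoid a redundant update call.
		return nil
	}
	if so.Status.Health == nil {
		so.Status.Health = map[string]kedav1alpha1.HealthStatus{}
	}
	so.Status.Health[metricName] = health
	return c.Status().Update(ctx, so)
}
```

The comparison happens entirely in memory, so a no-op update costs nothing against the client-side rate limiter.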
Thank you @bharathguvvala, that would be terrific. Also, feel free to introduce code improvements; generally, smaller PRs are easier to get merged.
@wozniakjan @JorTurFer I have made a change to disable health status updates if the fallback for the ScaledObject is not configured. I will go ahead with adding tests if this change is okayed in terms of intent. On top of this, we could add logic to avoid redundant updates where a fallback is configured in a ScaledObject.
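A minimal sketch of that guard, under the same assumptions about KEDA's v1alpha1 types as above (the predicate name is hypothetical):

```go
package fallback

import (
	kedav1alpha1 "github.com/kedacore/keda/v2/apis/keda/v1alpha1"
)

// shouldUpdateHealthStatus gates the write sketched earlier: when no fallback
// is configured on the ScaledObject, the health status is never persisted, so
// the metrics read path makes no extra call to the API server for it.
func shouldUpdateHealthStatus(so *kedav1alpha1.ScaledObject) bool {
	return so.Spec.Fallback != nil
}
```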
@JorTurFer @wozniakjan Taking a step back, I was wondering whether this error count information needs to be written back to the ScaledObject status at all. Since it is transient information used to implement a sort of circuit breaking, isn't it more appropriate to keep it inside the operator (in memory) and only update the condition whenever it flips? I presume the information is stored in the status only to make it persistent and survive operator restarts, but that's also expensive, considering the update is performed in the GetMetrics read path and can potentially affect GetMetrics latencies, which in turn can affect the autoscaler SLOs. If the same error count can be reconstructed from scratch based on new errors, why not avoid persisting it?
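For illustration only, the in-memory bookkeeping suggested here could look roughly like the following, with a status write happening only when the condition flips; all names are hypothetical and this is not KEDA's implementation:

```go
package fallback

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// fallbackTracker keeps per-ScaledObject failure counts in operator memory
// instead of persisting every increment to the ScaledObject status.
type fallbackTracker struct {
	mu       sync.Mutex
	failures map[types.NamespacedName]int32
}

func newFallbackTracker() *fallbackTracker {
	return &fallbackTracker{failures: map[types.NamespacedName]int32{}}
}

// recordFailure increments the count and reports true only when the count
// first reaches the threshold, so the caller updates the status condition
// once per flip instead of on every GetMetrics call.
func (t *fallbackTracker) recordFailure(key types.NamespacedName, threshold int32) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failures[key]++
	return t.failures[key] == threshold
}

// recordSuccess resets the count and reports true if the fallback had been
// active, i.e. the condition flips back to healthy.
func (t *fallbackTracker) recordSuccess(key types.NamespacedName, threshold int32) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	wasActive := t.failures[key] >= threshold
	t.failures[key] = 0
	return wasActive
}
```

The trade-off, as noted, is that the counts reset to zero whenever the operator restarts.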
What do you mean? How would you reconstruct it? It's an interesting point.
Report
At our organization (Flipkart), we manage large k8s clusters where at least 300 ScaledObjects are present. Lately, when we onboarded a bunch of new workloads to be managed by ScaledObjects, we saw substantial degradation of the HPA functionality, where scaling events weren't happening in line with changes to the corresponding compute metrics.
A bit of background here -- recently we upgraded k8s from 1.22.x to 1.25.x and KEDA from 2.2.x to 2.10.x in order to migrate to HPA v2 as well as move to a newer KEDA and bring in all the goodness of an upgrade. Post upgrade everything was fine, with the ScaledObject count at a cluster level around 200+. As mentioned above, when we onboarded a bunch of new ScaledObjects (50+ in count) we started seeing substantial degradation of the autoscaling pipeline, which impacted all the workloads since they were not scaled out/down in time. Based on our observations, scaling generally took more than 10 minutes.
Upon troubleshooting and profiling a bunch of components such as the controller manager, k8s metrics server, KEDA metrics adapter, and operator, we saw that the external metrics served by the KEDA metrics adapter were latent, with p99 latencies around ~5s. After profiling and a code walkthrough, we realized that the metrics scraping model changed in one of the recent releases of KEDA: metrics are now fetched from the operator (via a gRPC client) as opposed to the earlier design where they were fetched by the metrics adapter. We saw that in this flow, calls are made to the K8s API server to update the fallback health status of the ScaledObjects. These calls were getting rate limited on the client side, causing higher read latencies for the external metrics API. We realized that the k8s clients used by the reconciliation controllers and the metrics scraper are the same and share the default rate limits. Upon increasing these rate limits from the default (20 QPS and 30 burst), the issue was resolved.
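For context on why the reconcilers and the metrics read path throttle each other: client-go applies a client-side token-bucket limiter configured through rest.Config, and an operator built on controller-runtime typically builds everything from one such config. A minimal sketch of where those knobs live (the values are examples only; how KEDA exposes these settings depends on the version):

```go
package example

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newRaisedLimitClient builds a client with higher client-side rate limits.
// Everything that shares this rest.Config draws from the same token bucket,
// so the defaults can throttle the metrics reads under load.
func newRaisedLimitClient() (client.Client, error) {
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = 50    // sustained requests per second allowed on the client side
	cfg.Burst = 100 // short-term burst allowance (example values only)
	return client.New(cfg, client.Options{})
}
```

Raising QPS/Burst trades a larger allowed request rate against the API server for lower latency on the metrics path.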
While we weren't aware of this design change, which essentially merged the scraping component into the operator, we were confused as to why k8s API update calls are made as part of the metrics read path. We feel a couple of changes can be made to help KEDA users deploy and support KEDA on large-scale clusters: documenting how to tune the client-side rate limits, and avoiding redundant API update calls in the metrics read path.
Expected Behavior
Actual Behavior
Delays in HPA pipelines
Steps to Reproduce the Problem
Deploy with default configuration and create 300+ scaledobjects.
KEDA Version
2.10.1
Kubernetes Version
< 1.26
Platform
Other
Scaler Details
No response
Anything else?
No response