
Colocating metrics provider along with the operator causes HPA delays if not configured properly #5624

Closed
bharathguvvala opened this issue Mar 26, 2024 · 10 comments · Fixed by #5659
Labels
bug Something isn't working

Comments

@bharathguvvala
Contributor

bharathguvvala commented Mar 26, 2024

Report

At our organization (Flipkart), we manage large k8s clusters where at least 300 ScaledObjects are present. Recently, when we onboarded a bunch of new workloads to be managed by ScaledObjects, we saw substantial degradation of the HPA functionality: scaling events weren't happening in line with changes to the corresponding compute metrics.

A bit of background here -- we recently upgraded k8s from 1.22.x to 1.25.x and KEDA from 2.2.x to 2.10.x in order to migrate to HPA v2, as well as to move to a newer KEDA and bring in all the goodness of an upgrade. Post upgrade, everything was fine and the ScaledObject count at the cluster level was around 200+. As mentioned above, when we onboarded a bunch of new ScaledObjects (50+ in count), we started seeing substantial degradation of the autoscaling pipeline, which impacted all the workloads since they were not scaled out/down in time. Based on our observations, it generally took >10 minutes.

Upon troubleshooting and profiling a bunch of components such as the controller manager, the k8s metrics server, and the KEDA metrics adapter and operator, we saw that the external metrics served by the KEDA metrics adapter were slow, with p99 latencies around ~5s. After profiling and a code walkthrough, we realized that the metrics scraping model changed in one of the recent KEDA releases: metrics are now fetched from the operator (via a gRPC client), as opposed to the earlier design where they were fetched by the metrics adapter itself. We saw that in this flow, calls are made to the K8s API server to update a fallback health status on the ScaledObjects. These calls were getting rate limited on the client side, causing higher read latencies for the external metrics API. We realized that the k8s clients used by the reconciliation controllers and the metrics scraper are the same and share the default rate limits. Increasing these rate limits from the default (20 QPS and 30 burst) resolved the issue.
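For context, client-side throttling in a controller-runtime based operator comes from the QPS/Burst fields on the shared rest.Config (recent KEDA versions also expose operator startup flags to tune these; check the docs for your release). A minimal sketch, not KEDA's actual startup code, with hypothetical values:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// client-go throttles requests at cfg.QPS with bursts up to cfg.Burst.
	// Every controller and the metrics-serving path that share this config
	// also share the same client-side limiter.
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = 100   // hypothetical value for a cluster with 300+ ScaledObjects
	cfg.Burst = 200 // the defaults discussed in this issue are 20 QPS / 30 burst

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		panic(err)
	}
	_ = mgr // controllers would be registered here, then mgr.Start(...)
}
```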

We weren't aware of this design change, which effectively merged the scraping component into the operator, and we were confused as to why K8s API update calls are made as part of the metrics read path. We feel a couple of changes can be made to help KEDA users who deploy and support large-scale clusters:

  1. Perform the health status update calls only for ScaledObjects which have fallback behaviour defined. In our case no fallback was defined, yet the status update calls were still happening, which led to this issue.
  2. Since the k8s client used by the reconciliation controllers and the scraper component is the same, these flows are coupled without isolation: one flow can swarm the client and cause the other flows to get rate limited (client side). Either separate the flows to use different clients, or document and highlight this fact so that users are aware and can configure the rate limits on the KEDA operator accordingly.

Expected Behavior

  1. Perform the health status update calls only for ScaledObjects which have fallback behaviour defined, since that status information is useful only for the fallback workflow. In our case no fallback was defined, yet the status update calls were still happening, which led to this issue.
  2. Since the k8s client used by the reconciliation controllers and the scraper component is the same, these flows are coupled without isolation: one flow can swarm the client and cause the other flows to get rate limited (client side). Either separate the flows to use different clients, or document and highlight this fact so that users are aware and can configure the rate limits on the KEDA operator accordingly.

Actual Behavior

Delays in the HPA scaling pipeline.

Steps to Reproduce the Problem

Deploy KEDA with the default configuration and create 300+ ScaledObjects.

KEDA Version

2.10.1

Kubernetes Version

< 1.26

Platform

Other

Scaler Details

No response

Anything else?

No response

@bharathguvvala bharathguvvala added the bug Something isn't working label Mar 26, 2024
@bharathguvvala
Contributor Author

bharathguvvala commented Mar 26, 2024

I'm willing to participate in the discussion and to contribute code/documentation.

@JorTurFer
Member

Hello @bharathguvvala
Thanks for reporting the gap in the docs, and sorry for the problem :(

I think the problem should be addressed by improving our docs about this and how to solve it. I'd address this from the docs side because updating the CR is part of KEDA's operation (not dependent on the fallback) and the status is part of the CRD, so users can build checking systems on top of it.

Just to give some context about that change: the underlying controller added rate-limit support with quite restrictive default values (5 and 10 IIRC), and we increased those values for the most common user type (< 100 ScaledObjects) to prevent bursting the API server by mistake.

If you are willing to help with the documentation, it'd be awesome! Real user experiences are the best teachers for other folks

@bharathguvvala
Contributor Author

@JorTurFer For ScaledObjects that have not been configured with the fallback option, isn't it better to skip updating the fallback health status on every metrics read call, which would avoid redundant calls to the K8s API server? Instead, this condition could be set by the controller during ScaledObject reconciliation. That would also mean the fallback health status is only updated for ScaledObjects where fallback is enabled, while for the rest it is defaulted at ScaledObject reconciliation time.

@bharathguvvala
Contributor Author

Regarding the documentation, I'll raise a PR in the next couple of days which provides guidance on setting up and configuring KEDA on large clusters -- ones with a high number of deployments.

@JorTurFer
Member

@JorTurFer For ScaledObjects that have not been configured with the fallback option, isn't it better to skip updating the fallback health status on every metrics read call, which would avoid redundant calls to the K8s API server? Instead, this condition could be set by the controller during ScaledObject reconciliation. That would also mean the fallback health status is only updated for ScaledObjects where fallback is enabled, while for the rest it is defaulted at ScaledObject reconciliation time.

Your point is interesting; it's true that updating the fallback all the time when the feature is disabled doesn't make sense. Checking the code, I have noticed that we can be updating the value even though there isn't any change.

Checking the fallback logic, we are calling updateStatus, and there we don't check whether the status has really changed before patching the resource:

```go
func updateStatus(ctx context.Context, client runtimeclient.Client, scaledObject *kedav1alpha1.ScaledObject, status *kedav1alpha1.ScaledObjectStatus, metricSpec v2.MetricSpec) {
	patch := runtimeclient.MergeFrom(scaledObject.DeepCopy())
	if fallbackExistsInScaledObject(scaledObject, metricSpec) {
		status.Conditions.SetFallbackCondition(metav1.ConditionTrue, "FallbackExists", "At least one trigger is falling back on this scaled object")
	} else {
		status.Conditions.SetFallbackCondition(metav1.ConditionFalse, "NoFallbackFound", "No fallbacks are active on this scaled object")
	}
	scaledObject.Status = *status
	err := client.Status().Patch(ctx, scaledObject, patch)
	if err != nil {
		log.Error(err, "failed to patch ScaledObjects Status", "scaledObject.Namespace", scaledObject.Namespace, "scaledObject.Name", scaledObject.Name)
	}
}
```

I think that we can improve that logic to reduce the calls to the API server. @zroubalik @dttung2905 @wozniakjan WDYT?

@wozniakjan
Member

wozniakjan commented Apr 2, 2024

Checking the fallback logic, we are calling updateStatus, and there we don't check whether the status has really changed before patching

That is a pretty good optimization, especially given there is already DeepCopy available

patch := runtimeclient.MergeFrom(scaledObject.DeepCopy())

and the check could be as simple as !reflect.DeepEqual()
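For illustration, a minimal sketch of that guard, reusing the identifiers from the snippet above (the function name updateStatusIfChanged is hypothetical, and reflect would need to be imported):

```go
// Minimal sketch: compute the new status as before, but skip the
// Status().Patch() call entirely when nothing actually changed.
func updateStatusIfChanged(ctx context.Context, client runtimeclient.Client, scaledObject *kedav1alpha1.ScaledObject, status *kedav1alpha1.ScaledObjectStatus, metricSpec v2.MetricSpec) {
	patch := runtimeclient.MergeFrom(scaledObject.DeepCopy())
	if fallbackExistsInScaledObject(scaledObject, metricSpec) {
		status.Conditions.SetFallbackCondition(metav1.ConditionTrue, "FallbackExists", "At least one trigger is falling back on this scaled object")
	} else {
		status.Conditions.SetFallbackCondition(metav1.ConditionFalse, "NoFallbackFound", "No fallbacks are active on this scaled object")
	}

	// Nothing to do if the stored status already matches the computed one.
	if reflect.DeepEqual(scaledObject.Status, *status) {
		return
	}

	scaledObject.Status = *status
	if err := client.Status().Patch(ctx, scaledObject, patch); err != nil {
		log.Error(err, "failed to patch ScaledObjects Status", "scaledObject.Namespace", scaledObject.Namespace, "scaledObject.Name", scaledObject.Name)
	}
}
```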

Regarding the documentation, I'll raise a PR in the next couple of days which provides guidance on setting up and configuring KEDA on large clusters -- ones with a high number of deployments.

Thank you @bharathguvvala, that would be terrific. Also, feel free to introduce code improvements; generally, smaller PRs are easier to get merged.

@bharathguvvala
Contributor Author

@wozniakjan @JorTurFer I have made a change to disable health status updates if fallback is not configured for the ScaledObject (a rough sketch of the idea is below). I will go ahead and add tests if the intent of this change is approved. On top of this, we could add logic to avoid redundant updates where a fallback is configured in a ScaledObject.
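A rough sketch of that guard, not the exact diff in the linked PR; the wrapper name updateHealthStatus is hypothetical, and it assumes the ScaledObject spec exposes the optional fallback section as Spec.Fallback:

```go
// Rough sketch: skip the health-status bookkeeping on the metrics read path
// when the ScaledObject has no fallback configured, since the fallback
// health status is never consumed in that case.
func updateHealthStatus(ctx context.Context, client runtimeclient.Client, scaledObject *kedav1alpha1.ScaledObject, status *kedav1alpha1.ScaledObjectStatus, metricSpec v2.MetricSpec) {
	if scaledObject.Spec.Fallback == nil {
		// No fallback section -> no need to patch the status on every GetMetrics call.
		return
	}
	updateStatus(ctx, client, scaledObject, status, metricSpec)
}
```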

@bharathguvvala
Contributor Author

@JorTurFer @wozniakjan Taking a step back, I was wondering whether this error count information needs to be written back to the ScaledObject status at all. Since it is transient information used to implement a form of circuit breaking, isn't it more appropriate to keep it inside the operator (in memory) and only update the condition whenever it flips?

I presume this information is stored in the status only to make it persistent so that it survives operator restarts, but that is also expensive, given that the update is performed in the GetMetrics read path and can therefore affect GetMetrics latencies, which in turn can affect the autoscaler SLOs. If the same error count information can be reconstructed from scratch based on new errors, why not avoid persisting it? A hypothetical sketch of the idea is below.
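To make the idea concrete, a hypothetical in-memory tracker (none of these names exist in KEDA today); the operator would patch the fallback condition only when RecordResult reports a flip:

```go
package tracker

import "sync"

// healthTracker keeps per-ScaledObject failure counts in the operator's
// memory instead of persisting them in the ScaledObject status.
type healthTracker struct {
	mu       sync.Mutex
	failures map[string]int32 // keyed by "namespace/name"
}

func newHealthTracker() *healthTracker {
	return &healthTracker{failures: map[string]int32{}}
}

// RecordResult updates the failure count for one ScaledObject and returns
// true only when the healthy/unhealthy state flips -- the only moment a
// condition update (and hence an API server call) would be needed.
func (t *healthTracker) RecordResult(key string, failureThreshold int32, scrapeErr error) bool {
	t.mu.Lock()
	defer t.mu.Unlock()

	wasHealthy := t.failures[key] < failureThreshold
	if scrapeErr != nil {
		t.failures[key]++
	} else {
		t.failures[key] = 0
	}
	isHealthy := t.failures[key] < failureThreshold
	return wasHealthy != isHealthy
}
```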

@JorTurFer
Member

If the same error count information can be reconstructed from scratch based on new errors, why not avoid persisting it?

What do you mean? How would you reconstruct it? It's an interesting point.


stale bot commented Jun 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Jun 7, 2024
@zroubalik zroubalik removed the stale All issues that are marked as stale due to inactivity label Jun 7, 2024