Constant crashes in keda operator after deploying service controlled by scaledobject #4389
Comments
@reynoldsme I opened an issue with KEDA |
Another such panic
|
Hi, could you please clarify what you mean by:
What service? The workload that is targeted by scaleTargetRef? Could you paste here example of ScaledObject? |
Yes, that's the service I mean. Here's an example scaled object. Our use case for now is that the service targeted needs to know a lot of metrics about our cluster in order to drive traffic into a file processing pipeline only when traffic and usage are low, so we are trying to use KEDA to have a common interface to query the metrics with the external metrics client. The service was working fine when KEDA was not crashing. We were also able to query the metrics via Not posting the actual scaled object due to being work stuff, but I've tried to match the options as closely as I can. I also tried to add the annotation to pause the autoscaling with 2 replicas, but that didn't have any effect.
There's nothing special about the object other than the amount of triggers and the high query values meant to disable the scaling. But the min and max replicas are the same, so that shouldn't matter anyway. And all the metrics were working before the crashes in the KEDA controller. |
So you first deploy the ScaledObject and then the service? Also the Could you please do this:
Thanks |
I work with @martinmr Here is an example crash (the names of some namespaces and ScaledObjects have been replaced) after deploying the workload, ScaledObject, then recreating the keda-operator pod via
Please note that the ScaledObject It is also notable that the microservice targeted by the ScaledObject also queries these 42 metrics, but keda metrics server itself is stable. |
@reynoldsme that might be the issue. Btw are you sure that all scalers are defined correctly? My bet is that one of the 42 (yeah, that's an unusual count 😄 ) is failing. It would be great if you could confirm this theory, for example by starting with a ScaledObject that has only 1 trigger and adding one after another until you get the crash, and then trying just the scaler that caused the crash alone in the ScaledObject to double-check it is failing. |
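The incremental test suggested above can be sketched in Go. This is only an illustration of the procedure, not KEDA code: `firstFailingTrigger`, the `healthy` callback, and the trigger names are all hypothetical. In practice `healthy` would stand for "deploy a ScaledObject with these triggers and check that the operator stays up".

```go
package main

import "fmt"

// firstFailingTrigger adds triggers one at a time and returns the index of
// the trigger whose addition first makes the check fail, or -1 if every
// prefix passes. healthy is a hypothetical callback supplied by the caller.
func firstFailingTrigger(triggers []string, healthy func(active []string) bool) int {
	for i := range triggers {
		// Test the first i+1 triggers together, as in the manual procedure.
		if !healthy(triggers[:i+1]) {
			return i
		}
	}
	return -1
}

func main() {
	// Hypothetical trigger names, for illustration only.
	triggers := []string{"cpu", "memory", "datadog-a", "datadog-b"}

	// Stand-in check: pretend that any set containing "datadog-a" crashes.
	healthy := func(active []string) bool {
		for _, t := range active {
			if t == "datadog-a" {
				return false
			}
		}
		return true
	}

	fmt.Println("first failing trigger index:", firstFailingTrigger(triggers, healthy))
}
```

The suspect trigger would then be deployed alone in a ScaledObject to double-check it, since a crash that only appears with many triggers points to something other than a single bad definition.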
@zroubalik I did as you suggested, but it's not our configs. I could get it working again (I can query the metrics) after removing some triggers and redeploying, but as soon as I deploy the target deployment it breaks again. This issue is on KEDA's end. Getting it to work again is not consistent. It's currently stuck in a crash loop with this error:
|
We updated to KEDA 2.10.0 and are seeing a new crash:
All the crashes seem related to updating the status of the scaled object, which we do not control. The code is autogenerated so I can't even look at it. |
Thanks for the update. I am not saying the problem is on your side, it is obviously on KEDA's side. What I am saying is that one of your triggers' configuration in the ScaledObject is incorrect (maybe a misspelling, typo, wrong credentials...), which in the end is causing this problem, which it should not! Looking at the latest error, this is super weird. It is crashing in the autogenerated part of the code, as you mentioned. Could you please confirm that the crash on v2.10.0 that you pasted here is the very first after you deploy the ScaledObject? What is the number of triggers that works correctly for you? |
There's no wrong trigger. I managed to get it working again with a smaller number of triggers as part of the testing but it stopped after a while (fifteen or twenty minutes). I didn't deploy any changes in the interim. So I don't think this has to do with the config. The error I pasted above is the first one after the upgrade but I've seen others as well. @reynoldsme could you paste some more errors if you can? I don't have internet service today (typing this on my phone using roaming data). |
ok, I have performed the following steps:
|
We are having the same issue on different AKS clusters with different but similar ScaledObjects. I'm running This ScaledObject is working This ScaledObject is causing the Panic (they are identical, except they have different names)
|
From the code, it seems that the ScaledObject received by resolver.ResolveScaleTargetPodSpec is null, since the panic occurs when it tries to access a property after the object gets cast. I'm not sure why the object would be null or become null... I think null protection/detection is missing nonetheless. |
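A minimal Go sketch of the kind of guard described above, under stated assumptions: the `ScaledObject` and `ScaleTarget` types and the `targetName` function are simplified stand-ins invented for illustration, not KEDA's real types or the actual resolver code. It shows why a type assertion alone is not enough: a nil `*ScaledObject` stored in an interface still passes the assertion, and only the field access afterwards panics.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins; these are not KEDA's real types.
type ScaleTarget struct{ Name string }

type ScaledObject struct {
	Spec struct{ ScaleTargetRef *ScaleTarget }
}

// targetName returns the scale target's name, or an error instead of a panic
// when the object is of the wrong type, nil, or has no target reference.
func targetName(obj interface{}) (string, error) {
	so, ok := obj.(*ScaledObject)
	if !ok {
		return "", errors.New("object is not a *ScaledObject")
	}
	// A nil *ScaledObject wrapped in an interface passes the assertion above,
	// so it has to be checked separately before any field is read.
	if so == nil {
		return "", errors.New("ScaledObject is nil")
	}
	if so.Spec.ScaleTargetRef == nil {
		return "", errors.New("scaleTargetRef is not set")
	}
	return so.Spec.ScaleTargetRef.Name, nil
}

func main() {
	var missing *ScaledObject // typed nil, as a failed lookup might produce
	if _, err := targetName(missing); err != nil {
		fmt.Println("handled without panic:", err)
	}
}
```

Returning an error here would let a reconcile loop log and retry rather than crash the whole operator process.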
Thanks for the provided info, the failing SO is missing There might be some race condition in processing the SO. I will try to investigate later, will be away next two weeks, but will try to check it then. If you happen to find any more details in the meantime, please attach them here. |
@zroubalik I don't have anything more to share except that we had the same issue again yesterday. This time it was a different ScaledObject. After we deleted it, everything went back to normal. |
just happened to us too, deleting the scaled object and recreating it solved the issue |
I am working on the same project as @timown. This happened in production today. We tried with 2.10.1 and 2.9.x and still got the same issue. Here are the attached logs from the operator for your reference: "namespace": "", "name": "", "reconcileID": "973016f5-c838-44e9-82c6-f3d49afa52a7", "trigger.type": "external"} goroutine 398 [running]: |
@saurabhvagrawal @timown @djsly @reynoldsme @martinmr et al.: could you please confirm that the failing ScaledObject uses |
Is anybody here willing to test a patched version before the official release is out? |
it is, in our case it uses cpu, memory and external |
@zroubalik None of the ScaledObjects where we see this issue were using triggers of type We were only seeing this on ScaledObjects with triggers of type |
The |
In case anyone would like to try it out: To apply the fix KEDA Operator image needs to be changed to |
Sure @zroubalik. But before that, can we get the changelog/PR please to understand what is fixed. |
@zroubalik : kind ping on this. |
@saurabhvagrawal it's released KEDA 2.10.1 + following commit |
Will this fix be applied to keda 2.9.x too? We're seeing this problem since yesterday... EDIT: Situation update So it seems to be a race condition of some kind and kind of hard to reproduce... |
If we got a confirmation that it resolves the issue, then we can probably think about backporting it to 2.9 as well |
@saurabhvagrawal have you had a chance to try the fix? Or anybody else? |
it happened to us once (and we have 200 clusters) |
@timown with the fix? |
no no, sorry, with the latest official release |
I agree with @timown. @zroubalik: We are unable to reproduce the issue and are not sure how to verify the fix. But looking at your commit, do you think the nil pointer occurs because it is unable to get the ScaledObject resource, since the code checks whether it is nil and, if so, attempts to retrieve the corresponding ScaledObject? |
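The pattern asked about above (check for nil, and if so fetch the object again) can be sketched generically in Go. This is not the actual commit: `ensureScaledObject`, the `lookup` function type, and the one-field `ScaledObject` are hypothetical names used only to show the shape of such a fallback.

```go
package main

import (
	"errors"
	"fmt"
)

// ScaledObject is a hypothetical one-field stand-in, not KEDA's real type.
type ScaledObject struct{ Name string }

// lookup is a hypothetical fetch function, e.g. a wrapper around an API client.
type lookup func(name string) (*ScaledObject, error)

// ensureScaledObject returns the cached object when it is present and
// otherwise tries to fetch it by name, reporting an error rather than
// handing a nil pointer to the caller.
func ensureScaledObject(cached *ScaledObject, name string, fetch lookup) (*ScaledObject, error) {
	if cached != nil {
		return cached, nil
	}
	so, err := fetch(name)
	if err != nil {
		return nil, fmt.Errorf("fetching ScaledObject %q: %w", name, err)
	}
	if so == nil {
		return nil, errors.New("ScaledObject not found")
	}
	return so, nil
}

func main() {
	// Stand-in fetch that always succeeds, for illustration only.
	fetch := func(name string) (*ScaledObject, error) {
		return &ScaledObject{Name: name}, nil
	}
	so, err := ensureScaledObject(nil, "example", fetch)
	fmt.Println(so.Name, err)
}
```

Under this shape, a missing cached object would cost one extra lookup instead of a nil dereference, which fits a race where the object is not yet populated when it is first read.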
Report
Keda controller is constantly crashing after I deploy a new version of the service targeted by the scaled object.
It tends to work for a while but after deploying the service, no metrics can be queried. The Keda controller logs all spit out a bunch of errors, but all of them are related to the GetMetrics function.
Expected Behavior
No crashes
Actual Behavior
Constant crashes in the keda controller
Steps to Reproduce the Problem
Logs from KEDA operator
KEDA Version
2.9.2
Kubernetes Version
1.23
Platform
Amazon Web Services
Scaler Details
Datadog
Anything else?
The only thing weird about this scaler is that it has around 40 triggers. We are using this service to have a single interface to query the metrics provided by KEDA. I set the min/max replicas to 2. I even disabled autoscaling with 2 replicas, but that didn't help. But I don't think the scaledobject config is the issue because we can query the metrics for a little while.
Destroying keda and redeploying seemed to work for a while but it always breaks down around the time the service is deployed.