
Job-based Service Bus Scaler scales to too many instances #4554

Closed
eugen-nw opened this issue May 19, 2023 · 26 comments

Labels
bug Something isn't working

Comments

@eugen-nw commented May 19, 2023

Report

Say that I configure KEDA with minReplicaCount > 0. If I send Messages to the Queue, KEDA creates as many new Pods as there are Messages in the Queue, with no regard to the count of Jobs that are already running, i.e. those created because of minReplicaCount > 0.

Expected Behavior

Let's say that I configure KEDA to have 2 Jobs running permanently. If I send 5 Messages to the Queue, I'd expect KEDA to create only 3 new Pods. Instead it is creating 5 new Pods, so they match the count of Messages in the Queue. Below is the scaling behavior that the documentation at https://keda.sh/docs/2.9/concepts/scaling-jobs/ states.

[screenshot: scaling behavior description from the scaling-jobs documentation]

Actual Behavior

Please see above.

Steps to Reproduce the Problem

  1. Configure a KEDA Job deployment in a manner similar to the script below.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: aks-aci-boldiq-workforce-gozen-dev
  labels:
    app: aks-aci-boldiq-workforce-gozen-dev
    deploymentName: aks-aci-boldiq-workforce-gozen-dev
spec:
  jobTargetRef:
    template:
      spec:
        containers:  # this section is identical to that of a "kind: Deployment"
        - image: <removed>
          imagePullPolicy: Always
          name: boldiq-workforce-gozen-dev
          resources:
            requests:
              memory: 8G
              cpu: 4
            limits:
              memory: 8G
              cpu: 4
          env:
          - name: KEDA_SERVICEBUS_CONNECTIONSTRING_GOZEN_DEV
            value: <removed>
        nodeSelector:
          kubernetes.io/os: windows
        tolerations:
        - key: virtual-kubelet.io/provider
          operator: Exists
        - key: azure.com/aci
          effect: NoSchedule
        imagePullSecrets:
          - name: docker-registry-secret
        nodeName: virtual-kubelet
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
  pollingInterval: 1  # 1 second polling for max. responsiveness
  minReplicaCount: 2  # keeping two instances running permanently in order to improve low loads' performance
  maxReplicaCount: 10
  triggers:
  - type: azure-servicebus
#    metricType: Value  # The default AverageValue with messageCount: '1' starts up a new Container for each Message in the Queue. We want that for responsiveness.
    metadata:
      queueName: gozen-dev-requests
      connectionFromEnv: KEDA_SERVICEBUS_CONNECTIONSTRING_GOZEN_DEV
      messageCount: '1'
  2. Deploy the script and check the count of Pods created. It should be 2.

  3. Send N Messages into the Queue.

  4. Check the count of Pods created. It will be N + 2.

Logs from KEDA operator

Please email edaroczy@boldiq.com for the .ZIP file.

KEDA Version

2.10.1

Kubernetes Version

1.25

Platform

Microsoft Azure

Scaler Details

Azure Service Bus

Anything else?

AKS 1.25.6
KEDA 2.10.2
The Containers run on the virtual-node-aci-linux virtual node.

@JorTurFer (Member)

Hi,
I believe the problem could be related to the short pollingInterval and the pod statuses. Since KEDA checks every second, the pods may not yet be in a running state, so KEDA thinks there are missing jobs.
You can try increasing the pollingInterval or setting more states in pendingPodConditions; a sketch follows below the screenshot.
[screenshot: pendingPodConditions section of the ScaledJob documentation]
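A minimal sketch of that tuning, assuming the default scaling strategy (the interval and the condition names below are illustrative, not taken from the manifest above):

spec:
  pollingInterval: 10        # poll less aggressively than every second
  scalingStrategy:
    # Pods in these states still count as pending, so KEDA does not
    # treat still-starting Pods as missing Jobs.
    pendingPodConditions:
      - "Ready"
      - "PodScheduled"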

@eugen-nw (Author)

pollingInterval should have no relationship to the count of Pods that are already running. If I have 2 Pods already running and 5 Messages in the Queue, then I need the scale-out to fire up only 3 new Pods.

@JorTurFer (Member)

Could you enable the debug logs and share them? The operator logs in debug mode expose the queue length and the current job count.

@eugen-nw (Author) commented May 24, 2023 via email

@JorTurFer (Member)

#4541 (comment)

@eugen-nw (Author)

I have the bandwidth now to address this issue. What exactly would you like me to do? Perhaps the steps below?

  1. Set minReplicaCount: 1 and the maxReplicaCount parameter for the Job in the deployment script, then deploy the script.
  2. Verify that the single minReplicaCount Pod has started up.
  3. Send a Message to the Queue and verify that there are 2 Pods running instead of one.
  4. Provide the current log of the keda-operator-* Pod that resides in the keda namespace.

The behavior I'd expect is that if I already have a Job running and I send a Message into the Queue, a second Job will not start up; instead, the currently running Job handles that one Message.

@JorTurFer (Member) commented May 31, 2023

3. Send a Message to the Queue and verify that there are 2 Pods running instead of one.

I think that this shouldn't happen.

The behavior I'd expect is that if I already have a Job running and I send a Message into the Queue, a second Job will not start up; instead, the currently running Job handles that one Message.

This is exactly the behavior I'd expect. Isn't this happening?

@eugen-nw (Author)

@JorTurFer No, it does not happen. If I have one Pod running, as per the minReplicaCount setting, and I then send a Message, I see a second Pod starting up.

I've tried it as well with minReplicaCount set to 2 and sending 5 Messages. The end result was that I got 7 Pods running, whereas only 5 would have been sufficient to process the 5 Messages.

@JorTurFer (Member)

@zroubalik, @tomkerkhove, is this behavior intended and I'm missing something, or is this a bug? I have checked the e2e tests and they cover this scenario.

@eugen-nw (Author) commented Jun 2, 2023

I thought about this a bit more and it may be a feature rather than a bug. Let's say that I configure a ScaledJob to have a minReplicaCount of 4. By this I express my desire to always have 4 Jobs on stand-by, ready to receive Messages. 2 Messages pop up, so two of my initial 4 Jobs become busy processing them and are no longer available. In response, the ScaledJob starts up two new Jobs immediately, in order to ensure that 4 Jobs will be available again soon.

Does this reasoning sound right to you guys?

@JorTurFer (Member)

Does this reasoning sound right to you guys?

I thought so; that's why I asked other teammates, because that's the behavior covered by the e2e tests. Maybe it's just a documentation gap, but I'm not sure.

@eugen-nw (Author) commented Jun 2, 2023

Thank you. Let's see what response we'll receive.

However, since there are tests that cover the behavior, it should be safe to update the documentation. And the behavior is indeed present; I've tested it several times in the past two weeks and it works very well :-))

@zroubalik (Member)

If you set minReplicaCount for a ScaledJob, then it is basically a minimum number of jobs (a base); anything else should trigger more jobs. See the PR: #3426. A worked example with the numbers from this issue follows below.
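To make that base semantics concrete, here is the arithmetic from this issue as a commented sketch (illustrative numbers, not taken from the linked PR):

# minReplicaCount is a floor of idle Jobs, not a scaling target:
#   minReplicaCount: 2    # idle Jobs kept on stand-by
#   queue length:    5    # pending Messages, with messageCount: '1'
#   Jobs created:    5    # one per pending Message
#   total running:   7    # 5 working + 2 idle, not max(5, 2) = 5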

@eugen-nw (Author) commented Jun 5, 2023

Thanks very much @zroubalik!

Would it be possible to enhance the documentation of minReplicaCount at https://keda.sh/docs/2.9/concepts/scaling-jobs/ to explain the scale-out behavior dictated by the minReplicaCount parameter? In its current state, the documentation only explains that minReplicaCount Jobs will be created by default.

[screenshot: minReplicaCount entry in the scaling-jobs documentation]

@JorTurFer (Member)

Would it be possible to enhance the documentation of minReplicaCount at keda.sh/docs/2.9/concepts/scaling-jobs to explain the scale-out behavior dictated by the minReplicaCount parameter?

It'd be amazing, because it's true that it could be a bit confusing. Would you open a PR in the docs with the change?

@eugen-nw (Author) commented Jun 5, 2023

I'll give it a try. My first open source contribution...

@zroubalik (Member)

I'll give it a try. My first open source contribution...

It's never too late to start 😄 Just fork the docs repo, create a new branch, add the information, and submit the PR. You might take some info or diagrams from the PR/issue I linked, if you find them useful.
Thanks 🙏

@eugen-nw (Author) commented Jun 6, 2023

Done: kedacore/keda-docs#1144

eugen-nw closed this as completed Jun 9, 2023
@LewisJackson1

@JorTurFer @zroubalik we were just reading the docs kindly added by @eugen-nw, and this really confused me. I can understand that someone may want this behaviour, but it feels like the expected behaviour here:

Let's say that I configure KEDA to have 2 Jobs running permanently. If I send 5 Messages to the Queue, I'd expect KEDA to create only 3 new Pods. Instead it is creating 5 new Pods, so they match the count of Messages in the Queue.

is going to be a more common use case, or at least desired by some users.

Scaling out too much will cost us a considerable amount of money as we're processing videos on GPU Nodes.

@eugen-nw (Author) commented Aug 16, 2023

You can limit the max. desired/allowed count of containers in the .yaml script; that will limit your expenses (see the sketch at the end of this comment). In your example you will get 5 Jobs created to handle your 5 Messages, plus 2 other Jobs on stand-by to handle whatever may come in, all of this once the 5 new Pods are up and functional.

My scale-out scenario has to accommodate sudden bursts in demand. The current operation mode enables me to have N containers (more or less) ready to immediately handle a burst.
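A minimal sketch of that cap (numbers illustrative; maxReplicaCount is the existing ScaledJob field, read here as a ceiling on how many Jobs KEDA will run):

spec:
  minReplicaCount: 2   # Jobs kept idle on stand-by
  maxReplicaCount: 5   # ceiling on concurrent Jobs, which caps spend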

@LewisJackson1

You can limit the max. desired / allowed count of containers in the .yaml script.

No matter what we set the max to, we're always going to be spinning up containers for no reason. If two items come into our queue, we don't need to spin up two additional Jobs, with their own GPU Nodes and the minimum charge that entails, when we already have two Jobs ready for them. If we set the maximum to the same as the minimum this wouldn't happen, but then we also would not be autoscaling.

My scale-out scenario has to accommodate sudden bursts in demand. The current operation mode enables me to have N containers (more or less) ready to immediately handle a burst.

I understand that this is a desirable use case for you and some others, but I doubt it's the behaviour most people would expect when they see this parameter (which is why this issue was created).

@JorTurFer (Member)

Hi @LewisJackson1,
So, you would like to always have minReplicaCount instances (let's say 2, for example), but when a job arrives you want one of those 2 to handle it, without extra instances being started to stay ready, right?
In that case, you want pre-warmed instances for the first jobs; but for subsequent jobs, is waiting acceptable? I mean, you already have some ready pods to process those jobs when there isn't any pending job. I'm probably missing something important in the middle, because I don't get your use case :(

If waiting is not a problem and you prefer to save as much money as possible, you can set minReplicaCount: 0 (or just not set it) and you will have 0 pending jobs.

@LewisJackson1 commented Aug 16, 2023
So, you would like to always have minReplicaCount instances (let's say 2, for example), but when a job arrives you want one of those 2 to handle it, without extra instances being started to stay ready, right?

Hello @JorTurFer, I'm not sure that I understand the question here, apologies!

In that case, you want pre-warmed instances for the first jobs; but for subsequent jobs, is waiting acceptable? I mean, you already have some ready pods to process those jobs when there isn't any pending job.

Yeah, if additional jobs came in beyond the minimum replicas, they would have to wait for scaling, and that's acceptable.

I guess the simplest way I can think of to illustrate this is to compare the behaviour to a ScaledObject (see the sketch at the end of this comment). If we configure a ScaledObject to track an SQS queue with 2 minimum replicas and 2 items enter the queue, the ScaledObject does not spin up 2 more Pods, is that correct?

We're looking at migrating a queue processor from ScaledObject to ScaledJob, and I'm just finding this inconsistency between the two defined behaviours quite weird. I think that we could work around this with a static Deployment that would always be warm, then set the ScaledJob to track additional queue items?
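For reference, a minimal sketch of the ScaledObject side of that comparison (the names, queue URL, and region are made up for illustration; authentication setup omitted):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: video-processor            # hypothetical
spec:
  scaleTargetRef:
    name: video-processor          # an existing Deployment
  minReplicaCount: 2               # floor of 2 Pods, which also consume messages
  maxReplicaCount: 10
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/000000000000/videos  # illustrative
      queueLength: "1"             # target messages per replica
      awsRegion: "eu-west-1"

With 2 messages in the queue and queueLength: "1", the HPA target is ceil(2 / 1) = 2 replicas, which the existing minimum already satisfies, so no extra Pods start.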

@JorTurFer (Member) commented Aug 16, 2023

We're looking at migrating a queue processor from ScaledObject to ScaledJob, and I'm just finding this inconsistency between the two defined behaviours quite weird.

Yes, you are right that they aren't consistent, but they aren't comparable either, IMHO. In a ScaledObject, the workload can process multiple items, so right after finishing one message the workload starts on the next without any cooldown. In a ScaledJob, your job usually takes a single message and ends, so after finishing the current message the pod terminates and KEDA spins up another job, which isn't instant. That's why the minimum replicas for a ScaledJob is the minimum number of replicas ready to work (idle).

This is an interesting discussion, and maybe the best place for it is a GH discussion, where other maintainers and any other community folk can give their 2 cents. Would you open a discussion about this?

In any case, to solve your use case you could create your own REST API (or gRPC server) with whatever business logic you want, and use the Metrics API Scaler (or External Scaler) to connect KEDA to it. With this approach you could set minReplicaCount: 0 and have your server report the desired number of instances at each moment; a sketch of that wiring is below.
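A minimal sketch of that wiring with the metrics-api scaler (the URL, service name, and valueLocation path are hypothetical; the endpoint that computes the desired Job count is yours to implement):

triggers:
- type: metrics-api
  metadata:
    targetValue: "1"                                           # one Job per reported unit
    url: "http://scaling-logic.default.svc/api/desired-jobs"   # hypothetical in-cluster service
    valueLocation: "desiredJobs"                               # GJSON path into the JSON response

KEDA polls the endpoint and sizes the Job count from the returned number, so the server can subtract already-warm instances before answering.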

@eugen-nw (Author) commented Aug 17, 2023

You may want to give the Job scale-out method some time to settle, and spend some time experimenting with both scale-out alternatives. Use Linux containers (vs. Windows) for faster Pod start-up times. Jobs will always handle arbitrarily long processing, should that be a concern. With ScaledObject scale-out you'll pay for unused capacity. The best scenario is to have no Pods running 24x7 and use ScaledJob to fire up Pods whenever necessary, if that setup accommodates your use cases.

I operate in the Azure cloud. Taking scale-out to the next level, I run no Pods in the Azure Kubernetes cluster itself but delegate them to the Azure Container Instances service by using a Virtual Kubelet. Thus we pay only for each second a Pod runs, and we can scale out indefinitely.

@LewisJackson1

This is an interesting discussion, and maybe the best place is in a GH discussion, where other maintainers and any other community folk can give their 2 cents. Would you open a discussion about this?

I've opened a discussion here: #4885

In ScaledJob, your job usually takes 1 single message and ends, so after finishing the current message, the pod finishes and KEDA spin up another job, which isn't instant. That's why the minimum replicas for ScaledJob is the minimum replicas ready to work (idle).

I feel like it is quite an opinionated stance for the scaler to assume that the user wants a buffer because their Jobs are slow to start up or terminate. I don't think there's that much difference between a Job and a persistent Pod; both have start-up latency, so the over-provisioning behaviour could also be useful there. I can understand that this might be desirable for some people, and it'd be great to have this behaviour available for both ScaledJob and ScaledObject as an opt-in/out.
