better documentation around scaling strategy #1463

jason-berk-k1x · 2024-08-31T04:08:14Z

I'm using ScaledJob and I'm having a lot of trouble understanding the scaling strategies and how they differ.

my ScaledJob is triggered from an Azure Service Bus Queue and is configured like so:

job:
  paused: "false"
  activeDeadlineSeconds: 600
  pollingInterval: 30
  minReplicaCount: 0 
  maxReplicaCount: 3
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 10
  scalingStrategy: "eager"
  trigger:
      queueName: some-queue-name
      messageCount: "1"
      auth: my-cluster-trigger-auth

my goal is to have a ScaledJob that is triggered to run when messages land on the queue, with up to three Jobs running in parallel. My job:

gets a message from the queue and locks it (at least, that's what my engineers are telling me)
processes the message to completion
"completes" the message (so it's no longer in the queue)
exits cleanly

on the off chance the processing fails or the pod dies, the lock will expire (eventually) and a different job will be started to process the message again. Eventually, if no job can process the message, we'll hit the max delivery count and the message will be dead lettered.
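To make the lifecycle above concrete, here is a rough in-memory sketch of peek-lock semantics as I understand them. It is not the Azure SDK: `PeekLockQueue`, `lock_seconds`, `max_delivery_count` and `count()` are hypothetical names for illustration, and the choice to keep counting a locked message in `count()` is an assumption of this model, not something I have confirmed for Service Bus.

```python
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    delivery_count: int = 0
    locked_until: float = 0.0  # hidden from other receivers until this time


class PeekLockQueue:
    """Hypothetical in-memory model of a peek-lock queue (not a real SDK class)."""

    def __init__(self, lock_seconds=60.0, max_delivery_count=3):
        self.lock_seconds = lock_seconds
        self.max_delivery_count = max_delivery_count
        self.messages = []
        self.dead_letters = []

    def send(self, body):
        self.messages.append(Message(body))

    def receive(self, now):
        """Lock and return the first available message, or None."""
        for msg in list(self.messages):
            if msg.locked_until > now:
                continue  # still locked by another receiver
            if msg.delivery_count >= self.max_delivery_count:
                # Too many failed deliveries: move to the dead-letter list.
                self.messages.remove(msg)
                self.dead_letters.append(msg)
                continue
            msg.delivery_count += 1
            msg.locked_until = now + self.lock_seconds
            return msg
        return None

    def complete(self, msg):
        """Remove a successfully processed message from the queue."""
        self.messages.remove(msg)

    def count(self):
        # Assumption of this model: locked messages are still counted.
        return len(self.messages)
```

In this model, a second receiver calling `receive()` while the lock is held gets nothing back, yet `count()` still reports 1, which is the situation that would leave extra jobs sitting idle.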

with both the accurate and eager strategies, when I drop a message on the queue, I see a job start within 30 seconds (as expected). Again, my understanding is that the message is then locked, but:

thirty seconds later, after the next poll, another job starts up, tries to pull a message from the queue, and just sits idle, blocking while it waits for a message
another thirty seconds later, yet another job starts up and again just sits idle, blocking while it waits for a message

meanwhile, the only job actually doing any work is the first job, but now I'm at three running jobs: one processing a message and the other two just sitting around waiting. Eventually either a message comes in and one of those two idle jobs grabs it, or no messages come in and the idle job hits activeDeadlineSeconds and shows up as a Failed job.

I see the same behavior when using accurate, except that after the idle jobs time out, more jobs are started. It looks like there are always three running jobs, even overnight while nothing is in the queue: every ten minutes one job "Fails" and another job starts. With eager, once the idle jobs time out, new ones are not created while the queue is empty.

also, in the docs for scaling strategy, I see:

accurate If the scaler returns queueLength (number of items in the queue) that does not include the number of locked messages, this strategy is recommended. Azure Storage Queue is one example. You can use this strategy if you delete a message once your app consumes it.
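For comparison, here is a rough simulation of how many jobs each strategy would create per poll. The formulas are my reading of the KEDA scaling strategy docs and should be treated as an assumption, not as the actual implementation. The function names are hypothetical. The simulation also assumes that the reported queue length stays at 1 while the message is locked, that no job is pending, and that no job finishes during the simulated polls.

```python
import math


def max_scale(queue_length, target, max_replicas):
    """Jobs the queue length alone would justify, capped at max_replicas."""
    return min(math.ceil(queue_length / target), max_replicas)


def jobs_to_create(strategy, queue_length, running, pending,
                   target=1, max_replicas=3):
    """Number of new jobs for one poll (formulas assumed from the docs)."""
    ms = max_scale(queue_length, target, max_replicas)
    if strategy == "default":
        n = ms - running
    elif strategy == "accurate":
        if ms + running > max_replicas:
            n = max_replicas - running
        else:
            n = ms - pending
    elif strategy == "eager":
        n = min(max_replicas - running - pending, ms)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return max(n, 0)


def simulate(strategy, polls=4, queue_length=1):
    """Running-job count after each poll, with one message stuck in the queue."""
    running = 0
    history = []
    for _ in range(polls):
        running += jobs_to_create(strategy, queue_length, running, pending=0)
        history.append(running)
    return history


for name in ("default", "accurate", "eager"):
    print(name, simulate(name))
```

Under these assumptions, accurate and eager both ramp to three running jobs over three polls (`[1, 2, 3, 3]`), matching what I see, while default stays at one (`[1, 1, 1, 1]`). If the scaler's count really does include locked messages, that alone would explain the extra idle jobs.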

so my questions are:

how exactly does one confirm if the scaler behaves this way?
why do those jobs get started long after the first job actually pulled the message and started processing it?
