Skip to content

Conversation

@Adir111
Copy link
Contributor

@Adir111 Adir111 commented Oct 27, 2025

Task-executor has a livesness probe which checks if the interval occured in the last 10 seconds (configurable), when the interval suppose to happen every 1 second (configurable).
Sometimes, requests to the K8S API would not reach, causing the request the task-executor sent to never return therefore the interval code would "freeze" and would never end.
This would make the liveness probe to fail, thus the task-executor to exit.
In this PR, I have added a mechanism for resilience against this problem - thanks to this mechanism, each request has 1 second to return otherwise it is retried, and if it fails again it will cancel the request and throw an error.
As far as I seen, in the second time it requests - it passed always. Probably when it first sent there was a network problem.

  • The time each requests has is dependent on the interval time (equals).
  • The retry count is 2, meaning if the same request failed twice - the error would be thrown.

Issue - #2278


This change is Reviewable

…ake sure its canceled.

Otherwise the task-executor would exit since livesness probe (in code) would fail.
@Adir111 Adir111 requested a review from RonShvarz October 27, 2025 11:03
@Adir111 Adir111 self-assigned this Oct 27, 2025
Copy link
Contributor

@RonShvarz RonShvarz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 3 unresolved discussions (waiting on @Adir111)


core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):

const CONTAINERS = containers;

const RETRY_LIMIT = 2;

Should be configureable and have a meaningful name relating to what is being retried


core/task-executor/lib/helpers/kubernetes.js line 22 at r1 (raw file):

                return await Promise.race([
                    fn(),
                    new Promise((_, reject) => setTimeout(() => reject(new Error(`${label} timeout after ${this._defaultTimeoutMs}ms`)), this._defaultTimeoutMs))

Consider slowing down the next atempt by alittle to avoid throttling the k8s api,
make this backoff approach configureable


core/task-executor/lib/helpers/kubernetes.js line 40 at r1 (raw file):

    async init(options = {}) {
        log = Logger.GetLogFromContainer();
        this._defaultTimeoutMs = options.intervalMs;

does options.intervalMs; have a default in the config file ? fi not, it must have

Copy link
Contributor Author

@Adir111 Adir111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 3 unresolved discussions (waiting on @Adir111 and @RonShvarz)


core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):

Previously, RonShvarz (Ron Shvarzburd) wrote…

Should be configureable and have a meaningful name relating to what is being retried

Done.

Note - When tested, I never has a situation that more then 2 retries were needed. The second one always returned a response (Probably, this wait time, allowed the request to be reachable, maybe before we retried the throttling of the api was too high so some requests were ignored, can't know for sure the reason of this)


core/task-executor/lib/helpers/kubernetes.js line 22 at r1 (raw file):

Previously, RonShvarz (Ron Shvarzburd) wrote…

Consider slowing down the next atempt by alittle to avoid throttling the k8s api,
make this backoff approach configureable

I dont think its a good idea - Currently, its configured to be as the interval time, of the reconciler in the task executor.
Currently, its 1 second.
Lets say we make it 2 seconds.
So reconciler starts at 0 seconds,
An API request is being sent at 0.1 sec (theoratically)
The API request never came back
After 2 seconds - 2.1 sec - the api request it retried
This time, it came back
Reconciler finishes - lets say 2.5 seconds

At 1 sec, another interval should have occured, and also at 2 sec. Both wont happen.

At the current situation, only 1 wont happen (since it will finish at 1.5 seconds).

In the worst case, when both tries fail, the reconciler will finish only at 4.1 seconds - skipping 4 intervals, compared to 2 intervals.

Note - usually, when checked the behavior of this - the second call always came back.

If you want - I can make a configurable value which is a factor for the timer - default to 1 (the time will be defaultTimeoutMs * factor) - Let me know what your decision is.


core/task-executor/lib/helpers/kubernetes.js line 40 at r1 (raw file):

Previously, RonShvarz (Ron Shvarzburd) wrote…

does options.intervalMs; have a default in the config file ? fi not, it must have

yes:
config.intervalMs = process.env.INTERVAL_MS || '3000';

Its the interval time of the task-executor - the jobs requesting and creating process (reconciler).

Copy link
Contributor

@RonShvarz RonShvarz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @Adir111)


core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):

Previously, Adir111 (Adir David) wrote…

Done.

Note - When tested, I never has a situation that more then 2 retries were needed. The second one always returned a response (Probably, this wait time, allowed the request to be reachable, maybe before we retried the throttling of the api was too high so some requests were ignored, can't know for sure the reason of this)

Still retries should be configureable, even if it wont be configured anytime,
default it with the value which you have found useful, and give it a meaningful name since retries in the config file might be to ambiguious

Copy link
Contributor Author

@Adir111 Adir111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @Adir111 and @RonShvarz)


core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):

Previously, RonShvarz (Ron Shvarzburd) wrote…

Still retries should be configureable, even if it wont be configured anytime,
default it with the value which you have found useful, and give it a meaningful name since retries in the config file might be to ambiguious

It is, can be found in the config values under:
config.kubernetes.requestAttemptRetryLimit, with default value 2:

requestAttemptRetryLimit: formatter.parseInt(process.env.KUBERNETES\_REQUEST\_RETRY\_LIMIT, 2),

@Adir111
Copy link
Contributor Author

Adir111 commented Nov 3, 2025

/deploy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants