added resilience mechanism - if a request crashes or gets no response, m… #2277
base: master
Conversation
…ake sure it's canceled. Otherwise the task-executor would exit, since the liveness probe (in code) would fail.
Reviewable status: 0 of 2 files reviewed, 3 unresolved discussions (waiting on @Adir111)
core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):
const CONTAINERS = containers;
const RETRY_LIMIT = 2;
Should be configurable and have a meaningful name relating to what is being retried.
core/task-executor/lib/helpers/kubernetes.js line 22 at r1 (raw file):
return await Promise.race([
    fn(),
    new Promise((_, reject) => setTimeout(() => reject(new Error(`${label} timeout after ${this._defaultTimeoutMs}ms`)), this._defaultTimeoutMs))
]);
Consider slowing down the next attempt a little to avoid throttling the K8s API,
and make this backoff approach configurable.
core/task-executor/lib/helpers/kubernetes.js line 40 at r1 (raw file):
async init(options = {}) {
    log = Logger.GetLogFromContainer();
    this._defaultTimeoutMs = options.intervalMs;
Does options.intervalMs have a default in the config file? If not, it must have one.
Reviewable status: 0 of 2 files reviewed, 3 unresolved discussions (waiting on @Adir111 and @RonShvarz)
core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):
Previously, RonShvarz (Ron Shvarzburd) wrote…
Should be configurable and have a meaningful name relating to what is being retried.
Done.
Note - When tested, I never had a situation where more than 2 retries were needed. The second attempt always returned a response (probably the wait time allowed the request to get through; maybe before the retry the throttling of the API was too high, so some requests were ignored - I can't know the reason for sure).
core/task-executor/lib/helpers/kubernetes.js line 22 at r1 (raw file):
Previously, RonShvarz (Ron Shvarzburd) wrote…
Consider slowing down the next attempt a little to avoid throttling the K8s API,
and make this backoff approach configurable.
I don't think it's a good idea - currently it's configured to equal the interval time of the reconciler in the task-executor.
Currently, that's 1 second.
Let's say we make it 2 seconds.
The reconciler starts at 0 seconds.
An API request is sent at 0.1 sec (theoretically).
The API request never comes back.
After 2 seconds - at 2.1 sec - the API request is retried.
This time it comes back.
The reconciler finishes at, let's say, 2.5 seconds.
At 1 sec another interval should have occurred, and also at 2 sec. Both won't happen.
In the current situation, only one won't happen (since it will finish at 1.5 seconds).
In the worst case, when both tries fail, the reconciler will finish only at 4.1 seconds - skipping 4 intervals, compared to 2 intervals.
Note - usually, when I checked this behavior, the second call always came back.
If you want, I can add a configurable factor for the timer, defaulting to 1 (the wait would be defaultTimeoutMs * factor) - let me know your decision.
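To make that concrete, a tiny sketch of the proposed factor (the function name timeoutForAttempt is illustrative, not code from this PR):

// Sketch only: a configurable factor, defaulting to 1 so the current timing is unchanged.
// With defaultTimeoutMs = 1000 and factor = 1: attempt 1 waits 1000ms, attempt 2 waits 1000ms.
// With factor = 2: attempt 1 waits 1000ms, attempt 2 waits 2000ms (the backoff suggested above).
function timeoutForAttempt(defaultTimeoutMs, factor = 1, attempt = 1) {
    return defaultTimeoutMs * (factor ** (attempt - 1));
}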
core/task-executor/lib/helpers/kubernetes.js line 40 at r1 (raw file):
Previously, RonShvarz (Ron Shvarzburd) wrote…
Does options.intervalMs have a default in the config file? If not, it must have one.
yes:
config.intervalMs = process.env.INTERVAL_MS || '3000';
It's the interval time of the task-executor - the job requesting and creation process (the reconciler).
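For clarity, a minimal sketch (not the PR code; the class name and the parseInt fallback are illustrative assumptions) of how that default flows from the environment into the helper:

// Sketch only: guaranteeing a numeric default for intervalMs end to end.
const config = {};
config.intervalMs = process.env.INTERVAL_MS || '3000'; // default in the config file (note: a string)

class KubernetesApiSketch {
    async init(options = {}) {
        // Defensive fallback: even if the option is missing or non-numeric, use 3000ms.
        this._defaultTimeoutMs = parseInt(options.intervalMs, 10) || 3000;
    }
}

new KubernetesApiSketch().init({ intervalMs: config.intervalMs });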
…HPC/hkube into added_resilience_mechanism
Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @Adir111)
core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):
Previously, Adir111 (Adir David) wrote…
Done.
Note - When tested, I never had a situation where more than 2 retries were needed. The second attempt always returned a response (probably the wait time allowed the request to get through; maybe before the retry the throttling of the API was too high, so some requests were ignored - I can't know the reason for sure).
Still, retries should be configurable, even if it won't be configured any time soon.
Default it to the value you have found useful, and give it a meaningful name, since retries in the config file might be too ambiguous.
Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @Adir111 and @RonShvarz)
core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):
Previously, RonShvarz (Ron Shvarzburd) wrote…
Still, retries should be configurable, even if it won't be configured any time soon.
Default it to the value you have found useful, and give it a meaningful name, since retries in the config file might be too ambiguous.
It is; it can be found in the config values under:
config.kubernetes.requestAttemptRetryLimit, with default value 2:
requestAttemptRetryLimit: formatter.parseInt(process.env.KUBERNETES_REQUEST_RETRY_LIMIT, 2),
/deploy
The task-executor has a liveness probe which checks that the interval occurred in the last 10 seconds (configurable), while the interval is supposed to happen every 1 second (configurable).
Sometimes, requests to the K8S API would never arrive, so the request the task-executor sent would never return; the interval code would then "freeze" and never end.
This would cause the liveness probe to fail, and thus the task-executor to exit.
In this PR, I have added a resilience mechanism against this problem: each request has 1 second to return, otherwise it is retried, and if it fails again the request is canceled and an error is thrown.
As far as I have seen, the second attempt always passed; probably there was a network problem when the request was first sent.
Issue - #2278
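A minimal sketch of the mechanism described above, assuming a timeout of intervalMs and a retry limit of requestAttemptRetryLimit (the wrapper name, and clearing the pending timer as the "cancel", are illustrative, not the exact code in this PR):

// Sketch only: race each K8S API call against a timeout, retry up to the limit,
// then throw so the reconciler interval ends instead of hanging forever.
async function callWithResilience(fn, label, { timeoutMs = 1000, retryLimit = 2 } = {}) {
    let lastError;
    for (let attempt = 1; attempt <= retryLimit; attempt += 1) {
        let timer;
        try {
            return await Promise.race([
                fn(),
                new Promise((_, reject) => {
                    timer = setTimeout(() => reject(new Error(`${label} timeout after ${timeoutMs}ms`)), timeoutMs);
                })
            ]);
        }
        catch (error) {
            lastError = error;
        }
        finally {
            clearTimeout(timer); // cancel the pending timer so it cannot fire later
        }
    }
    throw lastError;
}

// Hypothetical usage: wrap a client call so a lost request cannot freeze the interval.
// await callWithResilience(() => client.pods.get({ podName }), 'get pods', { timeoutMs: 1000, retryLimit: 2 });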