added resilience mechanism - if a request crashes or gets no response, m… #2277
base: master
Conversation
…ake sure it's canceled. Otherwise the task-executor would exit, since the liveness probe (in code) would fail.
Reviewable status: 0 of 2 files reviewed, 3 unresolved discussions (waiting on @Adir111)
core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):
const CONTAINERS = containers;
const RETRY_LIMIT = 2;
Should be configurable and have a meaningful name relating to what is being retried.
core/task-executor/lib/helpers/kubernetes.js line 22 at r1 (raw file):
return await Promise.race([
    fn(),
    new Promise((_, reject) => setTimeout(() => reject(new Error(`${label} timeout after ${this._defaultTimeoutMs}ms`)), this._defaultTimeoutMs))
]);
Consider slowing down the next attempt a little to avoid throttling the K8s API,
and make this backoff approach configurable.
core/task-executor/lib/helpers/kubernetes.js line 40 at r1 (raw file):
async init(options = {}) {
    log = Logger.GetLogFromContainer();
    this._defaultTimeoutMs = options.intervalMs;
Does options.intervalMs have a default in the config file? If not, it must have one.
Reviewable status: 0 of 2 files reviewed, 3 unresolved discussions (waiting on @Adir111 and @RonShvarz)
core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):
Previously, RonShvarz (Ron Shvarzburd) wrote…
Should be configurable and have a meaningful name relating to what is being retried.
Done.
Note - When tested, I never had a situation where more than 2 retries were needed. The second attempt always returned a response (probably the wait time allowed the request to get through; maybe before the retry the throttling of the API was too high, so some requests were ignored - I can't know the reason for sure).
core/task-executor/lib/helpers/kubernetes.js line 22 at r1 (raw file):
Previously, RonShvarz (Ron Shvarzburd) wrote…
Consider slowing down the next attempt a little to avoid throttling the K8s API,
and make this backoff approach configurable.
I don't think it's a good idea - currently it's configured to equal the interval time of the reconciler in the task-executor.
Currently, that's 1 second.
Let's say we make it 2 seconds.
The reconciler starts at 0 seconds.
An API request is sent at 0.1 sec (theoretically).
The API request never comes back.
After 2 seconds - at 2.1 sec - the API request is retried.
This time it comes back.
The reconciler finishes at, let's say, 2.5 seconds.
At 1 sec another interval should have occurred, and also at 2 sec. Both won't happen.
In the current situation, only one won't happen (since it will finish at 1.5 seconds).
In the worst case, when both tries fail, the reconciler will finish only at 4.1 seconds - skipping 4 intervals, compared to 2 intervals.
Note - usually, when I checked this behavior, the second call always came back.
If you want, I can add a configurable factor for the timer, defaulting to 1 (the wait would be defaultTimeoutMs * factor) - let me know your decision.
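To make that concrete, a tiny sketch of the proposed factor (the function name timeoutForAttempt is illustrative, not code from this PR):

// Sketch only: a configurable factor, defaulting to 1 so the current timing is unchanged.
// With defaultTimeoutMs = 1000 and factor = 1: attempt 1 waits 1000ms, attempt 2 waits 1000ms.
// With factor = 2: attempt 1 waits 1000ms, attempt 2 waits 2000ms (the backoff suggested above).
function timeoutForAttempt(defaultTimeoutMs, factor = 1, attempt = 1) {
    return defaultTimeoutMs * (factor ** (attempt - 1));
}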
core/task-executor/lib/helpers/kubernetes.js line 40 at r1 (raw file):
Previously, RonShvarz (Ron Shvarzburd) wrote…
Does options.intervalMs have a default in the config file? If not, it must have one.
yes:
config.intervalMs = process.env.INTERVAL_MS || '3000';
It's the interval time of the task-executor - the job requesting and creation process (the reconciler).
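For clarity, a minimal sketch (not the PR code; the class name and the parseInt fallback are illustrative assumptions) of how that default flows from the environment into the helper:

// Sketch only: guaranteeing a numeric default for intervalMs end to end.
const config = {};
config.intervalMs = process.env.INTERVAL_MS || '3000'; // default in the config file (note: a string)

class KubernetesApiSketch {
    async init(options = {}) {
        // Defensive fallback: even if the option is missing or non-numeric, use 3000ms.
        this._defaultTimeoutMs = parseInt(options.intervalMs, 10) || 3000;
    }
}

new KubernetesApiSketch().init({ intervalMs: config.intervalMs });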
…HPC/hkube into added_resilience_mechanism
Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @Adir111)
core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):
Previously, Adir111 (Adir David) wrote…
Done.
Note - When tested, I never had a situation where more than 2 retries were needed. The second attempt always returned a response (probably the wait time allowed the request to get through; maybe before the retry the throttling of the API was too high, so some requests were ignored - I can't know the reason for sure).
Still, retries should be configurable, even if it won't be configured any time soon.
Default it to the value you have found useful, and give it a meaningful name, since retries in the config file might be too ambiguous.
Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @Adir111 and @RonShvarz)
core/task-executor/lib/helpers/kubernetes.js line 11 at r1 (raw file):
Previously, RonShvarz (Ron Shvarzburd) wrote…
Still, retries should be configurable, even if it won't be configured any time soon.
Default it to the value you have found useful, and give it a meaningful name, since retries in the config file might be too ambiguous.
It is; it can be found in the config values under:
config.kubernetes.requestAttemptRetryLimit, with default value 2:
requestAttemptRetryLimit: formatter.parseInt(process.env.KUBERNETES_REQUEST_RETRY_LIMIT, 2),
/deploy
The task-executor has a liveness probe which checks that the interval occurred in the last 10 seconds (configurable), while the interval is supposed to happen every 1 second (configurable).
Sometimes, requests to the K8S API would never arrive, so the request the task-executor sent would never return; the interval code would then "freeze" and never end.
This would cause the liveness probe to fail, and thus the task-executor to exit.
In this PR, I have added a resilience mechanism against this problem: each request has 1 second to return, otherwise it is retried, and if it fails again the request is canceled and an error is thrown.
As far as I have seen, the second attempt always passed; probably there was a network problem when the request was first sent.
Issue - #2278
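A minimal sketch of the mechanism described above, assuming a timeout of intervalMs and a retry limit of requestAttemptRetryLimit (the wrapper name, and clearing the pending timer as the "cancel", are illustrative, not the exact code in this PR):

// Sketch only: race each K8S API call against a timeout, retry up to the limit,
// then throw so the reconciler interval ends instead of hanging forever.
async function callWithResilience(fn, label, { timeoutMs = 1000, retryLimit = 2 } = {}) {
    let lastError;
    for (let attempt = 1; attempt <= retryLimit; attempt += 1) {
        let timer;
        try {
            return await Promise.race([
                fn(),
                new Promise((_, reject) => {
                    timer = setTimeout(() => reject(new Error(`${label} timeout after ${timeoutMs}ms`)), timeoutMs);
                })
            ]);
        }
        catch (error) {
            lastError = error;
        }
        finally {
            clearTimeout(timer); // cancel the pending timer so it cannot fire later
        }
    }
    throw lastError;
}

// Hypothetical usage: wrap a client call so a lost request cannot freeze the interval.
// await callWithResilience(() => client.pods.get({ podName }), 'get pods', { timeoutMs: 1000, retryLimit: 2 });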