-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ExecutionTimeout to fail executions instead of jobs #3974
Conversation
Important Auto Review SkippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
This work is a pre-requirement for job queues as the orchestrator is currently failing jobs that take longer than ExecutionTimeout from the time they were submitted. ExecutionTimeout should only fail executions that have been running for too long, and not fail jobs. If a job haven't started and has been in the queue for too long, then this timeout should take effect as there will be other config to timeout of fail a job has been in the queue for too long or has been retried too many times. A side effect of this PR is that timed out executions will be retried by the scheduler on other nodes until one succeed or no more nodes to retry. This will be handled in next job queue related PRs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three major points of concern:
- Parallelism in Housekeeping has several issues, one of which the potential for panicking. Suggest adopting the workerpool package I mentioned instead of implementing our own.
- Don't embed
context.Context
in the housekeeping structure. - We need to be mocking time in our tests, especially as this area of the code base grows.
EventTopicJobScheduling models.EventTopic = "Scheduling" | ||
EventTopicJobSubmission models.EventTopic = "Submission" | ||
EventTopicJobScheduling models.EventTopic = "Scheduling" | ||
EventTopicExecutionTimeout models.EventTopic = "Exec Timeout" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid including spaces in topic names. Let's opt for Camel Case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Topics main use is improving readability of events and let users understand which component or lifecycle or job orchestration the event is related. Not an internal detail and not intended to be consumed to other applications.
As the goal is improve human readability, allowing spaces can be improve the user experience and will improve how we display them on the CLI and UI without having to truncate them or wrap them at weird palces
Co-authored-by: Forrest <6546409+frrist@users.noreply.github.com>
Not sure about the several issues. You pointed to one issue with a wrong call to
Fixed
Answered inline |
Here's the worker pool package: https://github.com/gammazero/workerpool The several issues were embedded context in structure, and the potential for panic with the wait group. The look to be addressed now will give another review in a sec. |
I've previously looked into the worker pool library you mentioned, and it doesn't enable waiting for housekeeping tasks before calling the iteration as complete and starting another one. Our requirements are pretty simple and I don't see a need for such a library at this point. The semaphore is limiting the number of concurrent tasks, which is what the workerpool library does, and the waitGroup is preventing iterations from overlapping with each other, which the workerpool library is missing. |
This work is a pre-requirement for job queues as the orchestrator is currently failing jobs that take longer than ExecutionTimeout from the time they were submitted. ExecutionTimeout should only fail executions that have been running for too long, and not fail jobs. If a job haven't started and has been in the queue for too long, then this timeout should take effect as there will be other config to timeout of fail a job has been in the queue for too long or has been retried too many times.
A side effect of this PR is that timed out executions will be retried by the scheduler on other nodes until one succeed or no more nodes to retry. This will be handled in next job queue related PRs