Only fetching TaskManager's available tasks once per call to fillPool #61991
Conversation
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@elasticmachine merge upstream
pmuellr
left a comment
code LGTM, but I defer to @gmmorris :-)
This certainly simplifies the processing model, which is good because this stuff is fairly complex. And in theory, it could result in TM running fewer tasks over time, since it's no longer aggressively looking for new tasks the way it was before. I'd guess that's a wash, though: it would only matter for tasks that complete in a very short amount of time, which is unlikely to be the case for alerting and action tasks.
Were you able to run the perf tests to get some before/after numbers?
~~These changes can theoretically decrease task manager's throughput at the expense of making it more uniform. It's rather unlikely that this occurs because, as long as a task doesn't complete immediately, it will end up hitting this, which stops the claiming of additional tasks. It was primarily throwing off the metrics reported by https://github.com/elastic/kibana/tree/master/x-pack/test/plugin_api_perf/plugins/task_manager_performance because these tasks were completing immediately.~~

Upon giving this further thought, this can decrease the throughput of task manager when there's a conflict claiming jobs, as we potentially won't fully fill the queue... 🤔
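To make the trade-off under discussion concrete, here is a minimal sketch of the two claiming strategies; `claimAvailableTasks` and `taskPool` are hypothetical stand-ins, not the actual Kibana implementation:

```typescript
// Hypothetical stand-ins for illustration; not the actual Kibana code.
interface Task {
  id: string;
}
interface TaskPool {
  availableWorkers: number;
  run(tasks: Task[]): Promise<void>;
}
declare const taskPool: TaskPool;
declare function claimAvailableTasks(capacity: number): Promise<Task[]>;

// Before: fillPool looped, claiming again as soon as a batch was handed
// off, which bypassed the polling cadence that TaskPoller enforces.
async function fillPoolBefore(): Promise<void> {
  while (taskPool.availableWorkers > 0) {
    const tasks = await claimAvailableTasks(taskPool.availableWorkers);
    if (tasks.length === 0) {
      return; // nothing left to claim
    }
    await taskPool.run(tasks);
  }
}

// After: exactly one claim per poll cycle; the next claim waits for
// TaskPoller to fire again (every 3s by default, or via runNow).
async function fillPoolAfter(): Promise<void> {
  const tasks = await claimAvailableTasks(taskPool.availableWorkers);
  if (tasks.length > 0) {
    await taskPool.run(tasks);
  }
}
```

As the comment above notes, the loop only pays off for tasks that finish before the next poll tick; for longer-running tasks the two strategies behave the same.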
mikecote
left a comment
> Upon giving this further thought, this can decrease the throughput of task manager when there's a conflict claiming jobs, as we potentially won't fully fill the queue... 🤔
Conflicts when claiming tasks should no longer happen after we moved to update by query. We've deferred to updateByQuery to handle OCC, and the request only returns once it has reached the max docs size or the limit of available tasks. The sweepForClaimedTasks function that runs afterwards shouldn't see errors often.
LGTM but would also like @gmmorris's approval 👍
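For context, here is a rough sketch of the claim pattern described above, assuming the `@elastic/elasticsearch` client; the index name, query, and script are simplified placeholders rather than Task Manager's actual implementation:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// A sketch of claiming via updateByQuery. Version conflicts are tolerated
// ('proceed') and max_docs caps the claim at available worker capacity, so
// two Kibana instances racing for the same tasks don't surface as errors.
async function claimTasks(capacity: number, kibanaId: string) {
  return client.updateByQuery({
    index: '.kibana_task_manager',
    conflicts: 'proceed', // losing a race on a doc skips it; no error
    max_docs: capacity, // never claim more tasks than we have workers for
    refresh: true,
    body: {
      query: {
        bool: {
          must: [
            { term: { 'task.status': 'idle' } },
            { range: { 'task.runAt': { lte: 'now' } } },
          ],
        },
      },
      script: {
        lang: 'painless',
        source:
          "ctx._source.task.status = 'claiming'; ctx._source.task.ownerId = params.kibanaId;",
        params: { kibanaId },
      },
    },
  });
}
```

With `conflicts: 'proceed'`, documents that hit a version conflict are counted in the response's `version_conflicts` field rather than failing the request, which is why the sweep that runs afterwards rarely sees errors.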
If you spin up concurrent instances of Kibana, set
The downside to this approach is that you can end up with a lot of dead time between polling cycles. The plan, as I mentioned on our call last week, was to drop this while loop in favour of a reactive model where a worker that completes triggers a fresh poll for work, and I intended to keep the while loop there until that change is made. Merging this suggests a drop in that capacity, no?
That's interesting. When I ran 3 Kibana instances locally I couldn't see these clashes 🤔 If that's the case then it seems that removing that while loop now is premature, as it will mean even worse performance on TM's side. Happy to jump on a call to talk this over, as I might have missed something.
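For what it's worth, a minimal sketch of the reactive model mentioned above (all names here are hypothetical): rather than looping inside `fillPool`, a completing worker triggers a fresh poll, closing the dead time between polling cycles:

```typescript
import { EventEmitter } from 'events';

// Hypothetical helper: claims and runs a single batch of tasks.
declare function pollForWork(): Promise<void>;

const events = new EventEmitter();

// Keep the fixed 3s cadence as a baseline...
setInterval(() => events.emit('pollNow'), 3000);

// ...but also poll the moment a worker frees up, so completed work
// immediately pulls in more work instead of waiting for the next tick.
export function onTaskComplete(): void {
  events.emit('pollNow');
}

events.on('pollNow', () => {
  pollForWork().catch((err) => console.error('poll failed', err));
});
```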
gmmorris
left a comment
@elasticmachine merge upstream

@elasticmachine merge upstream
During my performance testing using SIEM's rules, these changes didn't have any impact on throughput when there was a large backlog of tasks to be run and only a single instance of Kibana claiming tasks.
@elasticmachine merge upstream
💛 Build succeeded, but was flaky
Test failures: Kibana Pipeline / kibana-oss-agent / Example plugin functional tests, test/examples/embeddables/adding_children.ts: embeddable explorer creating and adding children / Can create a new child

To update your PR or re-run it, just comment with:
Summary
The `TaskPoller` is responsible for determining when new tasks should be claimed; it does so every 3 seconds by default, or when a consumer explicitly calls `runNow`. Before these changes, `fillPool` used a `while` loop to keep claiming new tasks until either there were no new tasks to claim or the capacity of the `TaskPool` was exceeded. This caused `fillPool` to essentially violate the flow control which `TaskPoller` implemented.

When there's a backlog of tasks exceeding the capacity of the `TaskPool` that all finish immediately, for example the performance test tasks, this can cause the following sequence of events:

Before changes
or occasionally
After changes
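A minimal sketch of how the pieces fit together after this change, with names simplified from the actual source: `TaskPoller` owns the polling cadence, and each poll event maps to exactly one claim attempt:

```typescript
// Simplified names; a sketch of the post-change flow, not the real source.
declare function claimAvailableTasks(capacity: number): Promise<unknown[]>;
declare const taskPool: {
  availableWorkers: number;
  run(tasks: unknown[]): Promise<void>;
};

type PollHandler = () => Promise<void>;

class SimpleTaskPoller {
  private timer?: ReturnType<typeof setInterval>;

  constructor(
    private readonly intervalMs: number,
    private readonly onPoll: PollHandler
  ) {}

  start(): void {
    // Poll on a fixed cadence (3s by default in Task Manager).
    this.timer = setInterval(() => void this.onPoll(), this.intervalMs);
  }

  // Analogous to runNow: consumers can request an immediate poll.
  runNow(): Promise<void> {
    return this.onPoll();
  }

  stop(): void {
    if (this.timer) clearInterval(this.timer);
  }
}

// Each poll event performs exactly one claim; fillPool no longer loops.
const poller = new SimpleTaskPoller(3000, async () => {
  const tasks = await claimAvailableTasks(taskPool.availableWorkers);
  await taskPool.run(tasks);
});
poller.start();
```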
Checklist
Delete any items that are not applicable to this PR.
For maintainers
"Release Note: Resolving issue with the task manager continuing to claim tasks when there's a backlog of tasks to be ran, and the task completes immediately"