Skip to content

Conversation

@kobelb
Copy link
Contributor

@kobelb kobelb commented Mar 31, 2020

Summary

The TaskPoller is responsible for determining when new tasks should be claimed, and it does so by default every 3 seconds or when a consumer explicit calls runNow. Before these changes, fillPool was using a while loop to continue to claim new tasks until there were either no new tasks to claim or we exceeded the capacity of the TaskPool. This caused fillPool to essentially violate the flow-control which TaskPoller implemented.

When there's a backlog of tasks which exceed the capacity of the TaskPool that all finish immediately, for example the performance test tasks, this can cause the following sequence of events:

Before changes

TaskPoller:startWork
Claimed 4 tasks
Claimed 0 tasks
TaskPoller:endWork

or occasionally

TaskPoller:startWork
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 8 tasks
Claimed 1 tasks
Claimed 3 tasks
Claimed 0 tasks
TaskPoller:endWork

After changes

TaskPoller:startWork
Claimed 8 tasks
TaskPoller:endWork

Checklist

Delete any items that are not applicable to this PR.

For maintainers

"Release Note: Resolving issue with the task manager continuing to claim tasks when there's a backlog of tasks to be ran, and the task completes immediately"

@kobelb kobelb added release_note:fix v8.0.0 v7.7.0 Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v7.8.0 labels Mar 31, 2020
@kobelb kobelb requested review from gmmorris and mikecote March 31, 2020 15:08
@kobelb kobelb requested a review from a team as a code owner March 31, 2020 15:08
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@kobelb
Copy link
Contributor Author

kobelb commented Mar 31, 2020

@elasticmachine merge upstream

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM, but I defer to @gmmorris :-)

This certainly simplifies the processing model, which is good because this stuff is fairly complex. And in theory, could result in having TM run fewer tasks over time, since it's not aggressively looking for new tasks like it was before. I'd guess that's a wash though - it would only matter for tasks that complete in a very short amount of time, which is unlikely to be the case for alerting and action tasks.

Were you able to run the perf tests to get some before/after numbers?

@kobelb
Copy link
Contributor Author

kobelb commented Mar 31, 2020

Were you able to run the perf tests to get some before/after numbers?

~ These changes can theoretically decrease task manager's throughput and the expense of making it more uniform. It's rather unlikely that this occurs because as long as the task doesn't completely immediately, it will end up hitting this which stops the claiming of additional tasks. It was primarily throwing off the metrics of what was reported by https://github.com/elastic/kibana/tree/master/x-pack/test/plugin_api_perf/plugins/task_manager_performance because these tasks were completing immediately ~

Upon giving this further thought, this can decrease the throughput of task manager when there's a conflict claiming jobs, as we potentially won't fully fill the queue... 🤔

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon giving this further thought, this can decrease the throughput of task manager when there's a conflict claiming jobs, as we potentially won't fully fill the queue... 🤔

Conflicts when claiming tasks should no longer happen after we moved to update by query. We've deferred to updateByQuery to handle OCC and the request only returns when it reached the max docs size or the limit of available tasks. The function sweepForClaimedTasks that runs after shouldn't see errors often.

LGTM but would also like @gmmorris's approval 👍

@kobelb
Copy link
Contributor Author

kobelb commented Apr 1, 2020

The use of the updateByQuery can still create conflicts per https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html#docs-update-by-query

_update_by_query gets a snapshot of the index when it starts and indexes what it finds using internal versioning. That means you’ll get a version conflict if the document changes between the time when the snapshot was taken and when the index request is processed. When the versions match, the document is updated and the version number is incremented.

If you spin up concurrent instances of Kibana, set xpack.task_manager.polling_interval: 100, and then use the performance testing tasks plugin to insert 20 tasks which the different instances all fight over, you can see these conflicts occurring.

@gmmorris
Copy link
Contributor

gmmorris commented Apr 2, 2020

The downside to this approach is that it means that you an end up with a lot of dead time between polling cycles.
When experimenting with scaling TM it meant that at the very most I could only execute MAX_WORKER tasks per cycle.

The plan, as I mentioned on our call last week, was to drop this while loop in favour of a reactive model where a worker that completes will then trigger a fresh poll for work and I intended on keeping the while loop there until that change is made.

Merging this suggests a drop in that capacity, no?

@gmmorris
Copy link
Contributor

gmmorris commented Apr 2, 2020

If you spin up concurrent instances of Kibana, set xpack.task_manager.polling_interval: 100, and then use the performance testing tasks plugin to insert 20 tasks which the different instances all fight over, you can see these conflicts occurring.

That's interesting. When I ran 3 kibana locally I couldn't see these clashes 🤔
But that does align with what @nik9000 was telling me, so more likely I screwed up the test (it Was about two months into joining Elastic... so... I might have missed something).

If that's the case then it seems that removing that while loop now is premature as it will mean even worse performance on TM's side.

Happy to jump on call to talk this over, as I might have missed something.

@kobelb kobelb removed the v7.7.0 label Apr 3, 2020
Copy link
Contributor

@gmmorris gmmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a chat with @kobelb (see here) and as long as this doesn't get backported into 7.7, I'm 👍

@kobelb
Copy link
Contributor Author

kobelb commented Apr 10, 2020

@elasticmachine merge upstream

@kobelb
Copy link
Contributor Author

kobelb commented Apr 10, 2020

@elasticmachine merge upstream

@kobelb
Copy link
Contributor Author

kobelb commented Apr 10, 2020

During my performance testing using SIEM's rules, these changes didn't have any impact on the throughput when there was a large backlog of tasks to be ran and there was only a single instance of Kibana claiming tasks.

@kobelb
Copy link
Contributor Author

kobelb commented Apr 13, 2020

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💛 Build succeeded, but was flaky


Test Failures

Kibana Pipeline / kibana-oss-agent / Example plugin functional tests.test/examples/embeddables/adding_children·ts.embeddable explorer creating and adding children Can create a new child

Link to Jenkins

Standard Out

Failed Tests Reporter:
  - Test has failed 5 times on tracked branches: https://github.com/elastic/kibana/issues/58692

[00:00:00]       │
[00:00:34]         └-: embeddable explorer
[00:00:34]           └-> "before all" hook
[00:00:34]           └-> "before all" hook
[00:00:34]             │ debg navigating to settings url: http://localhost:6111/app/kibana#/management
[00:00:34]             │ debg navigate to: http://localhost:6111/app/kibana#/management
[00:00:34]             │ info [o.e.c.m.MetadataCreateIndexService] [kibana-ci-immutable-ubuntu-16-tests-xl-1586786030454222970] applying create index request using v1 templates []
[00:00:35]             │ info [o.e.c.m.MetadataCreateIndexService] [kibana-ci-immutable-ubuntu-16-tests-xl-1586786030454222970] [.kibana] creating index, cause [auto(bulk api)], templates [], shards [1]/[1], mappings []
[00:00:35]             │ info [o.e.c.m.MetadataMappingService] [kibana-ci-immutable-ubuntu-16-tests-xl-1586786030454222970] [.kibana/zLZ2W5t9Q4CQTtfh_EqynA] create_mapping
[00:00:36]             │ info [o.e.c.m.MetadataMappingService] [kibana-ci-immutable-ubuntu-16-tests-xl-1586786030454222970] [.kibana/zLZ2W5t9Q4CQTtfh_EqynA] update_mapping [_doc]
[00:00:36]             │ info [o.e.c.m.MetadataMappingService] [kibana-ci-immutable-ubuntu-16-tests-xl-1586786030454222970] [.kibana/zLZ2W5t9Q4CQTtfh_EqynA] update_mapping [_doc]
[00:00:36]             │ debg browser[DEBUG] http://localhost:6111/bundles/plugin/usageCollection/usageCollection.plugin.js 974:16 "Reporting application changed to undefined"
[00:00:36]             │ debg browser[DEBUG] http://localhost:6111/bundles/plugin/usageCollection/usageCollection.plugin.js 974:16 "Reporting application changed to undefined"
[00:00:36]             │ debg browser[DEBUG] http://localhost:6111/bundles/plugin/usageCollection/usageCollection.plugin.js 974:16 "Error Sending Metrics Report TypeError: Failed to fetch"
[00:00:36]             │ debg browser[INFO] http://localhost:6111/app/kibana?_t=1586787836626#/management 341 Refused to execute inline script because it violates the following Content Security Policy directive: "script-src 'unsafe-eval' 'self'". Either the 'unsafe-inline' keyword, a hash ('sha256-P5polb1UreUSOe5V/Pv7tc+yeZuJXiOi/3fqhGsU7BE='), or a nonce ('nonce-...') is required to enable inline execution.
[00:00:36]             │
[00:00:36]             │ debg browser[INFO] http://localhost:6111/bundles/app/kibana/bootstrap.js 9:19 "^ A single error about an inline script not firing due to content security policy is expected!"
[00:00:36]             │ debg ... sleep(700) start
[00:00:36]             │ debg ... sleep(700) end
[00:00:36]             │ debg returned from get, calling refresh
[00:00:37]             │ debg browser[INFO] http://localhost:6111/app/kibana?_t=1586787836626#/management 341 Refused to execute inline script because it violates the following Content Security Policy directive: "script-src 'unsafe-eval' 'self'". Either the 'unsafe-inline' keyword, a hash ('sha256-P5polb1UreUSOe5V/Pv7tc+yeZuJXiOi/3fqhGsU7BE='), or a nonce ('nonce-...') is required to enable inline execution.
[00:00:37]             │
[00:00:37]             │ debg browser[INFO] http://localhost:6111/bundles/app/kibana/bootstrap.js 9:19 "^ A single error about an inline script not firing due to content security policy is expected!"
[00:00:37]             │ debg currentUrl = http://localhost:6111/app/kibana#/management
[00:00:37]             │          appUrl = http://localhost:6111/app/kibana#/management
[00:00:37]             │ debg TestSubjects.find(kibanaChrome)
[00:00:37]             │ debg Find.findByCssSelector('[data-test-subj="kibanaChrome"]') with timeout=60000
[00:00:39]             │ debg TestSubjects.find(kibanaChrome)
[00:00:39]             │ debg Find.findByCssSelector('[data-test-subj="kibanaChrome"]') with timeout=10000
[00:00:39]             │ debg browser[INFO] http://localhost:6111/bundles/plugin/testbed/testbed.plugin.js 171:12 "Testbed plugin set up. uiProp: 'Accessible from client'"
[00:00:39]             │ debg browser[INFO] http://localhost:6111/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js 380:106115 "INFO: 2020-04-13T14:24:00Z
[00:00:39]             │        Adding connection to http://localhost:6111/elasticsearch
[00:00:39]             │
[00:00:39]             │      "
[00:00:39]             │ debg browser[DEBUG] http://localhost:6111/bundles/plugin/usageCollection/usageCollection.plugin.js 974:16 "Reporting user-agent."
[00:00:39]             │ debg browser[INFO] http://localhost:6111/bundles/plugin/testbed/testbed.plugin.js 179:12 "Testbed plugin started"
[00:00:39]             │ debg browser[DEBUG] http://localhost:6111/bundles/plugin/usageCollection/usageCollection.plugin.js 974:16 "Reporting application changed to management"
[00:00:39]             │ debg ... sleep(501) start
[00:00:40]             │ debg ... sleep(501) end
[00:00:40]             │ debg in navigateTo url = http://localhost:6111/app/kibana#/management?_g=()
[00:00:40]             │ debg TestSubjects.exists(statusPageContainer)
[00:00:40]             │ debg Find.existsByDisplayedByCssSelector('[data-test-subj="statusPageContainer"]') with timeout=2500
[00:00:42]             │ debg --- retry.tryForTime error: [data-test-subj="statusPageContainer"] is not displayed
[00:00:43]             │ debg click "Embeddable explorer" app link
[00:00:43]             │ debg TestSubjects.find(navDrawer)
[00:00:43]             │ debg Find.findByCssSelector('[data-test-subj="navDrawer"]') with timeout=10000
[00:00:48]           └-: creating and adding children
[00:00:48]             └-> "before all" hook
[00:00:48]             └-> "before all" hook
[00:00:48]               │ debg TestSubjects.click(embeddablePanelExamplae)
[00:00:48]               │ debg Find.clickByCssSelector('[data-test-subj="embeddablePanelExamplae"]') with timeout=10000
[00:00:48]               │ debg Find.findByCssSelector('[data-test-subj="embeddablePanelExamplae"]') with timeout=10000
[00:00:48]             └-> Can create a new child
[00:00:48]               └-> "before each" hook: global before each
[00:00:48]               │ debg TestSubjects.click(embeddablePanelToggleMenuIcon)
[00:00:48]               │ debg Find.clickByCssSelector('[data-test-subj="embeddablePanelToggleMenuIcon"]') with timeout=10000
[00:00:48]               │ debg Find.findByCssSelector('[data-test-subj="embeddablePanelToggleMenuIcon"]') with timeout=10000
[00:00:48]               │ debg TestSubjects.click(embeddablePanelAction-ACTION_ADD_PANEL)
[00:00:48]               │ debg Find.clickByCssSelector('[data-test-subj="embeddablePanelAction-ACTION_ADD_PANEL"]') with timeout=10000
[00:00:48]               │ debg Find.findByCssSelector('[data-test-subj="embeddablePanelAction-ACTION_ADD_PANEL"]') with timeout=10000
[00:00:48]               │ debg TestSubjects.click(createNew)
[00:00:48]               │ debg Find.clickByCssSelector('[data-test-subj="createNew"]') with timeout=10000
[00:00:48]               │ debg Find.findByCssSelector('[data-test-subj="createNew"]') with timeout=10000
[00:00:49]               │ debg TestSubjects.click(createNew-TODO_EMBEDDABLE)
[00:00:49]               │ debg Find.clickByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:00:49]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:00:59]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:00:59]               │      Wait timed out after 10030ms
[00:00:59]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:01:09]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:01:09]               │      Wait timed out after 10058ms
[00:01:10]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:01:20]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:01:20]               │      Wait timed out after 10034ms
[00:01:20]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:01:30]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:01:30]               │      Wait timed out after 10043ms
[00:01:31]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:01:41]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:01:41]               │      Wait timed out after 10054ms
[00:01:41]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:01:51]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:01:51]               │      Wait timed out after 10026ms
[00:01:52]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:02:02]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:02:02]               │      Wait timed out after 10039ms
[00:02:02]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:02:12]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:02:12]               │      Wait timed out after 10014ms
[00:02:13]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:02:23]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:02:23]               │      Wait timed out after 10017ms
[00:02:23]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:02:34]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:02:34]               │      Wait timed out after 10052ms
[00:02:34]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:02:44]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:02:44]               │      Wait timed out after 10050ms
[00:02:45]               │ debg Find.findByCssSelector('[data-test-subj="createNew-TODO_EMBEDDABLE"]') with timeout=10000
[00:02:55]               │ debg --- retry.try error: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
[00:02:55]               │      Wait timed out after 10009ms
[00:02:55]               │ info Taking screenshot "/dev/shm/workspace/kibana/test/functional/screenshots/failure/embeddable explorer creating and adding children Can create a new child.png"
[00:02:55]               │ info Current URL is: http://localhost:6111/app/embeddableExplorer/embeddablePanelExamplae
[00:02:55]               │ info Saving page source to: /dev/shm/workspace/kibana/test/examples/failure_debug/html/embeddable explorer creating and adding children Can create a new child.html
[00:02:55]               └- ✖ fail: "embeddable explorer creating and adding children Can create a new child"
[00:02:55]               │

Stack Trace

Error: retry.try timeout: TimeoutError: Waiting for element to be located By(css selector, [data-test-subj="createNew-TODO_EMBEDDABLE"])
Wait timed out after 10009ms
    at /dev/shm/workspace/kibana/node_modules/selenium-webdriver/lib/webdriver.js:842:17
    at process._tickCallback (internal/process/next_tick.js:68:7)
    at onFailure (test/common/services/retry/retry_for_success.ts:28:9)
    at retryForSuccess (test/common/services/retry/retry_for_success.ts:68:13)

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@kobelb kobelb merged commit 52747c9 into elastic:master Apr 13, 2020
@kobelb kobelb deleted the task-manager/fill-pool-once branch April 13, 2020 16:02
kobelb added a commit that referenced this pull request Apr 13, 2020
…#61991) (#63369)

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release_note:fix Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v7.8.0 v8.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants