@gmmorris
Contributor

Backports the following commits to 7.x:

…overs from failure (elastic#75420)

Introduces a monitor around the Task Manager poller which pipes through all values emitted by the poller and recovers from poller failures or stalls.
This monitor does the following:
1. Catches errors thrown by the poller and recovers by proxying each error to a handler, while continuing to listen to the poller.
2. Reacts to the poller's `error` (caused by uncaught errors) and `completion` events by starting a new poller and piping its events through to any previous subscribers (in our case, Task Manager itself).
3. Tracks the rate at which the poller emits events (these can be both work events and `No Task` events, so polling and finding no work still counts as an emitted event), times out when the gap between events gets too long (suggesting the poller has hung), and replaces the poller with a new one.
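
The three behaviours above can be illustrated with a minimal, dependency-free sketch. This is not the code in this PR: `PollerMonitor`, `Poller`, `Listener`, `pollerFactory`, `inactivityTimeoutMs` and `checkForStall` are hypothetical names chosen for illustration, the poller is modelled as a plain callback source rather than an observable, and item 1 is read here as "an error thrown while an emitted event is being handled".

```typescript
// Hypothetical event-source shape standing in for the real poller stream.
type Listener<T> = {
  next: (value: T) => void;
  error: (err: unknown) => void;
  complete: () => void;
};

interface Poller<T> {
  // Returns an unsubscribe function.
  subscribe(listener: Listener<T>): () => void;
}

interface MonitorOptions<T> {
  pollerFactory: () => Poller<T>; // builds a fresh poller on every (re)start
  onError: (err: unknown) => void; // handler that errors are proxied to
  inactivityTimeoutMs: number; // longest allowed gap between emitted events
  now?: () => number; // injectable clock (assumption, eases testing)
}

class PollerMonitor<T> {
  restarts = 0;
  private opts: MonitorOptions<T>;
  private subscribers: Array<(value: T) => void> = [];
  private unsubscribe: () => void = () => {};
  private lastEventAt = 0;

  constructor(opts: MonitorOptions<T>) {
    this.opts = opts;
  }

  // Subscribers survive poller restarts, so callers never resubscribe.
  subscribe(fn: (value: T) => void): void {
    this.subscribers.push(fn);
  }

  // Assumes the poller does not emit synchronously inside subscribe().
  start(): void {
    this.lastEventAt = this.now();
    let active = true;
    const stop = this.opts.pollerFactory().subscribe({
      next: (value) => {
        if (!active) return; // ignore events from a replaced poller
        this.lastEventAt = this.now(); // any event counts, even "no work"
        for (const fn of this.subscribers) {
          // 1. A thrown error is proxied to the handler; listening continues.
          try { fn(value); } catch (err) { this.opts.onError(err); }
        }
      },
      // 2. `error` and `completion` both end the poller, so start a new one.
      error: (err) => { if (active) { this.opts.onError(err); this.restart(); } },
      complete: () => { if (active) this.restart(); },
    });
    this.unsubscribe = () => { active = false; stop(); };
  }

  // 3. Call periodically; replaces the poller if it has been silent too long.
  checkForStall(): void {
    if (this.now() - this.lastEventAt > this.opts.inactivityTimeoutMs) {
      this.opts.onError(new Error('poller stalled'));
      this.restart();
    }
  }

  private restart(): void {
    this.unsubscribe();
    this.restarts += 1;
    this.start();
  }

  private now(): number {
    return (this.opts.now ?? Date.now)();
  }
}
```

In this sketch the stall check is driven by the caller, for example from a `setInterval` that calls `monitor.checkForStall()`, which keeps the timing logic separate from the restart logic and lets a test substitute a fake clock.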

We're not aware of any clear cases where Task Manager should actually get restarted by the monitor - this is definitely an error case and we have addressed all known cases.
The goal of introducing this monitor is to provide an insurance policy in case an unexpected error breaks the poller in a long-running production environment.
@gmmorris added the `backport` label (This PR is a backport of another PR) on Aug 21, 2020
@kibanamachine
Contributor

💚 Build Succeeded

Build metrics

✅ unchanged

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@gmmorris merged commit 4343c8d into elastic:7.x on Aug 21, 2020