
[Alerting] Investigate resilience / side effects of excessively long running rules #111259

Open

Description

We've seen certain rules run for excessively long durations, far exceeding what we expected rules to run for.
For example, we've seen large customers experience execution durations of well over 10 minutes (our general expectation was that rules would run for several seconds, definitely not several minutes).

This is concerning, as such a behaviour could have side effects we're not aware of.
We should investigate the possible implications of this, and track how often this happens using logging and telemetry.

What might be happening?

Task Manager is configured to assume that a task (in this case, a rule task) has "timed out" after 5 minutes, at which point it tries to run it again.

Due to the distributed nature of Kibana instances, the attempt to rerun the rule task might be picked up by any Kibana instance (not necessarily the one that ran it before). As we have no way of knowing whether a rule is still running or has crashed and "timed out", we assume it has failed (we never expected a healthy rule task to run this long) and try to rerun it.
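
For reference, this is roughly how the mechanism fits together (a minimal sketch, not the actual alerting plugin registration code; the rule type id and runner bodies are illustrative):

```ts
// Minimal sketch: a task type declares a timeout, and Task Manager derives the
// claimed task's `retryAt` from it. Once `retryAt` passes, any Kibana node may
// claim and rerun the task.
taskManager.registerTaskDefinitions({
  'alerting:example.ruleType': {
    title: 'Example rule task',
    timeout: '5m', // after this, the run is assumed to have failed
    createTaskRunner: (context) => ({
      async run() {
        // runs the rule executor; on large datasets this can exceed 5m
      },
      async cancel() {
        // called on timeout, but in-flight ES/SO queries aren't aborted today
      },
    }),
  },
});
```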

At that point, if the rule is in fact still running (and has simply exceeded 5 minutes), we likely end up with two instances of the same rule running in parallel. We aren't sure what the side effects of this might be, but one instance will likely end up overwriting the result of the other - this is an unhealthy state, would likely have unexpected consequences and shouldn't happen.

Additionally, it's worth noting that most of the time this execution duration far exceeds the interval configured by the customer. This means that a rule might be configured to run every 1 minute but end up running every 11 minutes, or worse, at an unpredictable interval above 10 minutes.

What should we do?

Feel free to change this list of actions, but off the top of my head, this is what I think we should do:

  • Validate the current behaviour (the above mentioned scenario was validated, but we haven't investigated the consequences yet) and update this issue
  • Add logging (Kibana server log / Event Log?) and telemetry so we can track how often task executions actually exceed their "retryAt", and specifically which task types this happens with (it might be all rules, it might be specific ones; we don't know at this point). See the sketch after this list.
  • Investigate possible guardrails we can add to prevent this from happening (see points below)
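
A rough sketch of what the overrun tracking could look like. The `logger` and `incrementCounter` parameters stand in for whatever logging and usage-counter APIs we end up using, and the helper name is made up for illustration:

```ts
// Hypothetical wrapper around a task run that records when a run finishes
// after its `retryAt`, i.e. after another node may already have claimed it.
async function runWithOverrunTracking(
  task: { taskType: string; retryAt: Date | null },
  run: () => Promise<unknown>,
  logger: { warn: (msg: string) => void },
  incrementCounter: (counterName: string, taskType: string) => void
) {
  const startedAt = Date.now();
  try {
    return await run();
  } finally {
    const finishedAt = Date.now();
    if (task.retryAt && finishedAt > task.retryAt.getTime()) {
      const overrunMs = finishedAt - task.retryAt.getTime();
      logger.warn(
        `Task ${task.taskType} ran for ${finishedAt - startedAt}ms and exceeded retryAt by ${overrunMs}ms`
      );
      incrementCounter('task_overrun_past_retry_at', task.taskType);
    }
  }
}
```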

Thoughts on guardrails

Adding guardrails around this at framework level is very difficult, but presumably our goal is to reduce the likelihood of this kind of thing happening.

Directions worth exploring:

  1. Can we tailor the retryAt of specific rule types to some kind of cluster-level average? We already track p90/p99 of each task type in Task Manager's health stats; can we perhaps use that (when it tends to exceed 5m)? See the first sketch after this list.
  2. Can we prevent rule types that run that long from ever being released into production? We know these rules didn't run that long in Dev/Test, but they do run this long on large datasets in Prod, so how can we catch that sooner?
  3. Can we use some kind of preview mechanism to test the rule execution time during the rule creation step? I don't want to block a user for too long when creating the rule, but perhaps we can offer them the option to try it and that would surface the long execution time sooner?
  4. Can we cancel the rule when it exceeds the 5m execution time? We know we can't (currently) cancel the ES/SO queries performed by the implementation, but perhaps we can cancel the run at framework level so that its result doesn't actually get saved? See the second sketch after this list.
  5. Can we prevent a customer from setting an interval that is shorter than the time it took to run the preview (if we have a preview)?
  6. Can we prevent a customer from setting an interval that is shorter than the average time it takes to run these rule types in this cluster?
  7. Can we add UX in the rule creation flyout that tells users what the average execution time of a rule type is and warns them when their interval is below that average?
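
A minimal sketch of idea 1, assuming we can read the per-task-type p99 execution duration from Task Manager's monitoring stats (the function and constant names are illustrative):

```ts
// Derive a per-task-type timeout from the cluster's observed p99 duration,
// so a healthy-but-slow rule type isn't claimed twice mid-run.
const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000;

function timeoutForTaskType(p99DurationMs: number | undefined): number {
  if (p99DurationMs == null) return DEFAULT_TIMEOUT_MS;
  // Only stretch the timeout when the observed p99 exceeds the default,
  // and add headroom on top of it.
  return Math.max(DEFAULT_TIMEOUT_MS, Math.ceil(p99DurationMs * 1.5));
}
```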
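And a sketch of idea 4: a framework-level guard that races the executor against the timeout and discards the result when the timeout wins. The underlying ES/SO queries keep running (we can't abort them today); this only prevents the stale run from persisting its result. Names are illustrative:

```ts
// Race the rule executor against the task timeout; if the timeout wins,
// report the run as cancelled and drop its result instead of saving it.
async function runWithCancellation<T>(
  executor: () => Promise<T>,
  timeoutMs: number
): Promise<{ cancelled: boolean; result?: T }> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  const outcome = await Promise.race([
    executor().then((result) => ({ result })),
    timedOut,
  ]);
  if (timer !== undefined) clearTimeout(timer);
  if (outcome === 'timeout') {
    // Don't persist alert/rule state from this run; let the retried task own it.
    return { cancelled: true };
  }
  return { cancelled: false, result: outcome.result };
}
```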

I'm sure there are more directions, but these feel like a good starting point.

