
[Alerting] Investigate resilience / side effects of excessively long running rules #111259

Open

Description

We've seen certain rules run for excessively long durations, far exceeding what we expected rules to run for.
For example, we've seen large customers experience execution durations of well over 10 minutes (our general expectation was that rules would run for several seconds, definitely not several minutes).

This is concerning, as such a behaviour could have side effects we're not aware of.
We should investigate the possible implications of this, and track how often this happens using logging and telemetry.

What might be happening?

Task Manager is configured to assume that a task (in this case, a rule task) has "timed out" after 5 minutes, at which point it tries to run it again.

Due to the distributed nature of Kibana instances, the attempt to rerun the rule task might be picked up by any Kibana instance (not necessarily the one that ran it before). As we have no way of knowing whether a rule is still running or has crashed and "timed out", we assume it has failed (we never expected a healthy rule task to run this long) and try to rerun it.
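
For reference, this is roughly how the mechanism fits together (a minimal sketch, not the actual alerting plugin registration code; the rule type id and runner bodies are illustrative):

```ts
// Minimal sketch: a task type declares a timeout, and Task Manager derives the
// claimed task's `retryAt` from it. Once `retryAt` passes, any Kibana node may
// claim and rerun the task.
taskManager.registerTaskDefinitions({
  'alerting:example.ruleType': {
    title: 'Example rule task',
    timeout: '5m', // after this, the run is assumed to have failed
    createTaskRunner: (context) => ({
      async run() {
        // runs the rule executor; on large datasets this can exceed 5m
      },
      async cancel() {
        // called on timeout, but in-flight ES/SO queries aren't aborted today
      },
    }),
  },
});
```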

At that point, if the rule is in fact still running (and has simply exceeded 5 minutes), we likely end up with two instances of the same rule running in parallel. We aren't sure what the side effects of this might be, but one instance will likely end up overwriting the result of the other - this is an unhealthy state, would likely have unexpected consequences and shouldn't happen.

Additionally, it's worth noting that most of the time this execution duration far exceeds the interval configured by the customer. This means that a rule might be configured to run every 1 minute but end up running every 11 minutes, or worse, at an unpredictable interval above 10 minutes.

What should we do?

Feel free to change this list of actions, but off the top of my head, this is what I think we should do:

  • Validate the current behaviour (the above mentioned scenario was validated, but we haven't investigated the consequences yet) and update this issue
  • Add logging (Kibana server log / Event Log?) and telemetry so we can track how often task executions actually exceed their "retryAt", and specifically which task types this happens with (it might be all rules, it might be specific ones; we don't know at this point). See the sketch after this list.
  • Investigate possible guardrails we can add to prevent this from happening (see points below)
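
A rough sketch of what the overrun tracking could look like. The `logger` and `incrementCounter` parameters stand in for whatever logging and usage-counter APIs we end up using, and the helper name is made up for illustration:

```ts
// Hypothetical wrapper around a task run that records when a run finishes
// after its `retryAt`, i.e. after another node may already have claimed it.
async function runWithOverrunTracking(
  task: { taskType: string; retryAt: Date | null },
  run: () => Promise<unknown>,
  logger: { warn: (msg: string) => void },
  incrementCounter: (counterName: string, taskType: string) => void
) {
  const startedAt = Date.now();
  try {
    return await run();
  } finally {
    const finishedAt = Date.now();
    if (task.retryAt && finishedAt > task.retryAt.getTime()) {
      const overrunMs = finishedAt - task.retryAt.getTime();
      logger.warn(
        `Task ${task.taskType} ran for ${finishedAt - startedAt}ms and exceeded retryAt by ${overrunMs}ms`
      );
      incrementCounter('task_overrun_past_retry_at', task.taskType);
    }
  }
}
```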

Thoughts on guardrails

Adding guardrails around this at framework level is very difficult, but presumably our goal is to reduce the likelihood of this kind of thing happening.

Directions worth exploring:

  1. Can we tailor the retryAt of specific rule types to some kind of cluster-level average? We already track p90/p99 of each task type in Task Manager's health stats; can we perhaps use that (when it tends to exceed 5m)? See the first sketch after this list.
  2. Can we prevent rule types that run that long from ever being released into production? We know these rules didn't run that long in Dev/Test, but they do run this long on large datasets in Prod, so how can we catch that sooner?
  3. Can we use some kind of preview mechanism to test the rule execution time during the rule creation step? I don't want to block a user for too long when creating the rule, but perhaps we can offer them the option to try it and that would surface the long execution time sooner?
  4. Can we cancel the rule when it exceeds the 5m execution time? We know we can't (currently) cancel the ES/SO queries performed by the implementation, but perhaps we can cancel the run at framework level so that its result doesn't actually get saved? See the second sketch after this list.
  5. Can we prevent a customer from setting an interval that is shorter than the time it took to run the preview (if we have a preview)?
  6. Can we prevent a customer from setting an interval that is shorter than the average time it takes to run these rule types in this cluster?
  7. Can we add UX in the rule creation flyout that tells users what the average execution time of a rule type is and warns them when their interval is below that average?
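
A minimal sketch of idea 1, assuming we can read the per-task-type p99 execution duration from Task Manager's monitoring stats (the function and constant names are illustrative):

```ts
// Derive a per-task-type timeout from the cluster's observed p99 duration,
// so a healthy-but-slow rule type isn't claimed twice mid-run.
const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000;

function timeoutForTaskType(p99DurationMs: number | undefined): number {
  if (p99DurationMs == null) return DEFAULT_TIMEOUT_MS;
  // Only stretch the timeout when the observed p99 exceeds the default,
  // and add headroom on top of it.
  return Math.max(DEFAULT_TIMEOUT_MS, Math.ceil(p99DurationMs * 1.5));
}
```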
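And a sketch of idea 4: a framework-level guard that races the executor against the timeout and discards the result when the timeout wins. The underlying ES/SO queries keep running (we can't abort them today); this only prevents the stale run from persisting its result. Names are illustrative:

```ts
// Race the rule executor against the task timeout; if the timeout wins,
// report the run as cancelled and drop its result instead of saving it.
async function runWithCancellation<T>(
  executor: () => Promise<T>,
  timeoutMs: number
): Promise<{ cancelled: boolean; result?: T }> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  const outcome = await Promise.race([
    executor().then((result) => ({ result })),
    timedOut,
  ]);
  if (timer !== undefined) clearTimeout(timer);
  if (outcome === 'timeout') {
    // Don't persist alert/rule state from this run; let the retried task own it.
    return { cancelled: true };
  }
  return { cancelled: false, result: outcome.result };
}
```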

I'm sure there are more directions, but these feel like a good starting point.

