Description
The current BaseAlert, used to define all of the Stack Monitoring (SM) rule types, has de-duping logic to avoid alert noise. If a cluster has 20 nodes that suddenly all move into an alertable state for a particular rule, the SM rules will not create 20 alerts. Instead, the rule will create a single alert for the cluster, listing all of the alerting nodes in the message.
The way we do this is to use a custom alert instance ID: https://github.com/elastic/kibana/blob/master/x-pack/plugins/monitoring/server/alerts/base_alert.ts#L286
${this.alertOptions.id}:${cluster.clusterUuid}:${firingNodeUuids}
where alertOptions.id is something like monitoring_alert_cpu_usage for the CPU rule type, clusterUuid is the ID of that ES cluster, and firingNodeUuids is the list of currently firing node IDs joined by commas. This means that if the list of firing IDs stays constant, this will continue to be one single alert, and actions will be throttled accordingly. However, if that list of firing IDs changes (nodes stop firing, new nodes begin firing, etc.), then a new alert instance will be created and new actions will be triggered according to a new throttle schedule.
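To make the consequence concrete, here is a minimal sketch of that ID scheme. It is not the code from base_alert.ts: the helper name buildGroupedInstanceId and the sample UUIDs are hypothetical, and only the ID layout comes from the description above.

```typescript
// Hypothetical helper mirroring the ID layout described above:
// <rule type id>:<cluster uuid>:<comma-joined firing node uuids>
function buildGroupedInstanceId(
  ruleTypeId: string,
  clusterUuid: string,
  firingNodeUuids: string[]
): string {
  return `${ruleTypeId}:${clusterUuid}:${firingNodeUuids.join(',')}`;
}

// Two nodes firing on the same cluster produce one instance ID.
const before = buildGroupedInstanceId('monitoring_alert_cpu_usage', 'cluster-1', [
  'node-a',
  'node-b',
]);

// One node recovers: the firing list changes, so the ID changes too.
const after = buildGroupedInstanceId('monitoring_alert_cpu_usage', 'cluster-1', ['node-a']);

console.log(before); // monitoring_alert_cpu_usage:cluster-1:node-a,node-b
console.log(after); // monitoring_alert_cpu_usage:cluster-1:node-a
console.log(before === after); // false
```

Because the two IDs differ, the framework would see the second run as a brand new alert instance with its own throttle schedule, even though node-a never stopped firing.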
There are a few problems with this approach, based on the Alerting docs for the services.alertInstanceFactory method.
- The docs say, "Note that the id only needs to be unique within the scope of a specific alert, not unique across all alerts or alert types", so we don't need to prefix these instance IDs with the alertOptions.id.
- By implementing our own custom grouping with these alerts, we may accidentally block our ability to incorporate new features that the alerting framework gives us.
- Resolve action groups are currently tricky, if not impossible, for us to implement because our alerts don't resolve until all nodes on the cluster resolve, although technically each alert instance resolves once the list of firing IDs changes and a new instance is created.
- This ID generation appears to cause problems if a user happens to create two or more of the same kind of rule from a given rule type (TBD on what exact issues, but this PR was reverted because of issues that @igoristic reported).
AC:
- Each firing node should generate its own alert instance (its ID can just be its node ID) which will then have the user's throttling rules applied to it individually.
- Context variables for all SM alert types, along with any default messages, are updated to assume they are per node where applicable.
- UI that displays this alert must be able to handle a per-node alert (but also handle the old style for backwards compatibility).
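
The first two criteria could look roughly like the sketch below. It is an illustration under stated assumptions, not a proposed implementation: NodeState, InstanceLike, schedulePerNodeAlerts, the 'default' action group name, and the context field names are all hypothetical, and the in-memory factory only stands in for services.alertInstanceFactory.

```typescript
// Hypothetical per-node state produced by a rule's data fetch.
interface NodeState {
  nodeUuid: string;
  nodeName: string;
  firing: boolean;
}

// Minimal stand-in for the object an instance factory hands back.
interface InstanceLike {
  scheduleActions(actionGroup: string, context: Record<string, unknown>): void;
}
type InstanceFactory = (id: string) => InstanceLike;

// One alert instance per firing node, keyed by the node ID alone,
// with context variables that describe a single node.
function schedulePerNodeAlerts(
  clusterUuid: string,
  nodes: NodeState[],
  factory: InstanceFactory
): string[] {
  const scheduledIds: string[] = [];
  for (const node of nodes) {
    if (!node.firing) {
      continue; // non-firing nodes get no instance, so they can resolve on their own
    }
    factory(node.nodeUuid).scheduleActions('default', {
      clusterUuid,
      nodeUuid: node.nodeUuid,
      nodeName: node.nodeName,
    });
    scheduledIds.push(node.nodeUuid);
  }
  return scheduledIds;
}

// In-memory stub factory that records the context scheduled for each instance ID.
const recorded: Record<string, Record<string, unknown>> = {};
const stubFactory: InstanceFactory = (id) => ({
  scheduleActions: (_actionGroup, context) => {
    recorded[id] = context;
  },
});

const scheduledIds = schedulePerNodeAlerts(
  'cluster-1',
  [
    { nodeUuid: 'node-a', nodeName: 'es-node-a', firing: true },
    { nodeUuid: 'node-b', nodeName: 'es-node-b', firing: false },
    { nodeUuid: 'node-c', nodeName: 'es-node-c', firing: true },
  ],
  stubFactory
);

console.log(scheduledIds); // [ 'node-a', 'node-c' ]
```

With IDs keyed by node, node-a keeps the same instance for as long as it fires, regardless of what node-b and node-c do, so throttling and resolution would apply to each node individually.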