Commit a0b5dd6

[alerting] adds built-in index threshold alert type
resolves #53041
1 parent 457783e commit a0b5dd6

30 files changed

+2385
-0
lines changed
Lines changed: 290 additions & 0 deletions
@@ -0,0 +1,290 @@
1+
# alerting_builtins plugin
2+
3+
This plugin provides alertTypes shipped with Kibana for use with the
4+
[alerting plugin](../alerting/README.md). When enabled, it will register
5+
the built-in alertTypes with the alerting plugin, register associated HTTP
6+
routes, etc.
7+
8+
The plugin `setup` and `start` contracts for this plugin are the following
9+
type, which provides some runtime capabilities. Each built-in alertType will
10+
have its own top-level property in the `IService` interface, if it needs to
11+
expose functionality.
12+
13+
```ts
14+
export interface IService {
15+
indexThreshold: {
16+
timeSeriesQuery(params: TimeSeriesQueryParameters): Promise<TimeSeriesResult>;
17+
}
18+
}
19+
```
20+
21+
## built-in alertType index threshold
22+
23+
The index threshold alert type is designed to run an ES query over indices,
24+
aggregating field values from documents, comparing them to threshold values,
25+
and scheduling actions to run when the thresholds are met.
26+
27+
An example would be checking a monitoring index for percent cpu usage field
28+
values that are greater than some threshold, which could then be used to invoke
29+
an action (email, slack, etc) to notify interested parties when the threshold
30+
is exceeded.
31+
32+
### alertType `.index-threshold`
33+
34+
The alertType parameters are specified in
35+
[`index_threshold/lib/core_query_types.ts`][it-core-query]
36+
and
37+
[`index_threshold/alert_type_params.ts`][it-alert-params].
38+
39+
The alertType has a single actionGroup, `'threshold met'`. The `context` object
40+
provided to actions is specified in
41+
[`index_threshold/action_context.ts`][it-alert-context].
42+
43+
[it-alert-params]: server/alert_types/index_threshold/alert_type_params.ts
44+
[it-alert-context]: server/alert_types/index_threshold/action_context.ts
45+
[it-core-query]: server/alert_types/index_threshold/lib/core_query_types.ts
46+
47+
#### example
48+
49+
This example uses [kbn-action][]'s `kbn-alert` command to create the alert,
50+
and [es-hb-sim][] to generate ES documents for the alert to run queries
51+
against.
52+
53+
Start `es-hb-sim`:
54+
55+
```
56+
es-hb-sim 1 es-hb-sim host-A https://elastic:changeme@localhost:9200
57+
```
58+
59+
This will start indexing documents of the following form, to the `es-hb-sim`
60+
index:
61+
62+
```
63+
{"@timestamp":"2020-02-20T22:10:30.011Z","summary":{"up":1,"down":0},"monitor":{"status":"up","name":"host-A"}}
64+
```
65+
66+
Press `u` to have it start writing "down" documents instead of "up" documents.
67+
68+
Create a server log action that we can use with the alert:
69+
70+
```
71+
export ACTION_ID=`kbn-action create .server-log 'server-log' '{}' '{}' | jq -r '.id'`
72+
```
73+
74+
Finally, create the alert:
75+
76+
```
77+
kbn-alert create .index-threshold 'es-hb-sim threshold' 1s \
78+
'{
79+
index: es-hb-sim
80+
timeField: @timestamp
81+
aggType: average
82+
aggField: summary.up
83+
groupField: monitor.name.keyword
84+
window: 5s
85+
comparator: lessThan
86+
threshold: [ 0.6 ]
87+
}' \
88+
"[
89+
{
90+
group: threshold met
91+
id: '$ACTION_ID'
92+
params: {
93+
level: warn
94+
message: '{{context.message}}'
95+
}
96+
}
97+
]"
98+
```
99+
100+
This alert will run a query over the `es-hb-sim` index, using the `@timestamp`
101+
field as the date field, using an `average` aggregation over the `summary.up`
102+
field. The results are then grouped by `monitor.name.keyword`. If we ran
103+
another instance of `es-hb-sim`, using `host-B` instead of `host-A`, then the
104+
alert will end up potentially scheduling actions for both, independently.
105+
Within the alerting plugin, this grouping is also referred to as "instanceIds"
106+
(`host-A` and `host-B` being distinct instanceIds, which can have actions
107+
scheduled against them independently).
108+
109+
The `window` is set to `5s` which is 5 seconds. That means, every time the
110+
alert runs its queries (every second, in the example above), it will run its
111+
ES query over the last 5 seconds. Thus, the queries, over time, will overlap.
112+
Sometimes that's what you want. Other times, maybe you just want to do
113+
sampling, running an alert every hour, with a 5 minute window. Up to you!
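
The overlap described above can be checked with a little arithmetic. The
following is a minimal sketch, not code from the plugin: `queryRanges` and
`overlapMs` are hypothetical helpers that list the time range each alert run
would query and measure how much two consecutive runs share.

```ts
// Hypothetical illustration only; these helpers are not part of the plugin.
interface QueryRange {
  from: Date;
  to: Date;
}

// Returns the range queried by each of `runs` consecutive alert executions.
function queryRanges(
  firstRun: Date,
  intervalMs: number,
  windowMs: number,
  runs: number
): QueryRange[] {
  const ranges: QueryRange[] = [];
  for (let i = 0; i < runs; i++) {
    const to = new Date(firstRun.getTime() + i * intervalMs);
    ranges.push({ from: new Date(to.getTime() - windowMs), to });
  }
  return ranges;
}

// Returns how many milliseconds two ranges have in common (0 if disjoint).
function overlapMs(a: QueryRange, b: QueryRange): number {
  const start = Math.max(a.from.getTime(), b.from.getTime());
  const end = Math.min(a.to.getTime(), b.to.getTime());
  return Math.max(0, end - start);
}
```

With a 1 second interval and a 5 second window, consecutive runs share 4
seconds of data. With a 1 hour interval and a 5 minute window they share
nothing, which is the sampling case.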
114+
115+
Using the `comparator` `lessThan` and `threshold` `[0.6]`, the alert will
116+
calculate the average of all the `summary.up` fields for each unique
117+
`monitor.name.keyword`, and then if the value is less than 0.6, it will
118+
schedule the specified action (server log) to run. The `message` param
119+
passed to the action includes a mustache template for the context variable
120+
`message`, which is created by the alert type. That variable holds
120+
a generic but useful text message, already constructed. Alternatively,
122+
a customer could set the `message` param in the action to a much more
123+
complex message, using other context variables made available by the
124+
alert type.
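
To make the comparison step concrete, here is a minimal in-memory sketch of
the same logic. It is an assumption-laden illustration: the real alert type
computes the average with an ES aggregation, and `HeartbeatDoc`,
`averageUpByGroup`, and `groupsBelowThreshold` are hypothetical names, not
part of the plugin. Only the `lessThan` comparator from the example is
modeled.

```ts
// Hypothetical illustration only; the plugin does this work in an ES query.
interface HeartbeatDoc {
  monitorName: string; // stands in for monitor.name.keyword
  up: number; // stands in for summary.up
}

// Averages the `up` values separately for each monitor name.
function averageUpByGroup(docs: HeartbeatDoc[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const doc of docs) {
    const entry = totals[doc.monitorName] || { sum: 0, count: 0 };
    entry.sum += doc.up;
    entry.count += 1;
    totals[doc.monitorName] = entry;
  }
  const averages: Record<string, number> = {};
  for (const group of Object.keys(totals)) {
    averages[group] = totals[group].sum / totals[group].count;
  }
  return averages;
}

// Models `comparator: lessThan` with a single-element `threshold` array:
// returns the groups whose average is below threshold[0].
function groupsBelowThreshold(docs: HeartbeatDoc[], threshold: number[]): string[] {
  const averages = averageUpByGroup(docs);
  return Object.keys(averages).filter((group) => averages[group] < threshold[0]);
}
```

Each group returned would correspond to one instanceId that has the
`'threshold met'` action group scheduled.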
125+
126+
Here's the message you should see in the Kibana console, if everything is
127+
working:
128+
129+
```
130+
server log [17:32:10.060] [warning][actions][actions][plugins] \
131+
Server log: alert es-hb-sim threshold instance host-A value 0 \
132+
exceeded threshold average(summary.up) lessThan 0.6 over 5s \
133+
on 2020-02-20T22:32:07.000Z
134+
```
135+
136+
[kbn-action]: https://github.com/pmuellr/kbn-action
137+
[es-hb-sim]: https://github.com/pmuellr/es-hb-sim
138+
139+
140+
### http endpoints
141+
142+
An HTTP endpoint is provided to return the values the alertType would calculate,
143+
over a series of time. This is intended to be used in the alerting UI to
144+
provide a "preview" of the alert during creation/editing based on recent data,
145+
and could be used to show a "simulation" of the alert over an arbitrary
146+
range of time.
147+
148+
The endpoint is `POST /api/alerting_builtins/index_threshold/_time_series_query`.
149+
The request and response bodies are specified in
150+
[`index_threshold/lib/core_query_types.ts`][it-core-query]
151+
and
152+
[`index_threshold/lib/time_series_types.ts`][it-timeSeries-types].
153+
The request body is very similar to the alertType's parameters.
154+
155+
#### example
156+
157+
Continuing with the example above, here's a query to get the values calculated
158+
for the last 10 seconds:
159+
160+
_note: you'll need to change the `dateStart` and `dateEnd`
161+
values as appropriate_
162+
163+
```console
164+
curl -k "https://elastic:changeme@localhost:5601/api/alerting_builtins/index_threshold/_time_series_query" \
165+
-H "kbn-xsrf: foo" -H "content-type: application/json" -d '{
166+
"index": "es-hb-sim",
167+
"timeField": "@timestamp",
168+
"aggType": "average",
169+
"aggField": "summary.up",
170+
"groupField": "monitor.name.keyword",
171+
"interval": "1s",
172+
"dateStart": "2020-02-26T15:10:40.000Z",
173+
"dateEnd": "2020-02-26T15:10:50.000Z",
174+
"window": "5s"
175+
}'
176+
```
177+
178+
```
179+
{
180+
"results": [
181+
{
182+
"group": "host-A",
183+
"metrics": [
184+
[ "2020-02-26T15:10:40.000Z", 0 ],
185+
[ "2020-02-26T15:10:41.000Z", 0 ],
186+
[ "2020-02-26T15:10:42.000Z", 0 ],
187+
[ "2020-02-26T15:10:43.000Z", 0 ],
188+
[ "2020-02-26T15:10:44.000Z", 0 ],
189+
[ "2020-02-26T15:10:45.000Z", 0 ],
190+
[ "2020-02-26T15:10:46.000Z", 0 ],
191+
[ "2020-02-26T15:10:47.000Z", 0 ],
192+
[ "2020-02-26T15:10:48.000Z", 0 ],
193+
[ "2020-02-26T15:10:49.000Z", 0 ],
194+
[ "2020-02-26T15:10:50.000Z", 0 ]
195+
]
196+
}
197+
]
198+
}
199+
```
200+
201+
To get the current value of the calculated metric, you can leave off the `dateStart` and `dateEnd` parameters:
202+
203+
```
204+
curl -k "https://elastic:changeme@localhost:5601/api/alerting_builtins/index_threshold/_time_series_query" \
205+
-H "kbn-xsrf: foo" -H "content-type: application/json" -d '{
206+
"index": "es-hb-sim",
207+
"timeField": "@timestamp",
208+
"aggType": "average",
209+
"aggField": "summary.up",
210+
"groupField": "monitor.name.keyword",
211+
"interval": "1s",
212+
"window": "5s"
213+
}'
214+
```
215+
216+
```
217+
{
218+
"results": [
219+
{
220+
"group": "host-A",
221+
"metrics": [
222+
[ "2020-02-26T15:23:36.635Z", 0 ]
223+
]
224+
}
225+
]
226+
}
227+
```
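
The two requests above differ only in whether `dateStart` and `dateEnd` are
present. A caller could build either body with a small helper like the
sketch below. The interfaces here are simplified assumptions based on the
fields shown in the curl examples; the authoritative types are the ones in
`core_query_types.ts` and `time_series_types.ts`, and
`buildTimeSeriesQueryBody` is a hypothetical name.

```ts
// Simplified, assumed shapes; see the referenced type files for the real ones.
interface CoreQueryParamsSketch {
  index: string;
  timeField: string;
  aggType: string;
  aggField?: string;
  groupField?: string;
  window: string;
}

interface DateRangeSketch {
  dateStart: string; // ISO date string
  dateEnd: string; // ISO date string
}

interface TimeSeriesQueryBodySketch extends CoreQueryParamsSketch {
  interval: string;
  dateStart?: string;
  dateEnd?: string;
}

// Adds `interval`, and optionally a date range, to the alert-style params.
function buildTimeSeriesQueryBody(
  params: CoreQueryParamsSketch,
  interval: string,
  range?: DateRangeSketch
): TimeSeriesQueryBodySketch {
  const body: TimeSeriesQueryBodySketch = { ...params, interval };
  if (range) {
    body.dateStart = range.dateStart;
    body.dateEnd = range.dateEnd;
  }
  return body;
}
```

Omitting `range` produces the "current value" form of the request; passing
it produces the time-series form.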
228+
229+
[it-timeSeries-types]: server/alert_types/index_threshold/lib/time_series_types.ts
230+
231+
### service functions
232+
233+
A single service function is available that provides the functionality
234+
of the http endpoint `POST /api/alerting_builtins/index_threshold/_time_series_query`,
235+
but as an API for Kibana plugins. The function is available as
236+
`alertingService.indexThreshold.timeSeriesQuery()`.
237+
238+
The parameters and return value for the function are the same as for the HTTP
239+
request, though some additional parameters are required (logger, callCluster,
240+
etc).
241+
242+
### notes on the timeSeriesQuery API / http endpoint
243+
244+
This API provides additional parameters beyond what the alertType itself uses:
245+
246+
- `dateStart`
247+
- `dateEnd`
248+
- `interval`
249+
250+
The `dateStart` and `dateEnd` parameters are ISO date strings.
251+
252+
The `interval` parameter is intended to model the `interval` the alert is
253+
currently using, and uses the same `1s`, `2m`, `3h`, etc format. Over the
254+
supplied date range, a time-series data point will be calculated every
255+
`interval` duration.
256+
257+
So the number of time-series points in the output of the API should be:
258+
259+
```
260+
( dateEnd - dateStart ) / interval
261+
```
262+
263+
Example:
264+
265+
```
266+
dateStart: '2020-01-01T00:00:00'
267+
dateEnd: '2020-01-02T00:00:00'
268+
interval: '1h'
269+
```
270+
271+
The date range is 1 day === 24 hours. The interval is 1 hour. So there should
272+
be ~24 time series points in the output.
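
The estimate can be written out as a short sketch. `parseDurationMs` and
`expectedPoints` are hypothetical helpers, not the plugin's own duration
parsing; the `d` unit is an assumption beyond the `s`, `m`, and `h` units
mentioned above. The dates in the usage are assumed to carry an explicit `Z`
so they are parsed as UTC.

```ts
// Hypothetical illustration only; not the plugin's duration parser.
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000, // assumed unit, not mentioned in this README
};

// Converts a duration such as '1s', '2m', or '3h' to milliseconds.
function parseDurationMs(duration: string): number {
  const match = /^(\d+)([smhd])$/.exec(duration);
  if (!match) {
    throw new Error(`invalid duration: ${duration}`);
  }
  return Number(match[1]) * UNIT_MS[match[2]];
}

// Approximate number of points: (dateEnd - dateStart) / interval.
function expectedPoints(dateStart: string, dateEnd: string, interval: string): number {
  const rangeMs = Date.parse(dateEnd) - Date.parse(dateStart);
  return Math.floor(rangeMs / parseDurationMs(interval));
}
```

Calling `expectedPoints('2020-01-01T00:00:00Z', '2020-01-02T00:00:00Z', '1h')`
yields 24. The 10 second example response earlier contains 11 points because
it includes both endpoints, so the result may be off by one, hence the `~`.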
273+
274+
For preview purposes:
275+
276+
- The `groupLimit` parameter should be used to help cut
277+
down on the amount of work ES does, and keep the generated graphs a little
278+
simpler. Probably something like `10`.
279+
280+
- For queries with long date ranges, you probably don't want to use the
281+
`interval` the alert is set to, as the `interval` used in the query, as this
282+
could result in a lot of time-series points being generated, which is both
283+
costly in ES, and may result in noisy graphs.
284+
285+
- The `window` parameter should be the same as what the alert is using,
286+
especially for the `count` and `sum` aggregation types. Those aggregations
287+
don't scale the same way the others do, when the window changes. Even for
288+
the other aggregations, changing the window could result in dramatically
289+
different values being generated - `averages` will be more "average-y", `min`
290+
and `max` will be a little stickier.
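
One way to follow the advice about long date ranges is to cap the number of
points a preview may generate and widen the query `interval` when needed.
This is a sketch of that idea only; `previewIntervalMs` and the `maxPoints`
cap are hypothetical and not something the plugin provides.

```ts
// Hypothetical illustration only. All values are in milliseconds.
// Returns the alert's own interval when it already yields few enough points,
// otherwise the smallest interval that keeps the point count within maxPoints.
function previewIntervalMs(rangeMs: number, alertIntervalMs: number, maxPoints: number): number {
  const minIntervalMs = Math.ceil(rangeMs / maxPoints);
  return Math.max(alertIntervalMs, minIntervalMs);
}
```

For a 24 hour range, an alert interval of 1 second, and a cap of 100 points,
this gives an interval of 864000 ms (14.4 minutes) instead of 86400 separate
points. The `window` passed to the query would stay unchanged, per the last
note above.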
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
{
2+
"id": "alertingBuiltins",
3+
"server": true,
4+
"version": "8.0.0",
5+
"kibanaVersion": "kibana",
6+
"requiredPlugins": ["alerting"],
7+
"ui": false
8+
}
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
1+
/*
2+
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
3+
* or more contributor license agreements. Licensed under the Elastic License;
4+
* you may not use this file except in compliance with the Elastic License.
5+
*/
6+
7+
import { Service, IRouter, AlertingSetup } from '../types';
8+
import { register as registerIndexThreshold } from './index_threshold';
9+
10+
interface RegisterBuiltInAlertTypesParams {
11+
service: Service;
12+
router: IRouter;
13+
alerting: AlertingSetup;
14+
baseRoute: string;
15+
}
16+
17+
export function registerBuiltInAlertTypes(params: RegisterBuiltInAlertTypesParams) {
18+
registerIndexThreshold(params);
19+
}
