Commit a517fd9

[alerting] initial index threshold alertType and supporting APIs (#57030) (#58901)
Adds the first built-in alertType for Kibana alerting, an index threshold alert, and an associated HTTP endpoint to generate preview data for it. Addresses the server-side requirements for issue #53041.
1 parent e9f7cfd commit a517fd9

32 files changed, 2508 additions, 0 deletions

x-pack/.i18nrc.json

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
  "xpack.actions": "plugins/actions",
  "xpack.advancedUiActions": "plugins/advanced_ui_actions",
  "xpack.alerting": "plugins/alerting",
+ "xpack.alertingBuiltins": "plugins/alerting_builtins",
  "xpack.apm": ["legacy/plugins/apm", "plugins/apm"],
  "xpack.beatsManagement": "legacy/plugins/beats_management",
  "xpack.canvas": "legacy/plugins/canvas",
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# alerting_builtins plugin

This plugin provides alertTypes shipped with Kibana for use with
[the alerting plugin](../alerting/README.md). When enabled, it will register
the built-in alertTypes with the alerting plugin, register associated HTTP
routes, etc.

The plugin `setup` and `start` contracts for this plugin are the following
type, which provides some runtime capabilities. Each built-in alertType will
have its own top-level property in the `IService` interface, if it needs to
expose functionality.

```ts
export interface IService {
  indexThreshold: {
    timeSeriesQuery(params: TimeSeriesQueryParameters): Promise<TimeSeriesResult>;
  }
}
```

Each built-in alertType is described in its own README:

- index threshold: [`server/alert_types/index_threshold`](server/alert_types/index_threshold/README.md)
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
{
  "id": "alertingBuiltins",
  "server": true,
  "version": "8.0.0",
  "kibanaVersion": "kibana",
  "requiredPlugins": ["alerting"],
  "ui": false
}
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License;
 * you may not use this file except in compliance with the Elastic License.
 */

import { Service, IRouter, AlertingSetup } from '../types';
import { register as registerIndexThreshold } from './index_threshold';

interface RegisterBuiltInAlertTypesParams {
  service: Service;
  router: IRouter;
  alerting: AlertingSetup;
  baseRoute: string;
}

export function registerBuiltInAlertTypes(params: RegisterBuiltInAlertTypesParams) {
  registerIndexThreshold(params);
}
Lines changed: 271 additions & 0 deletions
@@ -0,0 +1,271 @@
# built-in alertType index threshold

directory in plugin: `server/alert_types/index_threshold`

The index threshold alert type is designed to run an ES query over indices,
aggregating field values from documents, comparing them to threshold values,
and scheduling actions to run when the thresholds are met.

An example would be checking a monitoring index for percent cpu usage field
values that are greater than some threshold, which could then be used to invoke
an action (email, slack, etc) to notify interested parties when the threshold
is exceeded.

## alertType `.index-threshold`

The alertType parameters are specified in
[`lib/core_query_types.ts`][it-core-query]
and
[`alert_type_params.ts`][it-alert-params].

The alertType has a single actionGroup, `'threshold met'`. The `context` object
provided to actions is specified in
[`action_context.ts`][it-alert-context].

[it-alert-params]: alert_type_params.ts
[it-alert-context]: action_context.ts
[it-core-query]: lib/core_query_types.ts

### example

This example uses [kbn-action][]'s `kbn-alert` command to create the alert,
and [es-hb-sim][] to generate ES documents for the alert to run queries
against.

Start `es-hb-sim`:

```
es-hb-sim 1 es-hb-sim host-A https://elastic:changeme@localhost:9200
```

This will start indexing documents of the following form, to the `es-hb-sim`
index:

```
{"@timestamp":"2020-02-20T22:10:30.011Z","summary":{"up":1,"down":0},"monitor":{"status":"up","name":"host-A"}}
```

Press `u` to have it start writing "down" documents instead of "up" documents.

Create a server log action that we can use with the alert:

```
export ACTION_ID=`kbn-action create .server-log 'server-log' '{}' '{}' | jq -r '.id'`
```

Finally, create the alert:

```
kbn-alert create .index-threshold 'es-hb-sim threshold' 1s \
  '{
    index: es-hb-sim
    timeField: @timestamp
    aggType: average
    aggField: summary.up
    groupField: monitor.name.keyword
    window: 5s
    comparator: lessThan
    threshold: [ 0.6 ]
  }' \
  "[
    {
      group: threshold met
      id: '$ACTION_ID'
      params: {
        level: warn
        message: '{{context.message}}'
      }
    }
  ]"
```

This alert will run a query over the `es-hb-sim` index, using the `@timestamp`
field as the date field, using an `average` aggregation over the `summary.up`
field. The results are then aggregated by `monitor.name.keyword`. If we ran
another instance of `es-hb-sim`, using `host-B` instead of `host-A`, then the
alert would end up potentially scheduling actions for both, independently.
Within the alerting plugin, this grouping is also referred to as "instanceIds"
(`host-A` and `host-B` being distinct instanceIds, which can have actions
scheduled against them independently).

The `window` is set to `5s` which is 5 seconds. That means, every time the
alert runs its queries (every second, in the example above), it will run its
ES query over the last 5 seconds. Thus, the queries, over time, will overlap.
Sometimes that's what you want. Other times, maybe you just want to do
sampling, running an alert every hour, with a 5 minute window. Up to you!
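
To make the overlap concrete, here's a minimal sketch that computes the query
range for successive runs of an alert. The names `queryWindows` and
`overlapMs` are made up for illustration and are not part of the plugin; all
times are epoch milliseconds.

```ts
// Illustrative only: hypothetical helpers, not the plugin's implementation.
interface QueryWindow {
  start: number; // epoch millis, inclusive
  end: number; // epoch millis, the time the alert ran
}

// Each run queries the `windowMs` ending at the time of that run.
function queryWindows(
  firstRun: number,
  intervalMs: number,
  windowMs: number,
  runs: number
): QueryWindow[] {
  const result: QueryWindow[] = [];
  for (let i = 0; i < runs; i++) {
    const end = firstRun + i * intervalMs;
    result.push({ start: end - windowMs, end });
  }
  return result;
}

// How much two query ranges overlap, in millis (0 when they don't).
function overlapMs(a: QueryWindow, b: QueryWindow): number {
  return Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start));
}
```

With a `1s` interval and a `5s` window, consecutive runs share 4 seconds of
data. With an hourly interval and a 5 minute window they share none, which is
the "sampling" case.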

Using the `comparator` `lessThan` and `threshold` `[0.6]`, the alert will
calculate the average of all the `summary.up` fields for each unique
`monitor.name.keyword`, and then if the value is less than 0.6, it will
schedule the specified action (server log) to run. The `message` param
passed to the action includes a mustache template for the context variable
`message`, which is created by the alert type. That message generates
a generic but useful text message, already constructed. Alternatively,
a customer could set the `message` param in the action to a much more
complex message, using other context variables made available by the
alert type.
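
The per-group comparison can be sketched roughly as follows. This is not the
alertType's actual code: the real aggregation happens in the ES query, every
name here is hypothetical, `greaterThan` is an assumed counterpart to the
`lessThan` comparator used above, and the threshold is passed as a single
number rather than the `[ 0.6 ]` array form.

```ts
// Illustrative only: these names are hypothetical, not the alertType's code.
type Comparator = 'lessThan' | 'greaterThan'; // 'greaterThan' is assumed

interface Sample {
  group: string; // value of the groupField, e.g. "host-A"
  value: number; // value of the aggField, e.g. summary.up
}

// Average the sampled values for each group.
function averageByGroup(samples: Sample[]): { [group: string]: number } {
  const totals: { [group: string]: { sum: number; count: number } } = {};
  for (const sample of samples) {
    const entry = totals[sample.group] || { sum: 0, count: 0 };
    entry.sum += sample.value;
    entry.count += 1;
    totals[sample.group] = entry;
  }
  const averages: { [group: string]: number } = {};
  for (const group of Object.keys(totals)) {
    averages[group] = totals[group].sum / totals[group].count;
  }
  return averages;
}

// Groups whose average meets the threshold; each would get actions scheduled.
function groupsMeetingThreshold(
  samples: Sample[],
  comparator: Comparator,
  threshold: number
): string[] {
  const averages = averageByGroup(samples);
  return Object.keys(averages).filter(group =>
    comparator === 'lessThan' ? averages[group] < threshold : averages[group] > threshold
  );
}
```

For example, if `host-A` reports `summary.up` values of 0 and 1 in the window
(average 0.5) while `host-B` reports only 1s, then only `host-A` meets
`lessThan 0.6`.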

Here's the message you should see in the Kibana console, if everything is
working:

```
server log [17:32:10.060] [warning][actions][actions][plugins] \
  Server log: alert es-hb-sim threshold instance host-A value 0 \
  exceeded threshold average(summary.up) lessThan 0.6 over 5s \
  on 2020-02-20T22:32:07.000Z
```

[kbn-action]: https://github.com/pmuellr/kbn-action
[es-hb-sim]: https://github.com/pmuellr/es-hb-sim
[now-iso]: https://github.com/pmuellr/now-iso

## http endpoints

An HTTP endpoint is provided to return the values the alertType would calculate,
over a series of points in time. This is intended to be used in the alerting UI to
provide a "preview" of the alert during creation/editing based on recent data,
and could be used to show a "simulation" of the alert over an arbitrary
range of time.

The endpoint is `POST /api/alerting_builtins/index_threshold/_time_series_query`.
The request and response bodies are specified in
[`lib/core_query_types.ts`][it-core-query]
and
[`lib/time_series_types.ts`][it-timeSeries-types].
The request body is very similar to the alertType's parameters.

### example

Continuing with the example above, here's a query to get the values calculated
for the last 10 seconds.
This example uses [now-iso][] to generate iso date strings.

```console
curl -k "https://elastic:changeme@localhost:5601/api/alerting_builtins/index_threshold/_time_series_query" \
  -H "kbn-xsrf: foo" -H "content-type: application/json" -d "{
  \"index\": \"es-hb-sim\",
  \"timeField\": \"@timestamp\",
  \"aggType\": \"average\",
  \"aggField\": \"summary.up\",
  \"groupField\": \"monitor.name.keyword\",
  \"interval\": \"1s\",
  \"dateStart\": \"`now-iso -10s`\",
  \"dateEnd\": \"`now-iso`\",
  \"window\": \"5s\"
}"
```

```
{
  "results": [
    {
      "group": "host-A",
      "metrics": [
        [ "2020-02-26T15:10:40.000Z", 0 ],
        [ "2020-02-26T15:10:41.000Z", 0 ],
        [ "2020-02-26T15:10:42.000Z", 0 ],
        [ "2020-02-26T15:10:43.000Z", 0 ],
        [ "2020-02-26T15:10:44.000Z", 0 ],
        [ "2020-02-26T15:10:45.000Z", 0 ],
        [ "2020-02-26T15:10:46.000Z", 0 ],
        [ "2020-02-26T15:10:47.000Z", 0 ],
        [ "2020-02-26T15:10:48.000Z", 0 ],
        [ "2020-02-26T15:10:49.000Z", 0 ],
        [ "2020-02-26T15:10:50.000Z", 0 ]
      ]
    }
  ]
}
```

To get the current value of the calculated metric, you can leave off
`dateStart` and `dateEnd`:

```
curl -k "https://elastic:changeme@localhost:5601/api/alerting_builtins/index_threshold/_time_series_query" \
  -H "kbn-xsrf: foo" -H "content-type: application/json" -d '{
  "index": "es-hb-sim",
  "timeField": "@timestamp",
  "aggType": "average",
  "aggField": "summary.up",
  "groupField": "monitor.name.keyword",
  "interval": "1s",
  "window": "5s"
}'
```

```
{
  "results": [
    {
      "group": "host-A",
      "metrics": [
        [ "2020-02-26T15:23:36.635Z", 0 ]
      ]
    }
  ]
}
```
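
For callers written in TypeScript, the request and response shapes seen in
these examples could be modeled along the following lines. This is only a
sketch: the type names `TimeSeriesQueryBody` and `TimeSeriesResponse` and the
helper `latestValues` are hypothetical, and the shapes mirror just the fields
shown in the examples above. The authoritative definitions are in the
`lib/*_types.ts` files.

```ts
// Hypothetical names; shapes mirror only the fields shown in the examples.
interface TimeSeriesQueryBody {
  index: string;
  timeField: string;
  aggType: string;
  aggField: string;
  groupField: string;
  interval: string;
  window: string;
  dateStart?: string; // ISO date string; left off for "current value"
  dateEnd?: string; // ISO date string; left off for "current value"
}

interface TimeSeriesResponse {
  results: Array<{
    group: string;
    metrics: Array<[string, number]>; // [ISO date, calculated value]
  }>;
}

// Request body for the "current value" form: no dateStart / dateEnd.
const currentValueQuery: TimeSeriesQueryBody = {
  index: 'es-hb-sim',
  timeField: '@timestamp',
  aggType: 'average',
  aggField: 'summary.up',
  groupField: 'monitor.name.keyword',
  interval: '1s',
  window: '5s',
};

// Returns the most recent calculated value for each group in a response.
function latestValues(response: TimeSeriesResponse): { [group: string]: number } {
  const latest: { [group: string]: number } = {};
  for (const result of response.results) {
    const last = result.metrics[result.metrics.length - 1];
    if (last !== undefined) {
      latest[result.group] = last[1];
    }
  }
  return latest;
}
```

Groups with an empty `metrics` array are simply skipped by `latestValues`.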

[it-timeSeries-types]: lib/time_series_types.ts

## service functions

A single service function is available that provides the functionality
of the http endpoint `POST /api/alerting_builtins/index_threshold/_time_series_query`,
but as an API for Kibana plugins. The function is available as
`alertingService.indexThreshold.timeSeriesQuery()`.

The parameters and return value for the function are the same as for the HTTP
request, though some additional parameters are required (logger, callCluster,
etc).

## notes on the timeSeriesQuery API / http endpoint

This API provides additional parameters beyond what the alertType itself uses:

- `dateStart`
- `dateEnd`
- `interval`

The `dateStart` and `dateEnd` parameters are ISO date strings.

The `interval` parameter is intended to model the `interval` the alert is
currently using, and uses the same `1s`, `2m`, `3h`, etc format. Over the
supplied date range, a time-series data point will be calculated every
`interval` duration.

So the number of time-series points in the output of the API should be:

```
( dateEnd - dateStart ) / interval
```

Example:

```
dateStart: '2020-01-01T00:00:00'
dateEnd: '2020-01-02T00:00:00'
interval: '1h'
```

The date range is 1 day === 24 hours. The interval is 1 hour. So there should
be ~24 time series points in the output.
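
The same arithmetic as a small sketch. The helpers `parseDurationMs` and
`expectedPoints` are hypothetical and not part of the plugin, and accepting a
`d` (days) unit alongside `s`, `m` and `h` is an assumption.

```ts
// Illustrative only: hypothetical helpers, not the plugin's implementation.
const UNIT_MS: { [unit: string]: number } = {
  s: 1000,
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000, // assumed unit, not shown in the text above
};

// Parse durations such as "1s", "2m" or "3h" into milliseconds.
function parseDurationMs(duration: string): number {
  const match = /^(\d+)([smhd])$/.exec(duration);
  if (match === null) {
    throw new Error(`invalid duration: ${duration}`);
  }
  return parseInt(match[1], 10) * UNIT_MS[match[2]];
}

// Approximate number of points: ( dateEnd - dateStart ) / interval
function expectedPoints(dateStart: string, dateEnd: string, interval: string): number {
  const rangeMs = Date.parse(dateEnd) - Date.parse(dateStart);
  return Math.floor(rangeMs / parseDurationMs(interval));
}
```

The count is approximate because the response may include a point at both ends
of the range: the sample response in the http endpoints example has 11 points
for a 10 second range at a `1s` interval.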

For preview purposes:

- The `groupLimit` parameter should be used to help cut
  down on the amount of work ES does, and keep the generated graphs a little
  simpler. Probably something like `10`.

- For queries with long date ranges, you probably don't want to use the
  `interval` the alert is set to, as the `interval` used in the query, as this
  could result in a lot of time-series points being generated, which is both
  costly in ES, and may result in noisy graphs.

- The `window` parameter should be the same as what the alert is using,
  especially for the `count` and `sum` aggregation types. Those aggregations
  don't scale the same way the others do, when the window changes. Even for
  the other aggregations, changing the window could result in dramatically
  different values being generated - `averages` will be more "average-y", `min`
  and `max` will be a little stickier.
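
One way a preview caller might pick a coarser `interval` for long date ranges
is to cap the number of points. This is a sketch under assumptions: the
helpers `previewIntervalMs` and `toSecondsDuration` are hypothetical, and any
particular cap (such as 100 points) is an arbitrary choice, not a value taken
from the plugin.

```ts
// Illustrative only: hypothetical helpers, not part of the plugin.

// Pick the interval (in ms) to send with a preview query: the alert's own
// interval, unless that would produce more than `maxPoints` points.
function previewIntervalMs(
  rangeMs: number,
  alertIntervalMs: number,
  maxPoints: number
): number {
  // Smallest interval that keeps the series at or under maxPoints.
  const coarsest = Math.ceil(rangeMs / maxPoints);
  return Math.max(alertIntervalMs, coarsest);
}

// Format milliseconds as whole seconds, rounding up, e.g. 864000 becomes "864s".
function toSecondsDuration(ms: number): string {
  return `${Math.ceil(ms / 1000)}s`;
}
```

For a one day range, an alert interval of `1s` and a cap of 100 points, this
yields an `interval` of `864s`. The `window` would still be sent unchanged, as
described in the last bullet above.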
