# alerting_builtins plugin

This plugin provides alertTypes shipped with Kibana for use with
[the alerting plugin](../alerting/README.md). When enabled, it will register
the built-in alertTypes with the alerting plugin, register associated HTTP
routes, etc.

The plugin's `setup` and `start` contracts are the following type, which
provides some runtime capabilities. Each built-in alertType will have its own
top-level property in the `IService` interface, if it needs to expose
functionality.

```ts
export interface IService {
  indexThreshold: {
    timeSeriesQuery(params: TimeSeriesQueryParameters): Promise<TimeSeriesResult>;
  };
}
```

## built-in alertType index threshold

The index threshold alert type is designed to run an ES query over indices,
aggregating field values from documents, comparing them to threshold values,
and scheduling actions to run when the thresholds are met.

An example would be checking a monitoring index for percent CPU usage field
values that are greater than some threshold, which could then be used to invoke
an action (email, Slack, etc.) to notify interested parties when the threshold
is exceeded.

### alertType `.index-threshold`

The alertType parameters are specified in
[`index_threshold/lib/core_query_types.ts`][it-core-query]
and
[`index_threshold/alert_type_params.ts`][it-alert-params].

The alertType has a single actionGroup, `'threshold met'`. The `context` object
provided to actions is specified in
[`index_threshold/action_context.ts`][it-alert-context].

[it-alert-params]: server/alert_types/index_threshold/alert_type_params.ts
[it-alert-context]: server/alert_types/index_threshold/action_context.ts
[it-core-query]: server/alert_types/index_threshold/lib/core_query_types.ts
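
For reference, here is a rough sketch of the parameter shape, inferred from the
example below; the linked files above contain the authoritative, validated
schema definitions.

```ts
// Rough sketch only -- inferred from the example in this README; see
// index_threshold/lib/core_query_types.ts and alert_type_params.ts for the
// authoritative schema.
interface IndexThresholdParamsSketch {
  index: string;       // index (or index pattern) to query
  timeField: string;   // date field used for the time window
  aggType: string;     // e.g. 'count', 'average', 'min', 'max', 'sum'
  aggField?: string;   // field to aggregate over (not needed for 'count')
  groupField?: string; // field whose values group the results
  window: string;      // time window each query covers, e.g. '5s', '5m'
  comparator: string;  // e.g. 'lessThan', 'greaterThan'
  threshold: number[]; // value(s) the calculated metric is compared against
}
```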

#### example

This example uses [kbn-action][]'s `kbn-alert` command to create the alert,
and [es-hb-sim][] to generate ES documents for the alert to run queries
against.

Start `es-hb-sim`:

```
es-hb-sim 1 es-hb-sim host-A https://elastic:changeme@localhost:9200
```

This will start indexing documents of the following form into the `es-hb-sim`
index:

```
{"@timestamp":"2020-02-20T22:10:30.011Z","summary":{"up":1,"down":0},"monitor":{"status":"up","name":"host-A"}}
```

Press `u` to have it start writing "down" documents instead of "up" documents.

Create a server log action that we can use with the alert:

```
export ACTION_ID=`kbn-action create .server-log 'server-log' '{}' '{}' | jq -r '.id'`
```

Finally, create the alert:

```
kbn-alert create .index-threshold 'es-hb-sim threshold' 1s \
  '{
    index: es-hb-sim
    timeField: @timestamp
    aggType: average
    aggField: summary.up
    groupField: monitor.name.keyword
    window: 5s
    comparator: lessThan
    threshold: [ 0.6 ]
  }' \
  "[
    {
      group: threshold met
      id: '$ACTION_ID'
      params: {
        level: warn
        message: '{{context.message}}'
      }
    }
  ]"
```

This alert will run a query over the `es-hb-sim` index, using the `@timestamp`
field as the date field and an `average` aggregation over the `summary.up`
field. The results are then grouped by `monitor.name.keyword`. If we ran
another instance of `es-hb-sim` using `host-B` instead of `host-A`, the alert
could end up scheduling actions for both, independently. Within the alerting
plugin, this grouping is also referred to as "instanceIds" (`host-A` and
`host-B` being distinct instanceIds, which can have actions scheduled against
them independently).
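
Under the hood, the data fetch for this alert is roughly a terms aggregation
over the group field with a metric sub-aggregation, restricted to the query
window. The following is an illustrative sketch only (written as a TypeScript
object holding an Elasticsearch search body), not the plugin's actual query.

```ts
// Illustrative only -- roughly what the example alert's data fetch looks like;
// the real query is built inside the index_threshold lib.
const searchBody = {
  size: 0,
  query: {
    // only documents from the last `window` (5s in the example)
    bool: { filter: [{ range: { '@timestamp': { gte: 'now-5s' } } }] },
  },
  aggs: {
    groups: {
      // one bucket per monitor.name.keyword value (host-A, host-B, ...)
      terms: { field: 'monitor.name.keyword' },
      aggs: {
        // average of summary.up within each group
        metric: { avg: { field: 'summary.up' } },
      },
    },
  },
};
```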

The `window` is set to `5s`, which is 5 seconds. That means every time the
alert runs its queries (every second, in the example above), it will run its
ES query over the last 5 seconds. Thus, the queries, over time, will overlap.
Sometimes that's what you want. Other times, maybe you just want to do
sampling, running an alert every hour with a 5 minute window. Up to you!

Using the `comparator` `lessThan` and `threshold` `[0.6]`, the alert will
calculate the average of all the `summary.up` fields for each unique
`monitor.name.keyword`, and then, if the value is less than 0.6, it will
schedule the specified action (server log) to run. The `message` param
passed to the action includes a mustache template for the context variable
`message`, which is created by the alert type and contains a generic but
useful text message, already constructed. Alternatively, a user could set
the `message` param in the action to a much more complex message, using
other context variables made available by the alert type.
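
For example, such a custom message might look like the action params sketched
below (shown as a TypeScript object for illustration). The context variable
names used in the template are assumptions; check
[`index_threshold/action_context.ts`][it-alert-context] for the actual context
shape the alert type provides.

```ts
// Hypothetical custom params for the server log action -- the context
// variable names in the mustache template are assumptions, not confirmed
// against action_context.ts.
const serverLogParams = {
  level: 'warn',
  message:
    'host {{context.group}} averaged {{context.value}} for summary.up at {{context.date}}',
};
```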

Here's the message you should see in the Kibana console, if everything is
working:

```
server log [17:32:10.060] [warning][actions][actions][plugins] \
   Server log: alert es-hb-sim threshold instance host-A value 0 \
   exceeded threshold average(summary.up) lessThan 0.6 over 5s \
   on 2020-02-20T22:32:07.000Z
```

[kbn-action]: https://github.com/pmuellr/kbn-action
[es-hb-sim]: https://github.com/pmuellr/es-hb-sim

### http endpoints

An HTTP endpoint is provided to return the values the alertType would calculate
over a range of time. This is intended to be used in the alerting UI to
provide a "preview" of the alert during creation/editing based on recent data,
and could be used to show a "simulation" of the alert over an arbitrary
range of time.

The endpoint is `POST /api/alerting_builtins/index_threshold/_time_series_query`.
The request and response bodies are specified in
[`index_threshold/lib/core_query_types.ts`][it-core-query]
and
[`index_threshold/lib/time_series_types.ts`][it-timeSeries-types].
The request body is very similar to the alertType's parameters.

#### example

Continuing with the example above, here's a query to get the values calculated
for the last 10 seconds:

_note: you'll need to change the `dateStart` and `dateEnd`
values as appropriate_

```console
curl -k "https://elastic:changeme@localhost:5601/api/alerting_builtins/index_threshold/_time_series_query" \
  -H "kbn-xsrf: foo" -H "content-type: application/json" -d '{
    "index": "es-hb-sim",
    "timeField": "@timestamp",
    "aggType": "average",
    "aggField": "summary.up",
    "groupField": "monitor.name.keyword",
    "interval": "1s",
    "dateStart": "2020-02-26T15:10:40.000Z",
    "dateEnd": "2020-02-26T15:10:50.000Z",
    "window": "5s"
}'
```

```
{
  "results": [
    {
      "group": "host-A",
      "metrics": [
        [ "2020-02-26T15:10:40.000Z", 0 ],
        [ "2020-02-26T15:10:41.000Z", 0 ],
        [ "2020-02-26T15:10:42.000Z", 0 ],
        [ "2020-02-26T15:10:43.000Z", 0 ],
        [ "2020-02-26T15:10:44.000Z", 0 ],
        [ "2020-02-26T15:10:45.000Z", 0 ],
        [ "2020-02-26T15:10:46.000Z", 0 ],
        [ "2020-02-26T15:10:47.000Z", 0 ],
        [ "2020-02-26T15:10:48.000Z", 0 ],
        [ "2020-02-26T15:10:49.000Z", 0 ],
        [ "2020-02-26T15:10:50.000Z", 0 ]
      ]
    }
  ]
}
```

To get the current value of the calculated metric, you can leave off the dates:

```
curl -k "https://elastic:changeme@localhost:5601/api/alerting_builtins/index_threshold/_time_series_query" \
  -H "kbn-xsrf: foo" -H "content-type: application/json" -d '{
    "index": "es-hb-sim",
    "timeField": "@timestamp",
    "aggType": "average",
    "aggField": "summary.up",
    "groupField": "monitor.name.keyword",
    "interval": "1s",
    "window": "5s"
}'
```

```
{
  "results": [
    {
      "group": "host-A",
      "metrics": [
        [ "2020-02-26T15:23:36.635Z", 0 ]
      ]
    }
  ]
}
```

[it-timeSeries-types]: server/alert_types/index_threshold/lib/time_series_types.ts

### service functions

A single service function is available that provides the functionality
of the http endpoint `POST /api/alerting_builtins/index_threshold/_time_series_query`,
but as an API for Kibana plugins. The function is available as
`alertingService.indexThreshold.timeSeriesQuery()`.

The parameters and return value for the function are the same as for the HTTP
request, though some additional parameters are required (logger, callCluster,
etc.).
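
Below is a minimal sketch of calling it from another plugin. It is not the
authoritative signature: it assumes the start contract of this plugin is
available to the caller (here named `alertingBuiltins`), and that the
parameters bundle the caller's logger and callCluster together with the same
query fields the HTTP endpoint accepts; confirm the real shape in
[`index_threshold/lib/time_series_types.ts`][it-timeSeries-types].

```ts
// Minimal sketch only -- types are loosened to `any` because the exact
// TimeSeriesQueryParameters shape is not reproduced here.
async function previewUpMetric(alertingBuiltins: any, logger: any, callCluster: any) {
  return await alertingBuiltins.indexThreshold.timeSeriesQuery({
    logger,       // the calling plugin's Kibana logger
    callCluster,  // the calling plugin's Elasticsearch cluster-call function
    query: {
      index: 'es-hb-sim',
      timeField: '@timestamp',
      aggType: 'average',
      aggField: 'summary.up',
      groupField: 'monitor.name.keyword',
      interval: '1s',
      window: '5s',
    },
  });
}
```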

### notes on the timeSeriesQuery API / http endpoint

This API provides additional parameters beyond what the alertType itself uses:

- `dateStart`
- `dateEnd`
- `interval`

The `dateStart` and `dateEnd` parameters are ISO date strings.

The `interval` parameter is intended to model the `interval` the alert is
currently using, and uses the same `1s`, `2m`, `3h`, etc. format. Over the
supplied date range, a time-series data point will be calculated every
`interval` duration.

So the number of time-series points in the output of the API should be:

```
( dateEnd - dateStart ) / interval
```

Example:

```
dateStart: '2020-01-01T00:00:00'
dateEnd: '2020-01-02T00:00:00'
interval: '1h'
```

The date range is 1 day === 24 hours. The interval is 1 hour. So there should
be ~24 time-series points in the output.

For preview purposes (a combined request sketch follows this list):

- The `groupLimit` parameter should be used to help cut down on the amount of
  work ES does, and to keep the generated graphs a little simpler; probably
  something like `10`.

- For queries with long date ranges, you probably don't want to use the
  `interval` the alert is set to as the `interval` in the query, since this
  could result in a lot of time-series points being generated, which is both
  costly in ES and may result in noisy graphs.

- The `window` parameter should be the same as what the alert is using,
  especially for the `count` and `sum` aggregation types. Those aggregations
  don't scale the same way the others do when the window changes. Even for
  the other aggregations, changing the window could result in dramatically
  different values being generated: `average` will be more "average-y", and
  `min` and `max` will be a little stickier.
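
Putting that guidance together, a preview request body might look something
like the sketch below (shown as a TypeScript object; the field values are
illustrative, and the exact set of accepted fields is defined in
[`core_query_types.ts`][it-core-query] and
[`time_series_types.ts`][it-timeSeries-types]).

```ts
// Illustrative preview request body for the _time_series_query endpoint,
// following the guidance above: keep the alert's window, use a coarser
// interval for a long date range, and cap the number of groups returned.
const previewRequestBody = {
  index: 'es-hb-sim',
  timeField: '@timestamp',
  aggType: 'average',
  aggField: 'summary.up',
  groupField: 'monitor.name.keyword',
  groupLimit: 10,                       // limit the number of groups, per the note above
  dateStart: '2020-02-19T00:00:00.000Z',
  dateEnd: '2020-02-26T00:00:00.000Z',  // one week of data
  window: '5s',                         // same window the alert uses
  interval: '1h',                       // coarser than the alert's 1s interval
};
```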