
Alarms do not auto-reset #3

Closed
AnthonySteele opened this issue Jan 9, 2017 · 8 comments

Comments

@AnthonySteele
Contributor

AnthonySteele commented Jan 9, 2017

We have an existing system that goes via statsd, Graphite and Seyren. When an alarm goes off, we typically respond by taking actions that rectify the situation, watch the relevant metrics return to normal levels, and the alarm is automatically resolved by Seyren. This is useful.

Watchman gives us a simpler and faster system that goes from CloudWatch to PagerDuty using PagerDuty URLs generated via https://www.pagerduty.com/docs/guides/aws-cloudwatch-integration-guide/. The main drawback is that, as far as I can see, the alarm is never automatically resolved even when the underlying CloudWatch metric is back to normal; it has to be resolved manually in PagerDuty.

The PagerDuty support team says:

You can add another notification to your alarm so that whenever the state is OK an event is sent to PagerDuty. This is outlined in the integration guide beginning at Step 12.

This will ensure that when the alarm returns to an OK state, the corresponding PagerDuty alert will be resolved. This resolve action only syncs one-way, from AWS CloudWatch to PagerDuty.

At this time, the INSUFFICIENT_DATA state cannot resolve a PagerDuty incident. If you need that alarm state to resolve a PD incident, you will want to set up an email integration between AWS and PagerDuty so that you can customize the rules for how PagerDuty handles events sent from AWS.

@tomhaigh
Contributor

tomhaigh commented Jan 9, 2017

Regarding "At this time, the INSUFFICIENT_DATA state can not resolve a PagerDuty incident" - I don't think that is something we would want for all alarms so would need to make sure if it does become possible that we apply it sensibly. Some alarms created via watchman specifically state that INSUFFICIENT_DATA will even raise an alert.
Adding the OK state notification would obviously be trivial in watchman if we wanted to

@AnthonySteele
Contributor Author

AnthonySteele commented Jan 9, 2017

I think we should add the OK state notification; it will add value for the majority of alarms. This shortcoming has been noticed on-call when alarms are actually raised.

The case of "INSUFFICIENT_DATA to reset" seems like an edge case? Most watchman alarms will be triggered because there is a continuous stream of data and it is currently out of range. Should be reset when it goes back in range. Others are triggered on INSUFFICIENT_DATA.

@tomhaigh
Contributor

tomhaigh commented Jan 9, 2017

"INSUFFICIENT_DATA to reset" could be useful in some cases where you don't get zero data points - e.g. I think this is the case for ELB 5xx count. If you have a spike and then zero subsequently , you might not ever see an OK state, so treating INSUFFICIENT_DATA as zero makes sense. In other cases it obviously doesn't (e.g. custom metrics where it might mean the publisher has broken).
So yeah I think it's a bit of an edge case that we can leave for now.

@AnthonySteele
Contributor Author

AnthonySteele commented Jan 9, 2017

Looks like the first action is: wherever we have AlarmActions = new List<string> { snsTopicArn } in the code (it's in 5 places), we should also have OKActions = new List<string> { snsTopicArn }, and modify the InspectExistingAlarm methods so that the lack of an OkAction triggers the need for an update.
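
For illustration, a minimal sketch of what that change might look like, assuming the AWS SDK for .NET's PutMetricAlarmRequest; the method name, metric and threshold values here are hypothetical, not Watchman's actual code:

```csharp
using System.Collections.Generic;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

// Hypothetical sketch only: illustrates the shape of the change, not Watchman's real code.
public static class AlarmRequestSketch
{
    public static PutMetricAlarmRequest Build(string alarmName, string snsTopicArn)
    {
        return new PutMetricAlarmRequest
        {
            AlarmName = alarmName,
            Namespace = "AWS/ELB",                  // example namespace
            MetricName = "HTTPCode_Backend_5XX",    // example metric
            Statistic = Statistic.Sum,
            Period = 300,
            EvaluationPeriods = 1,
            Threshold = 1200,
            ComparisonOperator = ComparisonOperator.GreaterThanOrEqualToThreshold,

            // Existing behaviour: notify the SNS topic when the alarm fires.
            AlarmActions = new List<string> { snsTopicArn },

            // Proposed addition: also notify on the transition back to OK,
            // so PagerDuty can auto-resolve the incident.
            OKActions = new List<string> { snsTopicArn }
        };
    }
}
```

Correspondingly, the existing-alarm comparison would treat an empty OKActions list as a difference requiring an update.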

@AnthonySteele
Contributor Author

AnthonySteele commented Jan 16, 2017

Merged, but it seems to break email integration, which does not inspect the text and sounds the alarm even on e.g.

- State Change:               INSUFFICIENT_DATA -> OK
- Reason for State Change:    Threshold Crossed: 1 datapoint (2.0) was not 
                              greater than or equal to the threshold (1200.0).

@AnthonySteele
Contributor Author

AnthonySteele commented Jan 17, 2017

So either we need a flag in config to disable OKActions, or to disable them when there's an email target (what if there are two targets?).

Or we just tell users to avoid email targets, as they would have to parse the contents of the email to check whether it's an alarm, an alarm reset or a no-op. We are already moving email integrations to URL targets to mitigate this issue.

@tomhaigh
Contributor

We could create an SNS topic for each type of notification target - i.e. an email one (containing all email subscriptions) and a URL one (with all URL subscriptions) for each alerting group. That would mean we could restrict what types of notification go to each type of target without additional configuration.
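
As a rough illustration of that idea, a sketch using the AWS SDK for .NET SNS client; the topic naming scheme, class and parameter names are assumptions, not an actual Watchman design:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.SimpleNotificationService;
using Amazon.SimpleNotificationService.Model;

// Hypothetical sketch: one SNS topic per target type for an alerting group.
public class PerTargetTypeTopics
{
    private readonly IAmazonSimpleNotificationService _sns;

    public PerTargetTypeTopics(IAmazonSimpleNotificationService sns)
    {
        _sns = sns;
    }

    public async Task CreateAsync(
        string alertingGroupName,
        IEnumerable<string> emailTargets,
        IEnumerable<string> urlTargets)
    {
        // Separate topics mean OK notifications can later be wired up
        // only to the URL topic, leaving email targets alarm-only.
        var emailTopic = await _sns.CreateTopicAsync(
            new CreateTopicRequest(alertingGroupName + "-email"));
        var urlTopic = await _sns.CreateTopicAsync(
            new CreateTopicRequest(alertingGroupName + "-url"));

        foreach (var email in emailTargets)
        {
            await _sns.SubscribeAsync(
                new SubscribeRequest(emailTopic.TopicArn, "email", email));
        }

        foreach (var url in urlTargets)
        {
            await _sns.SubscribeAsync(
                new SubscribeRequest(urlTopic.TopicArn, "https", url));
        }
    }
}
```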

@AnthonySteele
Contributor Author

AnthonySteele commented Jan 23, 2017

Suggest that we close this issue and make that a separate one, along the lines of "Email target should only get notified on transition into error state, as the PagerDuty email trigger can't tell them apart".
