
Alarms do not auto-reset #3

Closed
AnthonySteele opened this issue Jan 9, 2017 · 8 comments

Comments

@AnthonySteele
Contributor

AnthonySteele commented Jan 9, 2017

We have an existing system that goes via statsd, Graphite and Seyren. When an alarm goes off, we typically respond by taking actions that rectify the situation, watch the relevant metrics return to normal levels, and the alarm is automatically resolved by Seyren. This is useful.

Watchman gives us a simpler and faster system that goes from CloudWatch to PagerDuty using PagerDuty URLs generated via https://www.pagerduty.com/docs/guides/aws-cloudwatch-integration-guide/. The main drawback is that, as far as I can see, the alarm is never automatically resolved even when the underlying CloudWatch metric is back to normal; it has to be resolved manually in PagerDuty.

The PagerDuty support team says:

You can add another notification to your alarm so that whenever the state is OK an event is sent to PagerDuty. This is outlined in the integration guide beginning at Step 12.

This will ensure that when the alarm returns to an OK state, the corresponding PagerDuty alert will be resolved. This resolve action only syncs one-way, from AWS CloudWatch to PagerDuty.

At this time, the INSUFFICIENT_DATA state cannot resolve a PagerDuty incident. If you need that alarm state to resolve a PD incident, you will want to set up an email integration between AWS and PagerDuty so that you can customize the rules for how PagerDuty handles events sent from AWS.

@tomhaigh
Contributor

tomhaigh commented Jan 9, 2017

Regarding "At this time, the INSUFFICIENT_DATA state can not resolve a PagerDuty incident" - I don't think that is something we would want for all alarms so would need to make sure if it does become possible that we apply it sensibly. Some alarms created via watchman specifically state that INSUFFICIENT_DATA will even raise an alert.
Adding the OK state notification would obviously be trivial in watchman if we wanted to

@AnthonySteele
Contributor Author

AnthonySteele commented Jan 9, 2017

I think we should add the OK state notification; it will add value for the majority of alarms. This shortcoming has been noticed on-call when alarms are actually raised.

The case of "INSUFFICIENT_DATA to reset" seems like an edge case? Most watchman alarms will be triggered because there is a continuous stream of data and it is currently out of range. Should be reset when it goes back in range. Others are triggered on INSUFFICIENT_DATA.

@tomhaigh
Contributor

tomhaigh commented Jan 9, 2017

"INSUFFICIENT_DATA to reset" could be useful in some cases where you don't get zero data points - e.g. I think this is the case for ELB 5xx count. If you have a spike and then zero subsequently , you might not ever see an OK state, so treating INSUFFICIENT_DATA as zero makes sense. In other cases it obviously doesn't (e.g. custom metrics where it might mean the publisher has broken).
So yeah I think it's a bit of an edge case that we can leave for now.

@AnthonySteele
Contributor Author

AnthonySteele commented Jan 9, 2017

Looks like the first action is: wherever we have AlarmActions = new List<string> { snsTopicArn } in the code (it's in 5 places), we should also have OKActions = new List<string> { snsTopicArn }, and modify the InspectExistingAlarm methods so that the lack of an OkAction triggers the need for an update.
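
For illustration, a minimal sketch of what that change might look like, assuming the AWS SDK for .NET's PutMetricAlarmRequest; the method name, metric and threshold values here are hypothetical, not Watchman's actual code:

```csharp
using System.Collections.Generic;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

// Hypothetical sketch only: illustrates the shape of the change, not Watchman's real code.
public static class AlarmRequestSketch
{
    public static PutMetricAlarmRequest Build(string alarmName, string snsTopicArn)
    {
        return new PutMetricAlarmRequest
        {
            AlarmName = alarmName,
            Namespace = "AWS/ELB",                  // example namespace
            MetricName = "HTTPCode_Backend_5XX",    // example metric
            Statistic = Statistic.Sum,
            Period = 300,
            EvaluationPeriods = 1,
            Threshold = 1200,
            ComparisonOperator = ComparisonOperator.GreaterThanOrEqualToThreshold,

            // Existing behaviour: notify the SNS topic when the alarm fires.
            AlarmActions = new List<string> { snsTopicArn },

            // Proposed addition: also notify on the transition back to OK,
            // so PagerDuty can auto-resolve the incident.
            OKActions = new List<string> { snsTopicArn }
        };
    }
}
```

Correspondingly, the existing-alarm comparison would treat an empty OKActions list as a difference requiring an update.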

@AnthonySteele
Contributor Author

AnthonySteele commented Jan 16, 2017

Merged, but it seems to break email integration, which does not inspect the text and sounds the alarm even on e.g.

- State Change:               INSUFFICIENT_DATA -> OK
- Reason for State Change:    Threshold Crossed: 1 datapoint (2.0) was not 
                              greater than or equal to the threshold (1200.0).

@AnthonySteele
Contributor Author

AnthonySteele commented Jan 17, 2017

So either we need a flag in config to disable OKActions, or to disable them when there's an email target (what if there are two targets?).

Or we just tell users to avoid email targets, as they would have to parse the contents of the email to check whether it's an alarm, an alarm reset or a no-op. We are already moving email integrations to URL targets to mitigate this issue.

@tomhaigh
Contributor

We could create an SNS topic for each type of notification target - i.e. an email one (containing all email subscriptions) and a URL one (with all URL subscriptions) for each alerting group. That would mean we could restrict what types of notification go to each type of target without additional configuration.
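
As a rough illustration of that idea, a sketch using the AWS SDK for .NET SNS client; the topic naming scheme, class and parameter names are assumptions, not an actual Watchman design:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.SimpleNotificationService;
using Amazon.SimpleNotificationService.Model;

// Hypothetical sketch: one SNS topic per target type for an alerting group.
public class PerTargetTypeTopics
{
    private readonly IAmazonSimpleNotificationService _sns;

    public PerTargetTypeTopics(IAmazonSimpleNotificationService sns)
    {
        _sns = sns;
    }

    public async Task CreateAsync(
        string alertingGroupName,
        IEnumerable<string> emailTargets,
        IEnumerable<string> urlTargets)
    {
        // Separate topics mean OK notifications can later be wired up
        // only to the URL topic, leaving email targets alarm-only.
        var emailTopic = await _sns.CreateTopicAsync(
            new CreateTopicRequest(alertingGroupName + "-email"));
        var urlTopic = await _sns.CreateTopicAsync(
            new CreateTopicRequest(alertingGroupName + "-url"));

        foreach (var email in emailTargets)
        {
            await _sns.SubscribeAsync(
                new SubscribeRequest(emailTopic.TopicArn, "email", email));
        }

        foreach (var url in urlTargets)
        {
            await _sns.SubscribeAsync(
                new SubscribeRequest(urlTopic.TopicArn, "https", url));
        }
    }
}
```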

@AnthonySteele
Contributor Author

AnthonySteele commented Jan 23, 2017

Suggest that we close this issue and make that a separate one, along the lines of "Email target should only get notified on transition into error state, as the PagerDuty email trigger can't tell them apart".
