Alarms do not auto-reset #3
Regarding "At this time, the INSUFFICIENT_DATA state can not resolve a PagerDuty incident": I don't think that is something we would want for all alarms, so if it does become possible we would need to make sure we apply it sensibly. Some alarms created via watchman are specifically configured so that INSUFFICIENT_DATA will itself raise an alert.
I think that we should add the OK state notification; it will add value for the majority of alarms. This shortcoming has been noticed when alarms are actually raised on-call. The case of "INSUFFICIENT_DATA to reset" seems like an edge case: most watchman alarms are triggered because there is a continuous stream of data that is currently out of range, and they should reset when it goes back in range. Others are triggered on INSUFFICIENT_DATA.
"INSUFFICIENT_DATA to reset" could be useful in some cases where you don't get zero data points - e.g. I think this is the case for ELB 5xx count. If you have a spike and then zero subsequently , you might not ever see an OK state, so treating INSUFFICIENT_DATA as zero makes sense. In other cases it obviously doesn't (e.g. custom metrics where it might mean the publisher has broken). |
Looks like the first action is, wherever we have […]
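Presumably that means: wherever an alarm action is configured, also configure an OK action so the PagerDuty incident is resolved when the alarm transitions back to OK. A minimal boto3 sketch under that assumption (names, metric and values are hypothetical):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

pagerduty_topic = "arn:aws:sns:eu-west-1:123456789012:pagerduty-alerts"  # placeholder ARN

cloudwatch.put_metric_alarm(
    AlarmName="example-alarm",                    # made-up name
    Namespace="AWS/ELB",
    MetricName="Latency",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[pagerduty_topic],  # notify on transition into ALARM
    OKActions=[pagerduty_topic],     # also notify on transition back to OK, resolving the incident
)
```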
Merged, but it seems to break the email integration, which does not look at the message text and so sounds the alarm on e.g. […]
So, either we need a way to disable the OK notification, or we just tell users to avoid email targets, as they might have to parse the contents of the email to check whether it's an alarm, an alarm reset or a no-op. We are already moving email integrations to URL targets to mitigate this issue.
We could create an SNS topic for each type of notification target - i.e. an email topic (containing all email subscriptions) and a URL topic (containing all URL subscriptions) for each alerting group. That would mean we could restrict which types of notification go to each type of target without additional configuration.
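A rough boto3 sketch of that idea, under the assumption that only the URL topic should receive OK notifications (topic names, endpoints and the alarm itself are all hypothetical):

```python
import boto3

sns = boto3.client("sns")
cloudwatch = boto3.client("cloudwatch")

# One topic per target type for the alerting group (names are made up).
email_topic = sns.create_topic(Name="my-group-alerts-email")["TopicArn"]
url_topic = sns.create_topic(Name="my-group-alerts-url")["TopicArn"]

sns.subscribe(TopicArn=email_topic, Protocol="email", Endpoint="team@example.com")
sns.subscribe(TopicArn=url_topic, Protocol="https", Endpoint="https://example.com/alert-hook")

# Email targets only hear about transitions into ALARM; URL targets
# (e.g. a PagerDuty endpoint) also hear about transitions back to OK,
# so incidents can auto-resolve without confusing the email trigger.
cloudwatch.put_metric_alarm(
    AlarmName="example-alarm",
    Namespace="AWS/ELB",
    MetricName="HTTPCode_ELB_5XX",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[email_topic, url_topic],
    OKActions=[url_topic],
)
```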
Suggest that we close this issue and make that a separate one, along the lines of "Email targets should only be notified on the transition into the error state, as the PagerDuty email trigger can't tell the notification types apart".
We have an existing system that goes via statsd, Graphite and Seyren. When an alarm goes off, we typically respond by taking actions that rectify the situation, watch the relevant metrics go back to normal levels, and the alarm is automatically resolved by Seyren. This is useful.
Watchman gives us a simpler and faster system that goes from CloudWatch to PagerDuty using PagerDuty URLs generated via https://www.pagerduty.com/docs/guides/aws-cloudwatch-integration-guide/ The main drawback is that, as far as I can see, the alarm is never automatically resolved even when the underlying CloudWatch metric is back to normal; it requires manual resolution in PagerDuty.
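For context, that wiring is typically an SNS topic with an HTTPS subscription pointing at the endpoint URL generated by following the integration guide, with the CloudWatch alarm publishing to that topic. A sketch with placeholder values (the topic name and endpoint are not real):

```python
import boto3

sns = boto3.client("sns")

topic_arn = sns.create_topic(Name="watchman-pagerduty")["TopicArn"]  # made-up name

# The endpoint URL comes from the PagerDuty CloudWatch integration guide;
# a placeholder is used here.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="https",
    Endpoint="https://<url-from-the-integration-guide>",
)
```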
PagerDuty support team says: "At this time, the INSUFFICIENT_DATA state can not resolve a PagerDuty incident."