Problem
On two occasions now, an Elasticsearch node has failed over the weekend, causing the healthcheck DAG to issue alerts every 15 minutes (its schedule interval). This flooded our alerts channel with notifications that maintainers had to scroll through once they were back online on Monday.
Description
We should always alert on the first instance of this healthcheck failing from a healthy state. In that case, we should also create a Variable that records a timestamp of when the healthcheck failed.
On every DAG run, this Variable would be checked before alerting to determine whether an alert should be sent to the Slack channel.
If the timestamp is from within the last 6 hours (or a shorter window, if other folks think it should be), the alert should not be sent. If it is outside that window, the alert should be sent and the Variable's timestamp should be updated.
Once a healthcheck succeeds (i.e. the cluster health is back to normal), the Variable should be cleared or deleted.
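A minimal sketch of what this throttling could look like inside the DAG's alerting logic, using Airflow's `Variable` model. The Variable name, window length, and helper function names here are placeholders, not anything that exists in the codebase yet:

```python
from datetime import datetime, timedelta, timezone

from airflow.models import Variable

# Hypothetical Variable name and throttle window; both are assumptions.
ALERT_VARIABLE = "es_healthcheck_last_alerted"
ALERT_WINDOW = timedelta(hours=6)


def should_alert() -> bool:
    """Return True if a Slack alert should be sent for a failing healthcheck."""
    last_alerted = Variable.get(ALERT_VARIABLE, default_var=None)
    if last_alerted is None:
        # First failure from a healthy state: always alert.
        return True
    # Suppress the alert if we already alerted within the window.
    return datetime.now(timezone.utc) - datetime.fromisoformat(last_alerted) > ALERT_WINDOW


def record_alert() -> None:
    """Store the timestamp of the alert we just sent."""
    Variable.set(ALERT_VARIABLE, datetime.now(timezone.utc).isoformat())


def clear_alert_state() -> None:
    """Healthcheck succeeded again: remove the throttling Variable."""
    Variable.delete(ALERT_VARIABLE)
```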
Alternatives
Leave things as they are and accept that we will likely encounter alert fatigue.
Additional context
We don't want to change the DAG's schedule interval because we want to be notified of an issue with the Elasticsearch cluster as quickly as possible. Adding the Variable lets us tune the frequency of alerts while keeping the resolution of the actual cluster checks high.