Problem
On two occasions now, an Elasticsearch node has failed over the weekend, causing the healthcheck DAG to issue alerts every 15 minutes (its schedule interval). This flooded our alerts channel with notifications that maintainers had to scroll through once they were back online on Monday.
Description
We should always alert on the first instance of this healthcheck failing from a healthy state. In that case, we should also create a Variable that records a timestamp of when the healthcheck failed.
On every DAG run, this Variable would be checked before alerting to determine whether an alert should be sent to the Slack channel.
If the timestamp is from within the last 6 hours (or a shorter window, if other folks think it should be), the alert should not be sent. If it is outside that window, the alert should be sent and the Variable's timestamp should be updated.
Once a healthcheck succeeds (i.e. the cluster health is back to normal), the Variable should be cleared or deleted.
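A minimal sketch of what this throttling could look like inside the DAG's alerting logic, using Airflow's `Variable` model. The Variable name, window length, and helper function names here are placeholders, not anything that exists in the codebase yet:

```python
from datetime import datetime, timedelta, timezone

from airflow.models import Variable

# Hypothetical Variable name and throttle window; both are assumptions.
ALERT_VARIABLE = "es_healthcheck_last_alerted"
ALERT_WINDOW = timedelta(hours=6)


def should_alert() -> bool:
    """Return True if a Slack alert should be sent for a failing healthcheck."""
    last_alerted = Variable.get(ALERT_VARIABLE, default_var=None)
    if last_alerted is None:
        # First failure from a healthy state: always alert.
        return True
    # Suppress the alert if we already alerted within the window.
    return datetime.now(timezone.utc) - datetime.fromisoformat(last_alerted) > ALERT_WINDOW


def record_alert() -> None:
    """Store the timestamp of the alert we just sent."""
    Variable.set(ALERT_VARIABLE, datetime.now(timezone.utc).isoformat())


def clear_alert_state() -> None:
    """Healthcheck succeeded again: remove the throttling Variable."""
    Variable.delete(ALERT_VARIABLE)
```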
Alternatives
Leave things as they are and accept that we will likely encounter alert fatigue.
Additional context
We don't want to change the DAG's schedule interval because we want to be notified of an issue with the Elasticsearch cluster as quickly as possible. Adding the Variable lets us tune the frequency of alerts while keeping the resolution of the actual cluster checks high.