Use an "in-alarm" variable to prevent sending Slack alerts every 15 minutes for Elasticsearch cluster health #4638

Open

Description

Problem

On two occasions now, we have had an Elasticsearch node fail over the weekend and issue alerts every 15 minutes (the DAG's schedule interval). This flooded our alerts channel with notifications that had to be scrolled through once maintainers got back online on Monday.

Description

We should always alert on the first instance of this healthcheck failing from a healthy state. In that case, we should also create a Variable which includes a timestamp of when the alert was sent.

On all DAG runs, this Variable would be checked prior to alerting to determine if an alert should be sent to the Slack channel.

If the timestamp is within the last 6 hours (the exact window is open to discussion if other folks think it should be shorter), the alert should not be sent. If it's outside that window, the alert should be sent and the Variable's timestamp should be updated.

Once a healthcheck succeeds (i.e. the cluster health is back to normal), the Variable should be cleared or deleted.
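
The flow described above could be sketched roughly as follows. This is only an illustration, not the DAG's actual code: the key name `elasticsearch_health_in_alarm`, the `handle_healthcheck` function, and the dict-like `store` are all hypothetical. In the real DAG the store would presumably be Airflow's `Variable` (`Variable.get`, `Variable.set`, `Variable.delete`); a plain mapping is used here so the logic can be run without Airflow.

```python
from datetime import datetime, timedelta, timezone
from typing import MutableMapping, Optional

# Hypothetical names chosen for illustration; none of them come from the issue.
ALARM_KEY = "elasticsearch_health_in_alarm"
ALERT_WINDOW = timedelta(hours=6)  # proposed window, open to discussion


def handle_healthcheck(
    healthy: bool,
    store: MutableMapping[str, str],
    now: Optional[datetime] = None,
    window: timedelta = ALERT_WINDOW,
) -> bool:
    """Return True if a Slack alert should be sent for this DAG run.

    `store` stands in for Airflow Variables (an assumption of this sketch).
    """
    now = now or datetime.now(timezone.utc)

    if healthy:
        # Cluster is back to normal: clear the in-alarm marker.
        store.pop(ALARM_KEY, None)
        return False

    last_alert = store.get(ALARM_KEY)
    if last_alert is not None:
        if now - datetime.fromisoformat(last_alert) < window:
            # An alert was already sent within the window; stay quiet.
            return False

    # First failure from a healthy state, or the window has elapsed:
    # record the time of this alert and tell the caller to send it.
    store[ALARM_KEY] = now.isoformat()
    return True
```

With a 15-minute schedule, the first failing run would alert, later failing runs within the window would stay silent, and a successful run would reset the state so that the next failure alerts immediately.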

Alternatives

Leave things as is and accept that we're likely going to encounter alert fatigue.

Additional context

We don't want to change the schedule interval of this DAG because we want as quick a notice that we have an issue with the Elasticsearch cluster as possible. Adding the Variable allows us to tune the frequency of alerts while keeping our resolution on actually checking the cluster quite high.

Metadata

Labels: help wanted, ✨ goal: improvement, 💻 aspect: code, 🔧 tech: airflow, 🟩 priority: low, 🧱 stack: catalog

Status: 📋 Backlog