Schedulers are issuing abrupt pod deletes when there is a delay in schedulers' heartbeat #31198

@dirrao

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Apache Airflow version: 2.3.3

Schedulers race to adopt pods, which leads to abrupt pod deletions when a scheduler's heartbeat is delayed. The affected schedulers are alive, not dead; their heartbeats are only delayed by network timeouts, heavy processing, and similar causes.
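
The failure mode described above can be illustrated with a minimal sketch. It assumes a scheduler is judged dead purely by comparing the age of its last heartbeat with a threshold; the function `looks_dead` and the timestamps are hypothetical and are not taken from the Airflow code base.

```python
from datetime import datetime, timedelta


def looks_dead(latest_heartbeat: datetime, now: datetime, threshold_seconds: int) -> bool:
    """Return True when the last heartbeat is older than the threshold.

    Hypothetical helper: it models a heartbeat-age liveness check, not
    Airflow's actual implementation.
    """
    return (now - latest_heartbeat) > timedelta(seconds=threshold_seconds)


# Illustrative timestamps only.
now = datetime(2023, 5, 10, 12, 0, 0)

# The scheduler is still running, but its last heartbeat landed 8 seconds ago
# because of a slow network call or heavy processing.
delayed_heartbeat = now - timedelta(seconds=8)

# With a 5 second threshold the live scheduler is classified as dead,
# so a peer scheduler would start adopting its pods.
falsely_dead = looks_dead(delayed_heartbeat, now, threshold_seconds=5)

# With a larger threshold the same delay is tolerated.
tolerated = looks_dead(delayed_heartbeat, now, threshold_seconds=30)

print(falsely_dead, tolerated)
```

Under this model, any heartbeat delay longer than the threshold is indistinguishable from a crashed scheduler, which is the situation this report is about.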

What you think should happen instead

If a scheduler's heartbeat is delayed for a genuine reason, the running worker pods should not be killed.

How to reproduce

Reduce scheduler_health_check_threshold to 5 and orphaned_tasks_check_interval to 10 in the Airflow config file.
Launch Airflow with two schedulers and schedule multiple DAGs, with backfill, to run every 1/5 mins.
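
As an alternative to editing the config file, the two values from the steps above could be supplied as environment variables. This is a sketch under two assumptions: that both options sit in the `[scheduler]` section, and that the deployment honours Airflow's `AIRFLOW__{SECTION}__{KEY}` override convention. The names `airflow_env_name` and `repro_settings` are hypothetical helpers introduced for illustration.

```python
import os


def airflow_env_name(section: str, key: str) -> str:
    """Build an environment variable name in the AIRFLOW__{SECTION}__{KEY} form."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Values taken from the reproduction steps; the section name is an assumption.
repro_settings = {
    airflow_env_name("scheduler", "scheduler_health_check_threshold"): "5",
    airflow_env_name("scheduler", "orphaned_tasks_check_interval"): "10",
}

# Set the variables in the current process environment so that a scheduler
# started from this process (for example via subprocess) inherits them.
os.environ.update(repro_settings)

for name, value in sorted(repro_settings.items()):
    print(f"{name}={value}")
```

The same settings would have to be applied to both scheduler processes for the reproduction to match the steps above.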

Operating System

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels

  • area:Scheduler (including HA (high availability) scheduler)
  • kind:bug (This is a clearly a bug)