Leak in Kubernetes Executor Running Tasks Slot Count #35675

@dirrao

Description

Apache Airflow version

main (development)

What happened

Schedulers race for pod adoption when a scheduler's heartbeat is delayed. The scheduler is still alive, not dead; its heartbeat is merely delayed by a network timeout, heavy processing, etc. This leads to a leak in the executor.running_tasks slots. Eventually, the scheduler is no longer able to launch pods because executor.running_tasks equals parallelism.
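A minimal sketch (hypothetical names, not the actual KubernetesExecutor code) of the leak pattern described above: slots are claimed on launch but never released when another scheduler adopts the pod, so free capacity shrinks to zero even though no pod is running locally.

```python
# Toy model of the reported leak: the running set is only ever added to,
# never cleaned up on pod adoption, so open slots drop to zero.

PARALLELISM = 4

class ToyExecutor:
    def __init__(self):
        self.running = set()  # task keys counted against parallelism

    @property
    def open_slots(self):
        return PARALLELISM - len(self.running)

    def launch(self, task_key):
        if self.open_slots <= 0:
            raise RuntimeError("no open slots: running_tasks == parallelism")
        self.running.add(task_key)

    def pod_adopted_elsewhere(self, task_key):
        # The reported bug: the entry is NOT removed here, so the slot leaks.
        pass

ex = ToyExecutor()
for i in range(PARALLELISM):
    ex.launch(f"task-{i}")
    ex.pod_adopted_elsewhere(f"task-{i}")  # pods moved to another scheduler

print(ex.open_slots)  # 0 — every slot leaked; further launches now fail
```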

What you think should happen instead

We should remove the entry from the Kubernetes executor's running queue when the worker pod is deleted or adopted by another scheduler.
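A sketch of the proposed fix, using the same hypothetical toy model (not the actual executor code): release the slot as soon as the worker pod is deleted or adopted by another scheduler.

```python
# Toy model with the fix applied: the running set is cleaned up whenever a
# pod is deleted or adopted elsewhere, so slots never leak.

PARALLELISM = 4

class FixedExecutor:
    def __init__(self):
        self.running = set()

    @property
    def open_slots(self):
        return PARALLELISM - len(self.running)

    def launch(self, task_key):
        if self.open_slots <= 0:
            raise RuntimeError("no open slots")
        self.running.add(task_key)

    def pod_deleted_or_adopted(self, task_key):
        # The fix: free the slot instead of leaking it.
        self.running.discard(task_key)

ex = FixedExecutor()
for i in range(10):  # far more launches than parallelism, yet no starvation
    ex.launch(f"task-{i}")
    ex.pod_deleted_or_adopted(f"task-{i}")

print(ex.open_slots)  # 4 — all slots released
```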

How to reproduce

1. Reduce scheduler_health_check_threshold to 5 and orphaned_tasks_check_interval to 10 in the Airflow config file.
2. Launch Airflow with two schedulers and schedule multiple DAGs with backfill every 1/5 mins.
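The config changes from step 1 as an airflow.cfg fragment (both options live under the [scheduler] section):

```ini
# airflow.cfg — lowered thresholds to make the adoption race easy to reproduce
[scheduler]
scheduler_health_check_threshold = 5
orphaned_tasks_check_interval = 10
```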

Operating System

CentOS 6

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==7.9.0

Deployment

Other Docker-based deployment

Deployment details

Terraform

Anything else

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

    Labels

    area:core, kind:bug (this is clearly a bug), needs-triage (label for new issues that we didn't triage yet)
