Zombie tasks in RESTARTING state are not cleaned #33661

@Bisk1

Apache Airflow version

2.7.0

Also reproduced on 2.5.0

What happened

We recently added automation that restarts Airflow tasks with the "clear" command, so we use this feature heavily. We often clear tasks in the RUNNING state, which moves them to the RESTARTING state. We noticed that many of those tasks get stuck in RESTARTING. Our Airflow infrastructure runs in an environment where any process can be killed suddenly, without a graceful shutdown.

We run Airflow on GKE, but I managed to reproduce this behaviour in a local environment with SequentialExecutor. See "How to reproduce" below for details.

What you think should happen instead

Stuck tasks should be cleaned up after a scheduler restart and eventually be scheduled and executed.

How to reproduce

After some code investigation, I reproduced this behaviour in a local environment. It seems that RESTARTING tasks are handled properly only if the original task process shuts down gracefully, so that it can mark the task as UP_FOR_RETRY, or if a healthy scheduler is available to do so when the process fails for any other reason. The problem occurs in the following scenario:

  1. The task is initially in the RUNNING state.
  2. The scheduler process dies suddenly.
  3. The task process also dies suddenly.
  4. The "clear" command is executed on the task, so the webserver process changes its state to RESTARTING.
  5. From now on, even if we restart the scheduler, the task is never scheduled and never changes state. Its state has to be fixed manually, e.g. by clearing it again.
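
To make the scenario easier to reason about, here is a toy model of the five steps in plain Python. It does not use or reproduce Airflow code: the `Task` class, `clear`, `cleanup_pass` and `reproduce` are hypothetical names, and the idea that the scheduler's cleanup only looks at tasks in certain states is my assumption about the mechanism, not something confirmed here.

```python
from dataclasses import dataclass
from enum import Enum


class State(str, Enum):
    RUNNING = "running"
    RESTARTING = "restarting"
    UP_FOR_RETRY = "up_for_retry"


@dataclass
class Task:
    state: State
    process_alive: bool = True


def clear(task: Task) -> None:
    """Toy version of the "clear" command: RUNNING becomes RESTARTING."""
    if task.state is State.RUNNING:
        task.state = State.RESTARTING


def cleanup_pass(task: Task, rescued_states: frozenset) -> None:
    """One cleanup pass of a (restarted) scheduler over a single task.

    ASSUMPTION: a dead task is only rescued if its state is listed in
    `rescued_states`. This is a guess at the mechanism, not Airflow code.
    """
    if not task.process_alive and task.state in rescued_states:
        task.state = State.UP_FOR_RETRY


def reproduce(rescued_states: frozenset) -> State:
    task = Task(State.RUNNING)          # step 1: task is RUNNING
    # step 2: the scheduler dies, so nothing watches the task
    task.process_alive = False          # step 3: task process dies
    clear(task)                         # step 4: "clear" sets RESTARTING
    cleanup_pass(task, rescued_states)  # step 5: scheduler is restarted
    return task.state


# Cleanup that only considers RUNNING tasks leaves the task stuck.
stuck = reproduce(frozenset({State.RUNNING}))
# Cleanup that also considers RESTARTING tasks rescues it.
rescued = reproduce(frozenset({State.RUNNING, State.RESTARTING}))
print(stuck.value, rescued.value)
```

In this model the task stays in RESTARTING forever unless the cleanup pass also covers that state, which matches what I observe after restarting the scheduler.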

A recording of the steps to reproduce in a local environment:
https://vimeo.com/857192666?share=copy

Operating System

macOS Ventura 13.4.1

Versions of Apache Airflow Providers

N/A

Deployment

Official Apache Airflow Helm Chart

Deployment details

N/A

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Labels

area:core, kind:bug, needs-triage
