-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Description
Apache Airflow version
2.7.0
Also reproduced on 2.5.0
What happened
Recently we added some automation to restarting Airflow tasks with "clear" command so we use this feature a lot. We often clear tasks in RUNNING state, which means that they go into RESTARTING state. We noticed that a lot of those tasks get stuck in RESTARTING state. Our Airflow infrastructure runs in an environment where any process can get suddenly killed without graceful shutdown.
We run Airflow on GKE but I managed to reproduce this behaviour on local environment with SequentialExecutor. See "How to reproduce" below for details.
What you think should happen instead
Tasks should get cleaned after scheduler restart and eventually get scheduled and executed.
How to reproduce
After some code investigation, I reproduced this kind of behaviour on local environment and it seems that RESTARTING tasks are only properly handled if the original restarting task is gracefully shut down so it can mark task as UP_FOR_RETRY or at least there is a healthy scheduler to do it if they fail for any other reason. The problem is with the following scenario:
- Task is initially in RUNNING state.
- Scheduler process dies suddenly.
- The task process also dies suddenly.
- "clear" command is executed on the task so the state is changed to RESTARTING state by webserver process.
- From now on, even if we restart scheduler, the task will never get scheduled or change its state. It needs to have its state manually fixed, e.g. by clearing it again.
A recording of steps to reproduce on local environment:
https://vimeo.com/857192666?share=copy
Operating System
MacOS Ventura 13.4.1
Versions of Apache Airflow Providers
N/A
Deployment
Official Apache Airflow Helm Chart
Deployment details
N/A
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct