[BUG][Airflow 3] Task's state is treated as changed externally when some retries are left and listener is called twice #48927

@kacpermuda

Description

Apache Airflow version

main (development)

If "Other Airflow 2 version" selected, which one?

No response

What happened?

I've noticed that OpenLineage's listener sends two FAIL events when a task fails but still has retries left. The problem is not OpenLineage-related, though - the listener_manager itself gets called twice.

The problem is that the get_listener_manager().hook.on_task_instance_failed call is also made on the scheduler, exactly here. This happens because the task falls into the if branch for tasks treated as killed externally, so ti.handle_failure is called (and it invokes the listener internally).
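
For anyone who wants to confirm where the second call comes from, below is a minimal debug-listener sketch (the file name, class names and log message are my own illustration, not anything shipped with Airflow). It logs a stack trace each time on_task_instance_failed fires, so one invocation should trace back to the worker's own failure handling and the other to the scheduler calling ti.handle_failure. The hookimpl argument names follow the hookspec in airflow/listeners/spec/taskinstance.py and may need adjusting if your version's spec differs (e.g. an extra session argument on Airflow 2.x).

# plugins/debug_failure_listener.py - hypothetical debug helper, not part of Airflow
import logging
import traceback

from airflow.listeners import hookimpl
from airflow.plugins_manager import AirflowPlugin

log = logging.getLogger(__name__)


class _DebugFailureListener:
    @hookimpl
    def on_task_instance_failed(self, previous_state, task_instance, error):
        # Log who called us: with this bug, one stack trace should end in the task's
        # own failure handling and the other in the scheduler's handle_failure() path.
        log.warning(
            "on_task_instance_failed fired for %s (previous_state=%s, error=%s)\n%s",
            task_instance,
            previous_state,
            error,
            "".join(traceback.format_stack(limit=15)),
        )


class DebugFailureListenerPlugin(AirflowPlugin):
    name = "debug_failure_listener"
    listeners = [_DebugFailureListener()]

Dropping this file into the plugins folder registers the listener on every component that loads plugins, so both the worker-side and the scheduler-side invocations show up in the respective logs.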

What you think should happen instead?

I think this simple task execution should not be treated as an external task state change, and the listener should definitely be called only once.

How to reproduce

  1. Run the DAG below on latest main.
  2. Look in the scheduler logs: you should find ERROR logs like Executor CeleryExecutor ... reported that the task instance ... finished with state success, but the task instance's state attribute is running, followed by DEBUG-level logs about the listener's on_task_instance_failed being called.

import datetime as dt

from airflow import DAG
from airflow.providers.standard.operators.bash import BashOperator

with DAG(
    dag_id="dag_failure_wait",
    start_date=dt.datetime(2024, 7, 3),
    schedule=None,
    catchup=False,
) as dag:
    task_failure = BashOperator(
        task_id="task_failure",
        bash_command="sleep 2 && exit 1;",
        retry_delay=1,
        retries=1,
    )

Operating System

MacOS

Versions of Apache Airflow Providers

latest main

Deployment

Virtualenv installation

Deployment details

Breeze, with LocalExecutor / CeleryExecutor (tested both).
breeze start-airflow -b postgres
breeze start-airflow --integration openlineage -b postgres
breeze start-airflow --integration openlineage -b postgres --executor=CeleryExecutor

Anything else?

Logs for CeleryExecutor run:
log_celery_20250408_105902.txt
log_scheduler_20250408_105917.txt

Logs for LocalExecutor run:
log_scheduler_20250408_110833.txt

Logs for LocalExecutor without the OpenLineage integration:
log_scheduler_20250408_111105.txt

OL events received for the run (notice the two FAIL events for the first task run):

(screenshot attached)
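
If someone wants to reproduce the duplicated FAIL events without wiring up an external OpenLineage backend, one option (my suggestion, not part of the original setup) is to point the provider at the console transport so the emitted events land directly in the logs, e.g. via the provider's transport option:

AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "console"}'

With that in place, the two FAIL events for the failing try should show up in the scheduler/worker logs instead of only in an external backend.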

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Labels

area:Listeners, area:core, kind:bug, priority:high (high priority bug that should be patched quickly but does not require an immediate new release)
