
Airflow Task/DAG fails if connection to OTEL collector fails (when otel integration is enabled) #34405

@sa1

Apache Airflow version

2.7.1

What happened

I enabled the experimental OTEL integration, and the connection to the OTEL collector sometimes fails. Such connection failures are expected and common. However, a failed connection currently appears to cause the task itself to fail, which adds an extra point of failure to every task and DAG. Sometimes the failure happens before the DAG has even started, so task-level retries can't help.

The only error message I see in this case is the connection failure.

urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=9999): Max retries exceeded with url: /v1/metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff41c054430>: Failed to establish a new connection: [Errno 111] Connection refused'))

This is not printed to the Airflow UI, only to the worker logs, so it's not obvious why a task/DAG failed.

What you think should happen instead

In this situation, Airflow should print a warning and continue with the task.

When any other Python application is auto-instrumented with OTEL, the automatic instrumentation behaves in the desired way: it ignores connection failures and only prints a warning message.

Maybe this setting could be configurable, but the desired behaviour should be to ignore the exception.
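To make the request concrete, here is a minimal sketch of the behaviour I have in mind, written as a plain Python decorator. It is not Airflow's or OpenTelemetry's actual code: `tolerate_export_errors`, `fail_on_error`, `export_metrics` and `run_task` are hypothetical names used only for illustration, and `export_metrics` simply simulates the refused connection.

```python
import functools
import logging

log = logging.getLogger(__name__)


def tolerate_export_errors(fail_on_error: bool = False):
    """Wrap a metrics call so export failures warn instead of raising.

    `fail_on_error` is a hypothetical switch standing in for the
    "maybe this could be configurable" idea above.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                if fail_on_error:
                    # Strict mode: keep the current behaviour and propagate.
                    raise
                # Lenient mode: report the problem and let the caller continue.
                log.warning("Metrics export failed, continuing: %s", exc)
                return None
        return wrapper
    return decorator


@tolerate_export_errors()
def export_metrics(payload):
    # Hypothetical stand-in for the real exporter call; it always fails
    # the way an unreachable collector would.
    raise ConnectionRefusedError(111, "Connection refused")


def run_task():
    # The export failure is logged as a warning; the task body still completes.
    export_metrics({"task": "demo"})
    return "task finished"
```

With the default `fail_on_error=False`, `run_task()` completes and only a warning is logged; with `fail_on_error=True`, the exception propagates as it does today.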

How to reproduce

Enable the OTEL integration and turn off the collector. Run any DAG or task and it will fail.
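When reproducing, it can help to confirm that the collector really is unreachable before triggering the DAG. The snippet below is a small standard-library check, not part of Airflow; `collector_reachable` is a hypothetical helper, and the default host and port are taken from the error message above, so adjust them to your own setup.

```python
import socket


def collector_reachable(host: str = "localhost", port: int = 9999,
                        timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the collector endpoint succeeds."""
    try:
        # create_connection raises OSError (including ConnectionRefusedError
        # and timeouts) when nothing is accepting connections on host:port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns `False` and the OTEL integration is enabled, running a task should trigger the failure described in this issue.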

Operating System

Ubuntu 22.04.3 LTS

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.6.0
apache-airflow-providers-celery==3.3.3
apache-airflow-providers-common-sql==1.7.1
apache-airflow-providers-ftp==3.5.1
apache-airflow-providers-http==4.5.1
apache-airflow-providers-imap==3.3.1
apache-airflow-providers-openlineage==1.0.2
apache-airflow-providers-postgres==5.6.0
apache-airflow-providers-redis==3.3.1
apache-airflow-providers-slack==8.0.0
apache-airflow-providers-snowflake==5.0.0
apache-airflow-providers-sqlite==3.4.3
apache-airflow-providers-ssh==3.7.2

Deployment

Other Docker-based deployment

Deployment details

Docker-based custom deployment on ECS Fargate.
Separate Fargate tasks for the webserver, worker, scheduler and triggerer.
The OTEL collector runs as an agent in each task.

Anything else

The task fails every time the connection to the OTEL collector fails. Why the collector itself sometimes fails is the subject of a separate investigation; it may have something to do with the size of the data/metrics being sent to it. But I believe those reasons are not very relevant to this bug.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
