Skip to content

OTel task-level metrics (ti.finish, ti.start) lost — forked processes and KubernetesExecutor #64690

@MichaelRBlack

Description

@MichaelRBlack

Apache Airflow version

3.1.8

What happened

Task-level OTel metrics (e.g. ti.finish, ti.start) are silently lost and never reach the OTel collector. This affects all executors through two distinct failure modes:

  1. Forking executors (LocalExecutor, CeleryExecutor): The OTel SDK's Once() guard prevents set_meter_provider() from working in forked child processes, leaving the child with a stale MeterProvider whose export thread is dead after fork.

  2. KubernetesExecutor: Each task pod initializes a fresh MeterProvider (no fork issue), but the pod terminates before the PeriodicExportingMetricReader fires. ti.finish is recorded at the very end of task execution, and the pod exits before the export interval (default 60s, commonly configured to 30s) elapses — so the metric is buffered but never sent.

The result in both cases: ti.finish and ti.start metrics are missing from Prometheus/Grafana, while scheduler-level metrics (scheduler_heartbeat, executor.running_tasks, pool.*, etc.) work fine since they're emitted from long-lived processes.

What you think should happen instead

Task-level metrics emitted from any executor should be reliably exported to the OTel collector before the task process or pod exits.

How to reproduce

Forking executors (LocalExecutor, CeleryExecutor)

  1. Configure Airflow 3.x with OTel metrics enabled (otel_on = True)
  2. Run any DAG with LocalExecutor or CeleryExecutor
  3. Observe task logs show:
    INFO - Stats instance was created in PID 7 but accessed in PID 19. Re-initializing.
    INFO - [Metric Exporter] Connecting to OpenTelemetry Collector at http://...
    WARNING - Overriding of current MeterProvider is not allowed
    
  4. Check Grafana/Prometheus — ti.finish metrics are missing

KubernetesExecutor

  1. Configure Airflow 3.x with OTel metrics enabled and KubernetesExecutor
  2. Run any DAG — tasks execute in individual pods
  3. No warning is logged (the MeterProvider initializes successfully in a fresh process)
  4. Check Grafana/Prometheus — ti.finish metrics are still missing because the pod exited before the periodic export

Root cause

Failure mode 1: Forking executors — Once() guard survives fork

airflow/stats.py correctly detects PID mismatches after fork and re-initializes the Stats instance by calling otel_logger.get_otel_logger(). This creates a fresh MeterProvider and calls metrics.set_meter_provider().

However, the OTel Python SDK uses a Once() guard (opentelemetry/metrics/_internal/__init__.py):

_METER_PROVIDER_SET_ONCE = Once()

def set_meter_provider(meter_provider):
    def set_mp():
        global _METER_PROVIDER
        _METER_PROVIDER = meter_provider
        _PROXY_METER_PROVIDER.on_set_meter_provider(meter_provider)

    did_set = _METER_PROVIDER_SET_ONCE.do_once(set_mp)
    if not did_set:
        _logger.warning("Overriding of current MeterProvider is not allowed")

The Once._done = True flag from the parent process survives fork(), so the child's set_meter_provider() silently fails. The child ends up using the parent's stale MeterProvider whose PeriodicExportingMetricReader background thread is dead after fork.

The code path:

  1. stats.py:55-64 — detects PID mismatch, sets cls.instance = None, calls factory
  2. otel_logger.py:410 — creates new MeterProvider, calls metrics.set_meter_provider()
  3. OTel SDK Once().do_once() — returns False because _done was inherited from parent
  4. otel_logger.py returns SafeOtelLogger(metrics.get_meter_provider(), ...) — gets stale parent provider
  5. task_runner.py:1195Stats.incr("ti.finish", ...) → dead exporter → metrics lost

Failure mode 2: KubernetesExecutor — no flush before pod exit

With KubernetesExecutor, there is no fork — each task runs in a fresh pod with a correctly initialized MeterProvider. The problem is timing:

  1. Task executes and completes
  2. Stats.incr("ti.finish", ...) records the metric in the MeterProvider's in-memory buffer
  3. The task process exits and the pod is deleted (delete_worker_pods: True)
  4. The PeriodicExportingMetricReader never fires because the process ended before the next export interval

The OTel SDK does not automatically call force_flush() on process exit. Since task pods are short-lived, the buffered metrics are lost.

Proposed fix

Both failure modes can be addressed in core with minimal changes:

Fix 1: Reset OTel state after fork (forking executors)

Reset the OTel SDK's provider state in get_otel_logger() before calling set_meter_provider(). Since stats.py only calls the factory after detecting a PID mismatch (i.e., we know we're in a forked child), this is safe:

# otel_logger.py — before metrics.set_meter_provider(...)
import opentelemetry.metrics._internal as _metrics_internal
_metrics_internal._METER_PROVIDER_SET_ONCE._done = False
_metrics_internal._METER_PROVIDER = None

Fix 2: Flush OTel metrics on task teardown (all executors)

Add a force_flush() call in the task runner's shutdown path to ensure pending metrics are exported before the process exits. This benefits all executors but is critical for KubernetesExecutor:

# In the task execution teardown path (e.g., task_runner.py or via atexit)
from opentelemetry.metrics import get_meter_provider

provider = get_meter_provider()
if hasattr(provider, "force_flush"):
    provider.force_flush(timeout_millis=5000)

An atexit handler registered during OTel initialization in otel_logger.py would be the cleanest integration point, ensuring it fires regardless of how the task process exits.

Operating System

Linux (EKS, Kubernetes)

Versions of Apache Airflow Providers

N/A (core issue)

Deployment

Official Apache Airflow Helm Chart

Anything else

Workaround for forking executors: an Airflow plugin using os.register_at_fork(after_in_child=...) to reset the OTel state.

Workaround for KubernetesExecutor: an Airflow plugin that registers an atexit handler to call force_flush() on the MeterProvider.

Both workarounds function but shouldn't be necessary — the fixes belong in core.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:corearea:metricskind:bugThis is a clearly a bugpriority:highHigh priority bug that should be patched quickly but does not require immediate new release

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions