OTel task-level metrics (ti.finish, ti.start) lost — forked processes and KubernetesExecutor

### Apache Airflow version

3.1.8

### What happened

Task-level OTel metrics (e.g. `ti.finish`, `ti.start`) are silently lost and never reach the OTel collector. This affects **all executors** through two distinct failure modes:

1. **Forking executors** (LocalExecutor, CeleryExecutor): The OTel SDK's `Once()` guard prevents `set_meter_provider()` from working in forked child processes, leaving the child with a stale `MeterProvider` whose export thread is dead after fork.

2. **KubernetesExecutor**: Each task pod initializes a fresh `MeterProvider` (no fork issue), but the pod terminates before the `PeriodicExportingMetricReader` fires. `ti.finish` is recorded at the very end of task execution, and the pod exits before the export interval (default 60s, commonly configured to 30s) elapses — so the metric is buffered but never sent.

The result in both cases: `ti.finish` and `ti.start` metrics are missing from Prometheus/Grafana, while scheduler-level metrics (`scheduler_heartbeat`, `executor.running_tasks`, `pool.*`, etc.) work fine since they're emitted from long-lived processes.

### What you think should happen instead

Task-level metrics emitted from any executor should be reliably exported to the OTel collector before the task process or pod exits.

### How to reproduce

#### Forking executors (LocalExecutor, CeleryExecutor)

1. Configure Airflow 3.x with OTel metrics enabled (`otel_on = True`)
2. Run any DAG with LocalExecutor or CeleryExecutor
3. Observe task logs show:
   ```
   INFO - Stats instance was created in PID 7 but accessed in PID 19. Re-initializing.
   INFO - [Metric Exporter] Connecting to OpenTelemetry Collector at http://...
   WARNING - Overriding of current MeterProvider is not allowed
   ```
4. Check Grafana/Prometheus — `ti.finish` metrics are missing

#### KubernetesExecutor

1. Configure Airflow 3.x with OTel metrics enabled and KubernetesExecutor
2. Run any DAG — tasks execute in individual pods
3. No warning is logged (the `MeterProvider` initializes successfully in a fresh process)
4. Check Grafana/Prometheus — `ti.finish` metrics are still missing because the pod exited before the periodic export

### Root cause

#### Failure mode 1: Forking executors — `Once()` guard survives fork

`airflow/stats.py` correctly detects PID mismatches after fork and re-initializes the Stats instance by calling `otel_logger.get_otel_logger()`. This creates a fresh `MeterProvider` and calls `metrics.set_meter_provider()`.

However, the OTel Python SDK uses a `Once()` guard (`opentelemetry/metrics/_internal/__init__.py`):

```python
_METER_PROVIDER_SET_ONCE = Once()

def set_meter_provider(meter_provider):
    def set_mp():
        global _METER_PROVIDER
        _METER_PROVIDER = meter_provider
        _PROXY_METER_PROVIDER.on_set_meter_provider(meter_provider)

    did_set = _METER_PROVIDER_SET_ONCE.do_once(set_mp)
    if not did_set:
        _logger.warning("Overriding of current MeterProvider is not allowed")
```

The `Once._done = True` flag from the parent process survives `fork()`, so the child's `set_meter_provider()` silently fails. The child ends up using the parent's stale `MeterProvider` whose `PeriodicExportingMetricReader` background thread is dead after fork.

The code path:
1. **`stats.py:55-64`** — detects PID mismatch, sets `cls.instance = None`, calls factory
2. **`otel_logger.py:410`** — creates new `MeterProvider`, calls `metrics.set_meter_provider()`
3. **OTel SDK `Once().do_once()`** — returns `False` because `_done` was inherited from parent
4. **`otel_logger.py` returns** `SafeOtelLogger(metrics.get_meter_provider(), ...)` — gets stale parent provider
5. **`task_runner.py:1195`** — `Stats.incr("ti.finish", ...)` → dead exporter → metrics lost

#### Failure mode 2: KubernetesExecutor — no flush before pod exit

With KubernetesExecutor, there is no fork — each task runs in a fresh pod with a correctly initialized `MeterProvider`. The problem is timing:

1. Task executes and completes
2. `Stats.incr("ti.finish", ...)` records the metric in the `MeterProvider`'s in-memory buffer
3. The task process exits and the pod is deleted (`delete_worker_pods: True`)
4. The `PeriodicExportingMetricReader` never fires because the process ended before the next export interval

The OTel SDK does not automatically call `force_flush()` on process exit. Since task pods are short-lived, the buffered metrics are lost.

### Proposed fix

Both failure modes can be addressed in core with minimal changes:

#### Fix 1: Reset OTel state after fork (forking executors)

Reset the OTel SDK's provider state in `get_otel_logger()` before calling `set_meter_provider()`. Since `stats.py` only calls the factory after detecting a PID mismatch (i.e., we know we're in a forked child), this is safe:

```python
# otel_logger.py — before metrics.set_meter_provider(...)
import opentelemetry.metrics._internal as _metrics_internal
_metrics_internal._METER_PROVIDER_SET_ONCE._done = False
_metrics_internal._METER_PROVIDER = None
```

#### Fix 2: Flush OTel metrics on task teardown (all executors)

Add a `force_flush()` call in the task runner's shutdown path to ensure pending metrics are exported before the process exits. This benefits all executors but is critical for KubernetesExecutor:

```python
# In the task execution teardown path (e.g., task_runner.py or via atexit)
from opentelemetry.metrics import get_meter_provider

provider = get_meter_provider()
if hasattr(provider, "force_flush"):
    provider.force_flush(timeout_millis=5000)
```

An `atexit` handler registered during OTel initialization in `otel_logger.py` would be the cleanest integration point, ensuring it fires regardless of how the task process exits.

### Operating System

Linux (EKS, Kubernetes)

### Versions of Apache Airflow Providers

N/A (core issue)

### Deployment

Official Apache Airflow Helm Chart

### Anything else

Workaround for forking executors: an Airflow plugin using `os.register_at_fork(after_in_child=...)` to reset the OTel state.

Workaround for KubernetesExecutor: an Airflow plugin that registers an `atexit` handler to call `force_flush()` on the `MeterProvider`.

Both workarounds function but shouldn't be necessary — the fixes belong in core.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OTel task-level metrics (ti.finish, ti.start) lost — forked processes and KubernetesExecutor #64690

Apache Airflow version

What happened

What you think should happen instead

How to reproduce

Forking executors (LocalExecutor, CeleryExecutor)

KubernetesExecutor

Root cause

Failure mode 1: Forking executors — `Once()` guard survives fork

Failure mode 2: KubernetesExecutor — no flush before pod exit

Proposed fix

Fix 1: Reset OTel state after fork (forking executors)

Fix 2: Flush OTel metrics on task teardown (all executors)

Operating System

Versions of Apache Airflow Providers

Deployment

Anything else

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

OTel task-level metrics (ti.finish, ti.start) lost — forked processes and KubernetesExecutor #64690

Description

Apache Airflow version

What happened

What you think should happen instead

How to reproduce

Forking executors (LocalExecutor, CeleryExecutor)

KubernetesExecutor

Root cause

Failure mode 1: Forking executors — Once() guard survives fork

Failure mode 2: KubernetesExecutor — no flush before pod exit

Proposed fix

Fix 1: Reset OTel state after fork (forking executors)

Fix 2: Flush OTel metrics on task teardown (all executors)

Operating System

Versions of Apache Airflow Providers

Deployment

Anything else

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Failure mode 1: Forking executors — `Once()` guard survives fork