Apache Airflow version
3.1.8
What happened
Task-level OTel metrics (e.g. ti.finish, ti.start) are silently lost and never reach the OTel collector. This affects all executors through two distinct failure modes:
- Forking executors (LocalExecutor, CeleryExecutor): The OTel SDK's Once() guard prevents set_meter_provider() from working in forked child processes, leaving the child with a stale MeterProvider whose export thread is dead after fork.
- KubernetesExecutor: Each task pod initializes a fresh MeterProvider (no fork issue), but the pod terminates before the PeriodicExportingMetricReader fires. ti.finish is recorded at the very end of task execution, and the pod exits before the export interval (default 60s, commonly configured to 30s) elapses, so the metric is buffered but never sent.
The result in both cases: ti.finish and ti.start metrics are missing from Prometheus/Grafana, while scheduler-level metrics (scheduler_heartbeat, executor.running_tasks, pool.*, etc.) work fine since they're emitted from long-lived processes.
What you think should happen instead
Task-level metrics emitted from any executor should be reliably exported to the OTel collector before the task process or pod exits.
How to reproduce
Forking executors (LocalExecutor, CeleryExecutor)
- Configure Airflow 3.x with OTel metrics enabled (otel_on = True)
- Run any DAG with LocalExecutor or CeleryExecutor
- Observe that the task logs show:
INFO - Stats instance was created in PID 7 but accessed in PID 19. Re-initializing.
INFO - [Metric Exporter] Connecting to OpenTelemetry Collector at http://...
WARNING - Overriding of current MeterProvider is not allowed
- Check Grafana/Prometheus: ti.finish metrics are missing
KubernetesExecutor
- Configure Airflow 3.x with OTel metrics enabled and KubernetesExecutor
- Run any DAG — tasks execute in individual pods
- No warning is logged (the MeterProvider initializes successfully in a fresh process)
- Check Grafana/Prometheus: ti.finish metrics are still missing because the pod exited before the periodic export
Root cause
Failure mode 1: Forking executors — Once() guard survives fork
airflow/stats.py correctly detects PID mismatches after fork and re-initializes the Stats instance by calling otel_logger.get_otel_logger(). This creates a fresh MeterProvider and calls metrics.set_meter_provider().
However, the OTel Python SDK uses a Once() guard (opentelemetry/metrics/_internal/__init__.py):
_METER_PROVIDER_SET_ONCE = Once()

def set_meter_provider(meter_provider):
    def set_mp():
        global _METER_PROVIDER
        _METER_PROVIDER = meter_provider
        _PROXY_METER_PROVIDER.on_set_meter_provider(meter_provider)

    did_set = _METER_PROVIDER_SET_ONCE.do_once(set_mp)
    if not did_set:
        _logger.warning("Overriding of current MeterProvider is not allowed")
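The Once._done = True flag from the parent process survives fork(), so the child's set_meter_provider() call is a no-op that only logs the "Overriding of current MeterProvider is not allowed" warning. The child ends up using the parent's stale MeterProvider, whose PeriodicExportingMetricReader background thread does not exist in the child because fork() copies only the calling thread.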
The code path:
- stats.py:55-64: detects PID mismatch, sets cls.instance = None, calls factory
- otel_logger.py:410: creates new MeterProvider, calls metrics.set_meter_provider()
- OTel SDK Once().do_once(): returns False because _done was inherited from parent
- otel_logger.py returns SafeOtelLogger(metrics.get_meter_provider(), ...): gets stale parent provider
- task_runner.py:1195: Stats.incr("ti.finish", ...) → dead exporter → metrics lost
Failure mode 2: KubernetesExecutor — no flush before pod exit
With KubernetesExecutor, there is no fork — each task runs in a fresh pod with a correctly initialized MeterProvider. The problem is timing:
- Task executes and completes
- Stats.incr("ti.finish", ...) records the metric in the MeterProvider's in-memory buffer
- The task process exits and the pod is deleted (delete_worker_pods: True)
- The PeriodicExportingMetricReader never fires because the process ended before the next export interval
The OTel SDK does not automatically call force_flush() on process exit. Since task pods are short-lived, the buffered metrics are lost.
Proposed fix
Both failure modes can be addressed in core with minimal changes:
Fix 1: Reset OTel state after fork (forking executors)
Reset the OTel SDK's provider state in get_otel_logger() before calling set_meter_provider(). Since stats.py only calls the factory after detecting a PID mismatch (i.e., we know we're in a forked child), this is safe:
# otel_logger.py — before metrics.set_meter_provider(...)
import opentelemetry.metrics._internal as _metrics_internal
_metrics_internal._METER_PROVIDER_SET_ONCE._done = False
_metrics_internal._METER_PROVIDER = None
Fix 2: Flush OTel metrics on task teardown (all executors)
Add a force_flush() call in the task runner's shutdown path to ensure pending metrics are exported before the process exits. This benefits all executors but is critical for KubernetesExecutor:
# In the task execution teardown path (e.g., task_runner.py or via atexit)
from opentelemetry.metrics import get_meter_provider
provider = get_meter_provider()
if hasattr(provider, "force_flush"):
    provider.force_flush(timeout_millis=5000)
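An atexit handler registered during OTel initialization in otel_logger.py would be the cleanest integration point, ensuring it fires on any normal interpreter exit. Note that atexit handlers do not run if the process exits via os._exit() or is killed by an unhandled signal, so an explicit flush in the task runner's teardown path may still be needed for those cases.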
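A rough sketch of the atexit approach follows. It is not Airflow's code: FakeMeterProvider is a hypothetical stand-in that only mimics buffering and a force_flush(timeout_millis=...) method so the example runs without the SDK, and make_flush_handler() is an illustrative helper name.

```python
import atexit


class FakeMeterProvider:
    """Hypothetical stand-in for a provider that buffers until flushed."""

    def __init__(self) -> None:
        self.buffered = []
        self.exported = []

    def record(self, name, value=1) -> None:
        # Metrics sit in memory until an export or flush happens.
        self.buffered.append((name, value))

    def force_flush(self, timeout_millis=10_000) -> bool:
        # Pretend to export everything that is currently buffered.
        self.exported.extend(self.buffered)
        self.buffered.clear()
        return True


def make_flush_handler(provider, timeout_millis=5000):
    """Build a zero-argument callable suitable for atexit.register()."""

    def _flush() -> None:
        flush = getattr(provider, "force_flush", None)
        if flush is None:
            return  # provider does not support flushing; nothing to do
        try:
            flush(timeout_millis=timeout_millis)
        except Exception:
            pass  # telemetry problems should never break process shutdown

    return _flush


provider = FakeMeterProvider()
flush_on_exit = make_flush_handler(provider)
atexit.register(flush_on_exit)  # runs on normal interpreter exit

provider.record("ti.finish")    # buffered, not yet exported
```

Calling flush_on_exit() directly has the same effect as the handler firing at exit, which makes it easy to reuse the same callable from an explicit teardown path.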
Operating System
Linux (EKS, Kubernetes)
Versions of Apache Airflow Providers
N/A (core issue)
Deployment
Official Apache Airflow Helm Chart
Anything else
Workaround for forking executors: an Airflow plugin using os.register_at_fork(after_in_child=...) to reset the OTel state.
Workaround for KubernetesExecutor: an Airflow plugin that registers an atexit handler to call force_flush() on the MeterProvider.
Both workarounds function but shouldn't be necessary — the fixes belong in core.
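For reference, the fork workaround could look roughly like the sketch below. The function names reset_meter_provider_state() and install_fork_hook() are hypothetical. The private attributes are the ones named in the proposed fix above, and since they are SDK internals they may change between releases. The hook is skipped when the SDK is not importable or the platform lacks os.register_at_fork().

```python
import importlib
import os


def reset_meter_provider_state(internal) -> None:
    """Clear the one-shot guard and global provider on the given module object.

    `internal` is expected to expose _METER_PROVIDER_SET_ONCE (with a _done
    flag) and _METER_PROVIDER, as described in the proposed fix.
    """
    internal._METER_PROVIDER_SET_ONCE._done = False
    internal._METER_PROVIDER = None


def install_fork_hook() -> bool:
    """Register the reset to run in every forked child; return True if installed."""
    if not hasattr(os, "register_at_fork"):
        return False  # not available on this platform
    try:
        internal = importlib.import_module("opentelemetry.metrics._internal")
    except ImportError:
        return False  # OTel API package is not installed
    os.register_at_fork(
        after_in_child=lambda: reset_meter_provider_state(internal)
    )
    return True


# Assumption: this would run once at plugin import time in the parent process.
HOOK_INSTALLED = install_fork_hook()
```

Passing the module object into reset_meter_provider_state() keeps the reset logic testable with a plain namespace object, without requiring the SDK to be present.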