Scheduler does not release slots correctly when kubernetes sends no reason for container waiting status #60527

@florian-meyrueis-al

Description

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes 10.5.0.

Assumed to affect all versions above as well.

Apache Airflow version

2.11.0. Not tested on 3.x, but since this is not an Airflow core issue, 3.x should be affected too.

Operating System

Debian

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

What happened

In the scheduler logs we see a series of errors like:

[2026-01-11T19:11:33.091+0000] {kubernetes_executor_utils.py:98} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 91, in run
    self.resource_version = self._run(
                            ^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 171, in _run
    self.process_status(
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 249, in process_status
    container_status_state["waiting"]["reason"]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
KeyError: 'reason'
Process KubernetesJobWatcher-3:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 91, in run
    self.resource_version = self._run(
                            ^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 171, in _run
    self.process_status(
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 249, in process_status
    container_status_state["waiting"]["reason"]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
KeyError: 'reason'
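The failing line in process_status indexes the waiting state directly. A minimal sketch of the failure mode (the plain-dict shape here is an assumption for illustration; the provider works with Kubernetes watch event payloads, not this literal dict):

```python
# Reproduces the KeyError from the traceback: the Kubernetes API marks
# ContainerStateWaiting.reason as optional, so a watch event can carry a
# waiting state with no "reason" key at all.
container_status_state = {"waiting": {}}  # no "reason", no "message"

try:
    # Direct indexing, as on line 249 of the traceback.
    reason = container_status_state["waiting"]["reason"]
except KeyError as exc:
    # This is the exception that kills the KubernetesJobWatcher process.
    print(f"KeyError: {exc}")
```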

At that hour, most of our DAGs have started and run through the whole night.

In the morning, our monitoring of available slots showed this :

[Image: graph of execution slots over time]

where the blue line is opened execution slots and the green line is running execution slots.

The problem is that at that hour (08:00 and after), no DAGs were running on Airflow anymore. Our only solution was to restart the scheduler to make all our opened slots available again.

What you think should happen instead

The KubernetesJobWatcher should not crash because of a missing key in the K8s API response, and the opened slots should all have been released properly at the end of the DAGs.

How to reproduce

I don't know.

Anything else

The code of the cncf-kubernetes provider should correctly handle optional keys in responses from Kubernetes.
In this case, the K8s API does not mark the "reason" and "message" keys as required in the specification of the ContainerStateWaiting object.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

    Labels

  • area:Scheduler (including HA (high availability) scheduler)
    • area:providers
    • kind:bug (this is clearly a bug)
    • needs-triage (label for new issues that we didn't triage yet)
    • provider:cncf-kubernetes (Kubernetes (k8s) provider related issues)