Description
Apache Airflow Provider(s)
cncf-kubernetes
Versions of Apache Airflow Providers
10.5.0.
Presumably all later versions are affected as well.
Apache Airflow version
2.11.0. 3.x was not tested, but since this is not an Airflow core issue, it should be affected too.
Operating System
Debian
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
What happened
In the scheduler logs we see a series of errors like:
2026-01-11 19:11:33.092 | [2026-01-11T19:11:33.091+0000] {kubernetes_executor_utils.py:98} ERROR - Unknown error in KubernetesJobWatcher. Failing
2026-01-11 19:11:33.092 | Traceback (most recent call last):
2026-01-11 19:11:33.092 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 91, in run
2026-01-11 19:11:33.092 |     self.resource_version = self._run(
2026-01-11 19:11:33.092 |                             ^^^^^^^^^^
2026-01-11 19:11:33.092 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 171, in _run
2026-01-11 19:11:33.092 |     self.process_status(
2026-01-11 19:11:33.092 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 249, in process_status
2026-01-11 19:11:33.092 |     container_status_state["waiting"]["reason"]
2026-01-11 19:11:33.092 |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
2026-01-11 19:11:33.092 | KeyError: 'reason'
2026-01-11 19:11:33.093 | Process KubernetesJobWatcher-3:
2026-01-11 19:11:33.093 | Traceback (most recent call last):
2026-01-11 19:11:33.093 |   File "/usr/local/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
2026-01-11 19:11:33.093 |     self.run()
2026-01-11 19:11:33.093 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 91, in run
2026-01-11 19:11:33.094 |     self.resource_version = self._run(
2026-01-11 19:11:33.094 |                             ^^^^^^^^^^
2026-01-11 19:11:33.094 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 171, in _run
2026-01-11 19:11:33.094 |     self.process_status(
2026-01-11 19:11:33.094 |   File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 249, in process_status
2026-01-11 19:11:33.094 |     container_status_state["waiting"]["reason"]
2026-01-11 19:11:33.094 |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
2026-01-11 19:11:33.094 | KeyError: 'reason'
Around this time most of our DAGs start, and they run throughout the night.
In the morning, our monitoring of available slots showed a graph in which the blue line is open execution slots and the green line is running execution slots.
The problem is that at that hour (08:00 and after), no DAGs were running on Airflow anymore. Our only solution was to restart the scheduler to make all our open slots available again.
What you think should happen instead
The KubernetesJobWatcher should not have crashed because of a missing key in the Kubernetes API response, and the open slots should all have been released properly when the DAGs finished.
How to reproduce
I don't know.
Anything else
The Kubernetes provider code should correctly handle optional keys in responses from Kubernetes.
In this case, the Kubernetes API does not require the "reason" and "message" keys in the specification of the ContainerStateWaiting object.
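As a rough sketch of the kind of handling meant here, assuming the container state arrives as a plain dict (as the subscript access in the traceback suggests): the helper name `waiting_reason` and the sample payloads are hypothetical and are not taken from the provider code.

```python
from typing import Any, Optional


def waiting_reason(container_status_state: dict[str, Any]) -> Optional[str]:
    """Return the waiting reason if the API supplied one, otherwise None.

    Hypothetical helper: it treats both the "waiting" entry and its
    "reason" key as optional instead of indexing them directly.
    """
    # "waiting" may be absent or None when the container is in another state.
    waiting = container_status_state.get("waiting") or {}
    # "reason" is optional in ContainerStateWaiting, so use .get() here too.
    return waiting.get("reason")


# Made-up payloads for illustration only.
with_reason = {"waiting": {"reason": "ImagePullBackOff", "message": "Back-off pulling image"}}
without_reason = {"waiting": {}}
not_waiting = {"running": {"startedAt": "2026-01-11T19:11:00Z"}}

for state in (with_reason, without_reason, not_waiting):
    # Direct indexing (state["waiting"]["reason"]) would raise KeyError
    # on the last two payloads; the helper returns None instead.
    print(waiting_reason(state))
```

With this kind of access, a waiting state without a reason is simply treated as "no known reason" rather than raising inside the watcher process.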
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct