Description
Apache Airflow Provider(s)
cncf-kubernetes
Versions of Apache Airflow Providers
No response
Apache Airflow version
3
Operating System
astronomer
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
What happened
When running KubernetesPodOperator (KPO) in deferrable mode, I ran into an issue caused by rate limits imposed by our Docker registry. When the pod tried to pull the image, Kubernetes hit the limit and the triggerer marked the task as failed:
airflow.providers.cncf.kubernetes.kubernetes_helper_functions.PodLaunchFailedException: Pod docker image cannot be pulled, unable to start: ErrImagePull
pull QPS exceeded
Coming from
airflow/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py
Lines 187 to 205 in 1c41180
def detect_pod_terminate_early_issues(pod: V1Pod) -> str | None:
    """
    Identify issues that justify terminating the pod early.

    :param pod: The pod object to check.
    :return: An error message if an issue is detected; otherwise, None.
    """
    pod_status = pod.status
    if pod_status.container_statuses:
        for container_status in pod_status.container_statuses:
            container_state: V1ContainerState = container_status.state
            container_waiting: V1ContainerStateWaiting | None = container_state.waiting
            if container_waiting:
                if container_waiting.reason in ["ErrImagePull", "ImagePullBackOff", "InvalidImageName"]:
                    return (
                        f"Pod docker image cannot be pulled, unable to start: {container_waiting.reason}"
                        f"\n{container_waiting.message}"
                    )
    return None
The triggerer then hands the task back to the operator. However, in the time it takes the operator to pick the task up, Kubernetes has managed to pull the image successfully and start the pod. The task outputs some logs from the pod and then simply waits for the pod to complete.
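The race described above can be illustrated with a small, self-contained simulation. The classes below are hypothetical stand-ins for the Kubernetes client models (they are not the real `V1Pod` types), and `early_issue` only mirrors the logic of the quoted check. It is a sketch of the timing problem, not the provider's code.

```python
from __future__ import annotations

from dataclasses import dataclass

# Waiting reasons that the quoted check treats as fatal.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


@dataclass
class FakeWaiting:
    """Hypothetical stand-in for V1ContainerStateWaiting."""

    reason: str
    message: str = ""


@dataclass
class FakeContainerStatus:
    """Hypothetical stand-in for a container status; waiting is None once running."""

    waiting: FakeWaiting | None = None


def early_issue(statuses: list[FakeContainerStatus]) -> str | None:
    """Mirror of the quoted check: any image-pull waiting reason is an error."""
    for status in statuses:
        waiting = status.waiting
        if waiting and waiting.reason in IMAGE_PULL_REASONS:
            return (
                f"Pod docker image cannot be pulled, unable to start: {waiting.reason}"
                f"\n{waiting.message}"
            )
    return None


# Snapshot seen by the triggerer: the registry rate limit is in effect.
at_trigger = [FakeContainerStatus(FakeWaiting("ErrImagePull", "pull QPS exceeded"))]
# Snapshot seen when the operator resumes: the retry succeeded and the pod runs.
at_resume = [FakeContainerStatus(waiting=None)]

print(early_issue(at_trigger))  # an error message, so the task is marked failed
print(early_issue(at_resume))   # None, the pod is in fact healthy
```

The same pod yields a failure at the first snapshot and no issue at the second, which is why a single observation of `ErrImagePull` may be too early to treat as terminal.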
I note the following two snippets:
airflow/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py
Lines 1003 to 1005 in 1c41180
# Skip await_pod_completion when the event is 'timeout' due to the pod can hang
# on the ErrImagePull or ContainerCreating step and it will never complete
if event["status"] != "timeout":
airflow/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py
Lines 179 to 182 in 1c41180
error_message = detect_pod_terminate_early_issues(remote_pod)
if error_message:
    pod_manager.log.info("::endgroup::")
    raise PodLaunchFailedException(error_message)
What you think should happen instead
- ErrImagePull should still result in a timeout instead of a failed status.
- When handing back from the triggerer to the operator, if the status is timeout, we should do one more check to see whether the pod has started; if it has, we should defer again.
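
The two points above could be sketched roughly as follows. Everything here is hypothetical: `PodSnapshot`, `Action`, `classify_waiting_reason`, and `decide_on_resume` are illustrative names, not part of the provider. Treating `InvalidImageName` as permanent while the other two reasons are transient is an assumption of this sketch, not something confirmed by the code quoted earlier.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum

# Assumption: these reasons can clear on their own when the image pull is retried.
TRANSIENT_REASONS = {"ErrImagePull", "ImagePullBackOff"}
# Assumption: a malformed image name never recovers, so failing fast is still fine.
PERMANENT_REASONS = {"InvalidImageName"}


class Action(Enum):
    KEEP_WAITING = "keep_waiting"  # stay in the trigger until the startup timeout
    FAIL = "fail"                  # raise immediately
    DEFER_AGAIN = "defer_again"    # pod is running, hand back to the triggerer
    TIMEOUT = "timeout"            # pod never started, use existing timeout handling


@dataclass
class PodSnapshot:
    """Hypothetical, simplified view of a pod: its phase and any waiting reason."""

    phase: str  # e.g. "Pending" or "Running"
    waiting_reason: str | None = None


def classify_waiting_reason(pod: PodSnapshot) -> Action:
    """Point 1: only permanent reasons fail early; transient ones run into the timeout."""
    if pod.waiting_reason in PERMANENT_REASONS:
        return Action.FAIL
    return Action.KEEP_WAITING


def decide_on_resume(event_status: str, pod: PodSnapshot) -> Action | None:
    """Point 2: on a 'timeout' event, re-read the pod once before giving up."""
    if event_status != "timeout":
        return None  # not a timeout, the normal completion path applies
    if pod.phase == "Running":
        return Action.DEFER_AGAIN
    return Action.TIMEOUT


# A rate-limited pull keeps waiting instead of failing the task.
print(classify_waiting_reason(PodSnapshot("Pending", "ErrImagePull")))
# The pod started between the trigger event and the operator resuming.
print(decide_on_resume("timeout", PodSnapshot("Running")))
```

In a real change, the snapshot would presumably come from a fresh read of the pod when the operator resumes, and `DEFER_AGAIN` would map to deferring the task again; both of those details are left out here because they depend on the provider's internals.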
How to reproduce
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct