Kubernetes Pod Operator: Deferred mode handling of registry rate limiting #61775

@johnhoran

Description

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

No response

Apache Airflow version

3

Operating System

astronomer

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

What happened

When running the KPO in deferred mode, I ran into an issue caused by rate limits imposed by our Docker registry. When the pod tried to pull the image, Kubernetes hit the limit and the triggerer marked the task as failed.

airflow.providers.cncf.kubernetes.kubernetes_helper_functions.PodLaunchFailedException: Pod docker image cannot be pulled, unable to start: ErrImagePull
pull QPS exceeded

Coming from

def detect_pod_terminate_early_issues(pod: V1Pod) -> str | None:
    """
    Identify issues that justify terminating the pod early.

    :param pod: The pod object to check.
    :return: An error message if an issue is detected; otherwise, None.
    """
    pod_status = pod.status
    if pod_status.container_statuses:
        for container_status in pod_status.container_statuses:
            container_state: V1ContainerState = container_status.state
            container_waiting: V1ContainerStateWaiting | None = container_state.waiting
            if container_waiting:
                if container_waiting.reason in ["ErrImagePull", "ImagePullBackOff", "InvalidImageName"]:
                    return (
                        f"Pod docker image cannot be pulled, unable to start: {container_waiting.reason}"
                        f"\n{container_waiting.message}"
                    )
    return None

The triggerer then passes control back to the operator; however, in the time it takes the operator to pick up the task, Kubernetes has managed to successfully pull the image and start the pod. The task outputs some logs from the pod and then just waits for the pod to complete.

I note that

    # Skip await_pod_completion when the event is 'timeout' due to the pod can hang
    # on the ErrImagePull or ContainerCreating step and it will never complete
    if event["status"] != "timeout":

suggests we should skip the waiting on ErrImagePull, but

    error_message = detect_pod_terminate_early_issues(remote_pod)
    if error_message:
        pod_manager.log.info("::endgroup::")
        raise PodLaunchFailedException(error_message)

returns a launch failure instead of a timeout, hence the waiting.
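One way to reconcile the two code paths would be to classify image-pull waits before raising. This is only a sketch of the idea, not existing provider code; the function name and the event-status strings other than "timeout" are assumptions for illustration:

```python
# Transient pull errors (e.g. registry rate limiting) may resolve on retry;
# a bad image name never will.
TRANSIENT_WAIT_REASONS = {"ErrImagePull", "ImagePullBackOff"}
FATAL_WAIT_REASONS = {"InvalidImageName"}


def classify_waiting_reason(reason: str) -> str:
    """Map a container 'waiting' reason to a trigger event status (hypothetical).

    Returning 'timeout' for transient pull errors would make the operator
    take the existing skip-await_pod_completion branch instead of raising
    PodLaunchFailedException.
    """
    if reason in TRANSIENT_WAIT_REASONS:
        return "timeout"
    if reason in FATAL_WAIT_REASONS:
        return "failed"
    return "running"
```

Under this scheme, a rate-limited pull would surface as a timeout event, while a genuinely unresolvable image name would still fail the task immediately.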

What you think should happen instead

  1. ErrImagePull should still result in a timeout instead of a failed status.
  2. When handing back from the triggerer to the operator, if the status is timeout we should do one more check to see whether the pod has started, and if it has, we should defer again.
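Point 2 could be sketched roughly as follows. The function name and plain-string inputs are assumptions for illustration; the real operator would inspect the V1Pod object returned by the Kubernetes client rather than bare strings:

```python
def should_defer_again(event_status: str, pod_phase: str) -> bool:
    """Decide whether the operator should re-defer after a trigger timeout.

    Hypothetical helper: only a 'timeout' event qualifies, and only when
    the pod has since recovered (e.g. the image pull eventually succeeded
    after the registry rate limit reset).
    """
    if event_status != "timeout":
        # Real failures from the triggerer should still fail the task.
        return False
    # If the pod is now making progress, hand control back to the
    # triggerer instead of failing or blocking in await_pod_completion.
    return pod_phase in ("Running", "Succeeded")
```

With a check like this, the race described above (image pulled successfully between the trigger firing and the operator resuming) would lead to another deferral rather than a synchronous wait on the worker.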

How to reproduce

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
