Kubernetes Pod Operator: Deferred mode handling of registry rate limiting #61775

@johnhoran

Description

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

No response

Apache Airflow version

3

Operating System

astronomer

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

What happened

When running the KPO in deferred mode, I ran into an issue caused by rate limits imposed by our Docker registry. When the pod tried to pull the image, Kubernetes hit the limit and the triggerer marked the task as failed.

airflow.providers.cncf.kubernetes.kubernetes_helper_functions.PodLaunchFailedException: Pod docker image cannot be pulled, unable to start: ErrImagePull
pull QPS exceeded

Coming from

def detect_pod_terminate_early_issues(pod: V1Pod) -> str | None:
    """
    Identify issues that justify terminating the pod early.

    :param pod: The pod object to check.
    :return: An error message if an issue is detected; otherwise, None.
    """
    pod_status = pod.status
    if pod_status.container_statuses:
        for container_status in pod_status.container_statuses:
            container_state: V1ContainerState = container_status.state
            container_waiting: V1ContainerStateWaiting | None = container_state.waiting
            if container_waiting:
                if container_waiting.reason in ["ErrImagePull", "ImagePullBackOff", "InvalidImageName"]:
                    return (
                        f"Pod docker image cannot be pulled, unable to start: {container_waiting.reason}"
                        f"\n{container_waiting.message}"
                    )
    return None

The triggerer then passes control back to the operator; however, in the time it takes the operator to pick up the task, Kubernetes has managed to successfully pull the image and start the pod. The task outputs some logs from the pod and then just waits for the pod to complete.

I note that

    # Skip await_pod_completion when the event is 'timeout' due to the pod can hang
    # on the ErrImagePull or ContainerCreating step and it will never complete
    if event["status"] != "timeout":

suggests we should skip the waiting on ErrImagePull, but

    error_message = detect_pod_terminate_early_issues(remote_pod)
    if error_message:
        pod_manager.log.info("::endgroup::")
        raise PodLaunchFailedException(error_message)

returns a launch failure instead of a timeout, hence the waiting.
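One way to reconcile the two code paths would be to classify image-pull waits before raising. This is only a sketch of the idea, not existing provider code; the function name and the event-status strings other than "timeout" are assumptions for illustration:

```python
# Transient pull errors (e.g. registry rate limiting) may resolve on retry;
# a bad image name never will.
TRANSIENT_WAIT_REASONS = {"ErrImagePull", "ImagePullBackOff"}
FATAL_WAIT_REASONS = {"InvalidImageName"}


def classify_waiting_reason(reason: str) -> str:
    """Map a container 'waiting' reason to a trigger event status (hypothetical).

    Returning 'timeout' for transient pull errors would make the operator
    take the existing skip-await_pod_completion branch instead of raising
    PodLaunchFailedException.
    """
    if reason in TRANSIENT_WAIT_REASONS:
        return "timeout"
    if reason in FATAL_WAIT_REASONS:
        return "failed"
    return "running"
```

Under this scheme, a rate-limited pull would surface as a timeout event, while a genuinely unresolvable image name would still fail the task immediately.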

What you think should happen instead

  1. ErrImagePull should still result in a timeout instead of a failed status.
  2. When handing back from the triggerer to the operator, if the status is timeout we should do one more check to see whether the pod has started, and if it has, we should defer again.
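Point 2 could be sketched roughly as follows. The function name and plain-string inputs are assumptions for illustration; the real operator would inspect the V1Pod object returned by the Kubernetes client rather than bare strings:

```python
def should_defer_again(event_status: str, pod_phase: str) -> bool:
    """Decide whether the operator should re-defer after a trigger timeout.

    Hypothetical helper: only a 'timeout' event qualifies, and only when
    the pod has since recovered (e.g. the image pull eventually succeeded
    after the registry rate limit reset).
    """
    if event_status != "timeout":
        # Real failures from the triggerer should still fail the task.
        return False
    # If the pod is now making progress, hand control back to the
    # triggerer instead of failing or blocking in await_pod_completion.
    return pod_phase in ("Running", "Succeeded")
```

With a check like this, the race described above (image pulled successfully between the trigger firing and the operator resuming) would lead to another deferral rather than a synchronous wait on the worker.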

How to reproduce

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
