Conversation

@AutomationDev85
Contributor

Overview

We are encountering frequent "Too Many Requests" (HTTP 429) responses from the Kubernetes API when scaling up nodes in our Kubernetes cluster. A prior change (see PR #58033) introduced retry handling in the PodManager, but some PodManager methods bypass that logic by using the KubernetesHook API client directly. This change therefore moves the primary retry mechanism to the KubernetesHook level and adds targeted retry handling only to the PodManager methods that invoke the API client directly.

Retry behavior is also refined so that only retry-worthy status codes and errors are retried.

We welcome your feedback on this change!

Details of change:

  • Retry logic centralized at the KubernetesHook level.
  • PodManager now retries only for methods that directly invoke the Kubernetes API client.
  • Retries limited to transient, retry-worthy status codes and network errors.
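The refined behavior described above can be sketched roughly as follows. All names here (the exception class, status-code set, and retry helper) are illustrative stand-ins, not the provider's actual implementation; the real code builds on the Kubernetes Python client's `ApiException`.

```python
import time


# Hypothetical stand-in for kubernetes.client.rest.ApiException.
class ApiException(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


# Only transient conditions are worth retrying; 4xx errors like 404
# or 403 fail fast instead of wasting attempts.
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}


def is_retryable(exc: BaseException) -> bool:
    """Return True only for retry-worthy API errors."""
    return isinstance(exc, ApiException) and exc.status in RETRYABLE_STATUS_CODES


def call_with_retry(fn, attempts: int = 3, base_delay: float = 0.0):
    """Call fn(), retrying retry-worthy errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise immediately on the last attempt or for
            # non-retryable errors.
            if attempt == attempts or not is_retryable(exc):
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Centralizing a predicate like `is_retryable` at the hook level is what lets every caller share one definition of "retry-worthy" instead of each method rolling its own.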

Contributor

@jscheffl jscheffl left a comment


Looks very good to me. Thanks for the cleanup and consolidation!

I'll leave this open for a few days for a second pair of eyes; otherwise I'd propose merging before the next provider wave.

One small nit, but it's non-blocking.

@AutomationDev85 AutomationDev85 changed the title KuberetesPodOperator: Rework of Kubernetes API retry behavior KubernetesPodOperator: Rework of Kubernetes API retry behavior Nov 19, 2025
Contributor

@jscheffl jscheffl left a comment


Very cool now! Thanks!

@AutomationDev85
Contributor Author

Sorry, I found out that I missed checking for the async exception type. I updated the code to catch both sync and async exceptions.
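The fix can be illustrated with a single predicate that treats both client flavors uniformly. The class names below are hypothetical stand-ins for the sync and async exception types (`kubernetes.client.rest.ApiException` and its `kubernetes_asyncio` counterpart in the real clients); the wrapper name is likewise illustrative.

```python
import asyncio


# Illustrative stand-ins for the sync and async client exception types.
class SyncApiException(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


class AsyncApiException(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}


def should_retry(exc: BaseException) -> bool:
    # One predicate covers both exception flavors, so the retry
    # behavior is identical on the sync and async code paths.
    if isinstance(exc, (SyncApiException, AsyncApiException)):
        return exc.status in RETRYABLE_STATUS_CODES
    return False


async def call_with_retry_async(coro_factory, attempts: int = 3, delay: float = 0.0):
    """Await coro_factory(), retrying retry-worthy errors."""
    for attempt in range(1, attempts + 1):
        try:
            return await coro_factory()
        except Exception as exc:
            if attempt == attempts or not should_retry(exc):
                raise
            await asyncio.sleep(delay)
```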

@jscheffl
Contributor

> Sorry, I found out that I missed checking for the async exception type. I updated the code to catch both sync and async exceptions.

Thanks for updating!
Do you think it is stable now, based on your experience? Shall we keep the PR open a bit longer, or is it finally ready?
(I've checked the code three times now, and each time I'm more excited.)

@jscheffl jscheffl requested a review from potiuk November 20, 2025 20:56
Member

@potiuk potiuk left a comment


It looks good from my side as well.

@potiuk
Member

potiuk commented Nov 22, 2025

It looks like it is handling a LOT more common issues in a LOT more generic way.

@jscheffl jscheffl merged commit bb2cc41 into apache:main Nov 22, 2025
95 checks passed
Copilot AI pushed a commit to jason810496/airflow that referenced this pull request Dec 5, 2025
…e#58397)

* Move retry handling to the hook layer and update PodManager accordingly

* Removed overlapping code

* Clean up code

* Detailed logging and use of autouse fixture

* move no wait fixture into conftest

* Disabled no_retry_wait patch for explicitly marked unit tests.

* Fix unit test

* Generic retry logic can handle async and sync kubernetes api exceptions

---------

Co-authored-by: AutomationDev85 <AutomationDev85>
itayweb pushed a commit to itayweb/airflow that referenced this pull request Dec 6, 2025

Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues


3 participants