Conversation

@AutomationDev85
Contributor

Overview

We are encountering frequent "Too Many Requests" (HTTP 429) responses from the Kubernetes API when scaling up nodes in our Kubernetes cluster. A prior change (see PR #58033) introduced retry handling in the PodManager, but some PodManager methods bypass that logic by using the KubernetesHook API client directly. This change therefore moves the primary retry mechanism to the KubernetesHook level and adds targeted retry handling only to the PodManager methods that invoke the API client directly.

Retry behavior is also refined so that only retry-worthy status codes and errors are retried.

We welcome your feedback on this change!

Details of change:

  • Retry logic centralized at the KubernetesHook level.
  • PodManager now retries only for methods that directly invoke the Kubernetes API client.
  • Retries limited to transient, retry-worthy status codes and network errors.
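The refined behavior described above can be sketched roughly as follows. All names here (the exception class, status-code set, and retry helper) are illustrative stand-ins, not the provider's actual implementation; the real code builds on the Kubernetes Python client's `ApiException`.

```python
import time


# Hypothetical stand-in for kubernetes.client.rest.ApiException.
class ApiException(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


# Only transient conditions are worth retrying; 4xx errors like 404
# or 403 fail fast instead of wasting attempts.
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}


def is_retryable(exc: BaseException) -> bool:
    """Return True only for retry-worthy API errors."""
    return isinstance(exc, ApiException) and exc.status in RETRYABLE_STATUS_CODES


def call_with_retry(fn, attempts: int = 3, base_delay: float = 0.0):
    """Call fn(), retrying retry-worthy errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise immediately on the last attempt or for
            # non-retryable errors.
            if attempt == attempts or not is_retryable(exc):
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Centralizing a predicate like `is_retryable` at the hook level is what lets every caller share one definition of "retry-worthy" instead of each method rolling its own.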

Contributor

@jscheffl jscheffl left a comment


Looks very good to me. Thanks for the cleanup and consolidation!

I'll leave this open for a few days for a second pair of eyes; otherwise I'd propose merging before the next provider wave.

One small nit, but it's non-blocking.

@AutomationDev85 AutomationDev85 changed the title KuberetesPodOperator: Rework of Kubernetes API retry behavior KubernetesPodOperator: Rework of Kubernetes API retry behavior Nov 19, 2025
Contributor

@jscheffl jscheffl left a comment


Very cool now! Thanks!

@AutomationDev85
Contributor Author

Sorry, I found out that I missed checking for the async exception type. I updated the code to catch both sync and async exceptions.
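The fix can be illustrated with a single predicate that treats both client flavors uniformly. The class names below are hypothetical stand-ins for the sync and async exception types (`kubernetes.client.rest.ApiException` and its `kubernetes_asyncio` counterpart in the real clients); the wrapper name is likewise illustrative.

```python
import asyncio


# Illustrative stand-ins for the sync and async client exception types.
class SyncApiException(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


class AsyncApiException(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}


def should_retry(exc: BaseException) -> bool:
    # One predicate covers both exception flavors, so the retry
    # behavior is identical on the sync and async code paths.
    if isinstance(exc, (SyncApiException, AsyncApiException)):
        return exc.status in RETRYABLE_STATUS_CODES
    return False


async def call_with_retry_async(coro_factory, attempts: int = 3, delay: float = 0.0):
    """Await coro_factory(), retrying retry-worthy errors."""
    for attempt in range(1, attempts + 1):
        try:
            return await coro_factory()
        except Exception as exc:
            if attempt == attempts or not should_retry(exc):
                raise
            await asyncio.sleep(delay)
```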

@jscheffl
Contributor

> Sorry, I found out that I missed checking for the async exception type. I updated the code to catch both sync and async exceptions.

Thanks for updating!
Do you think it is stable now, based on your experience? Shall we keep the PR open a bit longer, or is it finally ready?
(I've checked the code three times now, and each time I'm more excited.)

@jscheffl jscheffl requested a review from potiuk November 20, 2025 20:56
Member

@potiuk potiuk left a comment


It looks good from my side as well.

@potiuk
Member

potiuk commented Nov 22, 2025

It looks like it is handling a LOT more common issues in a LOT more generic way.

@jscheffl jscheffl merged commit bb2cc41 into apache:main Nov 22, 2025
95 checks passed
Copilot AI pushed a commit to jason810496/airflow that referenced this pull request Dec 5, 2025
…e#58397)

* Move retry handling to the hook layer and update PodManager accordingly

* Removed overlapping code

* Clean up code

* Detailed logging and use of autouse fixture

* move no wait fixture into conftest

* Disabled no_retry_wait patch for explicitly marked unit tests.

* Fix unit test

* Generic retry logic can handle async and sync kubernetes api exceptions

---------

Co-authored-by: AutomationDev85 <AutomationDev85>
itayweb pushed a commit to itayweb/airflow that referenced this pull request Dec 6, 2025

Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues


3 participants