KubernetesPodOperator: Rework of Kubernetes API retry behavior #58397
Conversation
jscheffl
left a comment
Looks very good to me. Thanks for the cleanup and consolidation!
I'll leave this open for a second pair of eyes for a few days; otherwise I'd propose merging before the next provider wave.
One small nit, but it's non-blocking.
Resolved review threads:
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/kubernetes_helper_functions.py (two threads)
providers/cncf/kubernetes/tests/unit/cncf/kubernetes/utils/test_pod_manager.py (outdated)
jscheffl
left a comment
Very cool now! Thanks!
Sorry, I found out that I missed checking for the async exception type. I updated the code to catch both sync and async exceptions.
Thanks for updating!
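For illustration only, a minimal sketch of what handling both exception types could look like when deciding whether to retry. It assumes kubernetes_asyncio is the async client in use; the helper name and status-code list are made up for this example and are not taken from the PR:

```python
# Hypothetical helper, not the PR's actual code: decide whether an exception
# raised by either the sync or the async Kubernetes client is worth retrying.
from kubernetes.client.exceptions import ApiException as SyncApiException
from kubernetes_asyncio.client.exceptions import ApiException as AsyncApiException

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}  # assumed list


def should_retry_api_error(exc: BaseException) -> bool:
    """Retry only transient API errors, whether raised by the sync or async client."""
    if isinstance(exc, (SyncApiException, AsyncApiException)):
        return exc.status in RETRYABLE_STATUS_CODES
    return False
```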
potiuk
left a comment
It looks good from my side as well.
It looks like it is handling a LOT more common issues in a LOT more generic way.
…e#58397)

* Move retry handling to the hook layer and update PodManager accordingly
* Removed overlapping code
* Clean up code
* Detailed logging and use of autouse fixture
* move no wait fixture into conftest
* Disabled no_retry_wait patch for explicitly marked unit tests.
* Fix unit test
* Generic retry logic can handle async and sync kubernetes api exceptions

---------

Co-authored-by: AutomationDev85 <AutomationDev85>
Overview
We are encountering "Too Many Requests" (HTTP 429) errors from the Kubernetes API when scaling up nodes in our Kubernetes cluster. A prior change (PR #58033) already introduced retry handling in the PodManager, but some PodManager methods bypass that logic by accessing the KubernetesHook API client directly. This change therefore moves the primary retry mechanism into the KubernetesHook and adds targeted retry handling only to the specific PodManager functions that still access the API client directly.
Retry behavior is also refined so that only retry-worthy status codes and errors are retried.
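For illustration only, a hedged sketch of the general idea (hook-level retry that fires only on retry-worthy status codes). The class, decorator, and status-code list below are hypothetical stand-ins built with tenacity and the sync Kubernetes client, not the provider's actual implementation:

```python
import tenacity
from kubernetes import client
from kubernetes.client.exceptions import ApiException

# Assumed set of "retry-worthy" status codes; the real list lives in the provider code.
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}


def _is_retryable_error(exc: BaseException) -> bool:
    """Retry only transient Kubernetes API errors such as HTTP 429."""
    return isinstance(exc, ApiException) and exc.status in RETRYABLE_STATUS_CODES


# Decorator applied to hook methods that call the Kubernetes API, so that callers
# such as PodManager inherit the retry behavior without their own retry code.
api_retry = tenacity.retry(
    reraise=True,
    stop=tenacity.stop_after_attempt(5),
    wait=tenacity.wait_exponential(multiplier=1, max=60),
    retry=tenacity.retry_if_exception(_is_retryable_error),
)


class ExampleHook:
    """Hypothetical stand-in for KubernetesHook; only for illustration."""

    def __init__(self) -> None:
        self.core_v1 = client.CoreV1Api()

    @api_retry
    def get_pod(self, name: str, namespace: str):
        return self.core_v1.read_namespaced_pod(name=name, namespace=namespace)
```

With the retry applied at the hook layer, PodManager code that goes through the hook picks up the behavior automatically; only the few PodManager functions that talk to the API client directly still need a targeted retry of their own.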
We welcome your feedback on this change!
Details of change: