Ensure AsyncKubernetesHook.watch_pod_events streams events until pod completion/termination #60532
Conversation
…nts when the Kubernetes watch stream ended (e.g. due to timeout_seconds), even while the pod was still running. This change reconnects on watch termination, resumes from the last observed resourceVersion, restarts on stale resourceVersion errors (410), and stops only when the pod completes or is deleted. Permission-denied watches still fall back to polling. As part of this fix, kubeconfig loading is cached and _load_config no longer returns an API client, clarifying its responsibility and avoiding repeated config loading during watch reconnects. Tests cover reconnection behavior, stale resourceVersion recovery, and clean termination on pod completion or deletion.
Requesting review for this.

Perhaps you could help with the review?

@SameerMesiah97 I can, but please be aware that I do Airflow reviews and contributions as a side job, so please have some patience with reviews.
jscheffl left a comment:
From pure code reading, the PR looks okay to me. I am not sure, though: can you tell me why you made changes to config loading in parallel with the changes to watching? There may be a reason I am not seeing, but that part might be better as a separate PR.

As I am not an expert on the K8s event watch, and this feature was introduced by @wolfdn: can you help me with an opinion on this? Was there a reason you did not use a watch loop for the events?
The modified implementation for `watch_pod_events` …

I would love to hear feedback from @wolfdn on this as well.
@SameerMesiah97 Looks good to me! There was no specific reason not to create a watch loop for the events - such a loop is indeed necessary to pick up the watch connection again after it has timed out or terminated for some other reason.
Description
This change refactors `watch_pod_events` so that it continues watching events for the full lifecycle of the target pod, rather than stopping after a single watch stream terminates. The new implementation now (a rough, illustrative sketch of this loop follows the list):

- reconnects whenever a single watch stream terminates (e.g. due to `timeout_seconds`);
- resumes from the last observed `resourceVersion`;
- restarts the watch from scratch on stale `resourceVersion` errors (HTTP 410);
- stops only when the pod completes or is deleted;
- still falls back to polling when the watch is permission-denied.
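For illustration, here is a minimal sketch of such a reconnect loop, assuming the async client from `kubernetes_asyncio`. The function and helper names (`watch_pod_events_until_done`, `_pod_is_gone_or_done`) and the exact way a stale `resourceVersion` surfaces (assumed here to raise `ApiException` with status 410) are simplifications, not the hook's actual code:

```python
from kubernetes_asyncio import client, watch
from kubernetes_asyncio.client.exceptions import ApiException

TERMINAL_POD_PHASES = {"Succeeded", "Failed"}


async def _pod_is_gone_or_done(v1, name, namespace):
    """True once the pod has completed or been deleted (the loop's stop condition)."""
    try:
        pod = await v1.read_namespaced_pod(name=name, namespace=namespace)
    except ApiException as exc:
        if exc.status == 404:  # pod deleted while we were watching
            return True
        raise
    return pod.status.phase in TERMINAL_POD_PHASES


async def watch_pod_events_until_done(v1, name, namespace, timeout_seconds=30):
    """Yield events for one pod, reconnecting whenever a single watch stream ends."""
    resource_version = None
    while True:
        try:
            async with watch.Watch().stream(
                v1.list_namespaced_event,
                namespace=namespace,
                field_selector=f"involvedObject.name={name}",
                resource_version=resource_version,
                timeout_seconds=timeout_seconds,  # bounds this connection only
            ) as stream:
                async for event in stream:
                    # Remember our position so the next connection can resume here.
                    resource_version = event["object"].metadata.resource_version
                    yield event["object"]
        except ApiException as exc:
            if exc.status == 410:  # stale resourceVersion: start a fresh watch
                resource_version = None
                continue
            raise  # a real hook could fall back to polling on 403 at this point
        # The stream ended (e.g. server-side timeout): keep going unless the pod is done.
        if await _pod_is_gone_or_done(v1, name, namespace):
            return
```

The key design point is that `timeout_seconds` is passed to each individual watch connection, while the surrounding loop, the persisted `resource_version`, and the terminal-pod check decide when streaming actually ends.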
This ensures that `watch_pod_events` continues yielding events for the full lifecycle of a pod instead of silently stopping after `timeout_seconds`.

Rationale
The Kubernetes Watch API enforces server-side timeouts, meaning a single watch stream is not guaranteed to remain open indefinitely. The previous implementation treated `timeout_seconds` as an implicit upper bound on the total duration of event streaming, causing the generator to stop yielding events after the first watch termination, even while the pod was still running.
This behavior is surprising and contradicts what users reasonably expect from the method name (`watch_pod_events`), the docstring, and standard Kubernetes watch semantics. The updated implementation aligns with Kubernetes best practices by treating watch termination as a recoverable condition and transparently reconnecting until the pod reaches a terminal lifecycle state.

Backwards Compatibility
This change does not alter the public API or method signature. However, it does change runtime behavior: `timeout_seconds` now applies only to individual watch connections, not the overall duration of event streaming.

While it is possible that some users rely on the previous behavior, it is more likely that existing deployments have implemented workarounds (e.g. external loops or polling) to compensate for the premature termination. The new behavior is consistent with documented intent and Kubernetes conventions, and therefore adheres to the principle of least surprise.
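To illustrate the new semantics, here is a hedged usage sketch. It assumes the illustrative `watch_pod_events_until_done` generator from the sketch above is importable (the module name `my_watch_sketch`, pod name `my-pod`, and namespace `default` are placeholders):

```python
import asyncio

from kubernetes_asyncio import client, config

# Hypothetical import: the illustrative generator sketched in the Description above.
from my_watch_sketch import watch_pod_events_until_done


async def main() -> None:
    await config.load_kube_config()
    async with client.ApiClient() as api_client:
        v1 = client.CoreV1Api(api_client)
        # timeout_seconds bounds each individual watch connection; the stream as a
        # whole keeps yielding events until the pod completes or is deleted.
        async for event in watch_pod_events_until_done(
            v1, "my-pod", "default", timeout_seconds=30
        ):
            print(event.type, event.reason, event.message)


asyncio.run(main())
```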
Tests
Added unit tests to validate the following expected behaviors (a simplified, self-contained sketch of such a test follows this list):

- the watch reconnects and keeps yielding events after a watch stream terminates while the pod is still running;
- the watch restarts cleanly after a stale `resourceVersion` (HTTP 410) error;
- event streaming terminates once the pod completes or is deleted.
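For reference, a self-contained sketch of the kind of reconnection test this adds, written against a minimal stand-in loop rather than the real hook (the names `reconnecting_stream`, `connect`, and the fake connection data are illustrative only):

```python
import asyncio


async def reconnecting_stream(connect, pod_is_terminal):
    """Stand-in for the reconnect loop: consume watch connections until the pod is done."""
    while True:
        async for event in connect():
            yield event
        if await pod_is_terminal():
            return


def test_stream_survives_watch_timeouts_and_stops_on_completion():
    # Each inner list represents one watch connection that ends (e.g. server-side timeout).
    connections = [
        ["Scheduled", "Pulling"],
        ["Pulled", "Started"],
        ["Completed"],
    ]
    phases = iter(["Running", "Running", "Succeeded"])

    def connect():
        async def _one_connection():
            for event in connections.pop(0):
                yield event

        return _one_connection()

    async def pod_is_terminal():
        return next(phases) == "Succeeded"

    async def collect():
        return [event async for event in reconnecting_stream(connect, pod_is_terminal)]

    events = asyncio.run(collect())
    # All events from every connection are seen, and streaming stops after completion.
    assert events == ["Scheduled", "Pulling", "Pulled", "Started", "Completed"]
```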
Existing tests have been updated to account for the addition of pod state inspection in `watch_pod_events`.

Notes
- `_load_config` is now guarded to ensure configuration is loaded once per hook instance; it no longer returns an API client. API client instantiation is now solely the responsibility of `get_conn`, enabling reconnection in `watch_pod_events` without redundant configuration reloads.
- `api_client_from_kubeconfig_file`, the helper used by `_load_config` to construct and return an API client, has been removed. Its call site has been replaced with a call to `async_config.load_kube_config`.

Closes: #60495
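As an aside, a minimal illustration of the guarded configuration loading described in the notes above (not the provider's actual code; the class name and attributes are assumptions):

```python
from kubernetes_asyncio import client, config as async_config


class ExampleAsyncHook:
    """Illustrative only: guard kubeconfig loading and keep client creation in get_conn."""

    def __init__(self, config_file=None):
        self.config_file = config_file
        self._config_loaded = False  # guard so the kubeconfig is loaded at most once

    async def _load_config(self) -> None:
        # No API client is returned here; this only loads configuration.
        if self._config_loaded:
            return
        await async_config.load_kube_config(config_file=self.config_file)
        self._config_loaded = True

    async def get_conn(self) -> client.ApiClient:
        # Sole place where an API client is instantiated, so watch reconnects in
        # watch_pod_events do not trigger repeated config loading.
        await self._load_config()
        return client.ApiClient()
```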