
Conversation

@SameerMesiah97 (Contributor) commented Jan 14, 2026

Description

This change refactors watch_pod_events so that it continues watching events for the full lifecycle of the target pod, rather than stopping after a single watch stream terminates.

The new implementation now:

  • Reconnects automatically when a watch stream terminates (e.g. server-side timeout).
  • Resumes watching from the last observed resourceVersion.
  • Handles Kubernetes 410 Gone errors by restarting the watch from the current state.
  • Terminates cleanly when the pod completes or is deleted.

This ensures that watch_pod_events continues yielding events for the full lifecycle of a pod instead of silently stopping after timeout_seconds.
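For illustration, a minimal sketch of such a reconnecting watch loop using kubernetes_asyncio is shown below. This is not the code in this PR; the function name, field selector, and terminal-phase set are assumptions chosen for the example.

```python
from __future__ import annotations

from kubernetes_asyncio import client, watch
from kubernetes_asyncio.client.rest import ApiException

TERMINAL_PHASES = {"Succeeded", "Failed"}  # assumed terminal pod phases


async def watch_pod_events_sketch(
    v1: client.CoreV1Api, name: str, namespace: str, timeout_seconds: int = 30
):
    """Yield events for a pod, reconnecting until the pod completes or is deleted."""
    resource_version: str | None = None
    field_selector = f"involvedObject.kind=Pod,involvedObject.name={name}"
    while True:
        try:
            w = watch.Watch()
            kwargs = {"timeout_seconds": timeout_seconds}
            if resource_version:
                kwargs["resource_version"] = resource_version  # resume where the last stream stopped
            async for event in w.stream(
                v1.list_namespaced_event,
                namespace=namespace,
                field_selector=field_selector,
                **kwargs,
            ):
                resource_version = event["object"].metadata.resource_version
                yield event["object"]
        except ApiException as exc:
            if exc.status == 410:  # stale resourceVersion: restart the watch from current state
                resource_version = None
                continue
            raise
        # The stream ended normally (e.g. server-side timeout); decide whether to reconnect.
        try:
            pod = await v1.read_namespaced_pod(name=name, namespace=namespace)
        except ApiException as exc:
            if exc.status == 404:  # pod deleted: stop cleanly
                return
            raise
        if pod.status.phase in TERMINAL_PHASES:
            return  # pod completed: stop streaming
```

The essential points are that the last observed resourceVersion is carried across reconnects, a 410 response clears it so the watch restarts from the current state, and termination is decided by inspecting the pod rather than by the watch timeout.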

Rationale

The Kubernetes Watch API enforces server-side timeouts, meaning a single watch stream is not guaranteed to remain open indefinitely. The previous implementation treated timeout_seconds as an implicit upper bound on the total duration of event streaming, causing the generator to stop yielding events after the first watch termination — even while the pod was still running.

This behavior is surprising and contradicts what users reasonably expect from the method name (watch_pod_events), the docstring, and standard Kubernetes watch semantics. The updated implementation aligns with Kubernetes best practices by treating watch termination as a recoverable condition and transparently reconnecting until the pod reaches a terminal lifecycle state.

Backwards Compatibility

This change does not alter the public API or method signature. However, it does change runtime behavior:

  • timeout_seconds now applies only to individual watch connections, not the overall duration of event streaming.
  • Event streaming continues until pod completion or deletion instead of stopping silently after a timeout.

While it is possible that some users rely on the previous behavior, it is more likely that existing deployments have implemented workarounds (e.g. external loops or polling) to compensate for the premature termination. The new behavior is consistent with documented intent and Kubernetes conventions, and therefore adheres to the principle of least surprise.

Tests

Added unit tests to validate the following expected behaviors:

  • Reconnects and continues streaming events after a watch stream ends (e.g. timeout).
  • Restarts the watch when Kubernetes returns 410 Gone due to a stale resourceVersion.
  • Stops cleanly when the pod is deleted (404).
  • Stops immediately when the pod reaches a terminal phase (Succeeded or Failed).

Existing tests have been updated to account for the addition of pod state inspection in watch_pod_events.
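As a rough illustration of how the reconnection test can be structured, the sketch below exercises the illustrative loop shown earlier (not the real hook). It assumes pytest-asyncio is installed and that the loop lives in a hypothetical module named sketch.

```python
from types import SimpleNamespace
from unittest import mock

import pytest


def _event(rv: str):
    # Minimal stand-in for a watch event wrapping an object with a resourceVersion.
    return {"object": SimpleNamespace(metadata=SimpleNamespace(resource_version=rv))}


class _FakeStream:
    """Async iterator that yields fixed events, then ends like a server-side timeout."""

    def __init__(self, events):
        self._it = iter(events)

    def __aiter__(self):
        return self

    async def __anext__(self):
        try:
            return next(self._it)
        except StopIteration:
            raise StopAsyncIteration


@pytest.mark.asyncio
async def test_reconnects_until_pod_succeeds():
    from sketch import watch_pod_events_sketch  # hypothetical module holding the sketch above

    v1 = mock.AsyncMock()
    # The pod is still Running after the first stream ends, Succeeded after the second.
    v1.read_namespaced_pod.side_effect = [
        SimpleNamespace(status=SimpleNamespace(phase="Running")),
        SimpleNamespace(status=SimpleNamespace(phase="Succeeded")),
    ]
    with mock.patch("sketch.watch.Watch") as watch_cls:
        watch_cls.return_value.stream.side_effect = [
            _FakeStream([_event("1")]),
            _FakeStream([_event("2")]),
        ]
        seen = [e async for e in watch_pod_events_sketch(v1, "mypod", "default")]

    # Events from both watch connections were yielded, i.e. the loop reconnected once.
    assert [e.metadata.resource_version for e in seen] == ["1", "2"]
    assert watch_cls.return_value.stream.call_count == 2
```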

Notes

  • _load_config is now guarded so that configuration is loaded once per hook instance; it no longer returns an API client. API client instantiation is now solely the responsibility of get_conn, enabling reconnection in watch_pod_events without redundant configuration reloads (see the sketch after this list).
  • The internal helper api_client_from_kubeconfig_file, which _load_config used to construct and return an API client, has been removed. Its call site now calls async_config.load_kube_config directly.
  • The exception message raised when multiple configuration sources are supplied has been clarified to more accurately describe the error.
  • Polling fallback behavior is preserved and now continues until the pod reaches a terminal lifecycle state, matching the updated watch semantics.
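To illustrate the guard described in the first note, here is a minimal sketch of loading the kube config at most once per instance while keeping API client construction in get_conn. The class and attribute names are hypothetical, not the hook's actual code.

```python
from __future__ import annotations

from kubernetes_asyncio import client, config as async_config


class AsyncHookSketch:
    """Illustrative only: loads kube config at most once per hook instance."""

    def __init__(self, config_file: str | None = None):
        self._config_file = config_file
        self._config_loaded = False  # hypothetical guard flag

    async def _load_config(self) -> None:
        # Guard: skip reloading (and re-logging) when the watch reconnects.
        if self._config_loaded:
            return
        await async_config.load_kube_config(config_file=self._config_file)
        self._config_loaded = True

    async def get_conn(self) -> client.ApiClient:
        # API client construction stays here; _load_config only loads configuration.
        await self._load_config()
        return client.ApiClient()
```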

Closes: #60495

…nts when

the Kubernetes watch stream ended (e.g. due to timeout_seconds), even while
the pod was still running.

This change reconnects on watch termination, resumes from the last observed
resourceVersion, restarts on stale resourceVersion errors (410), and stops
only when the pod completes or is deleted. Permission-denied watches still fall
back to polling.

As part of this fix, kubeconfig loading is cached and _load_config no longer
returns an API client, clarifying its responsibility and avoiding repeated
config loading during watch reconnects.

Tests cover reconnection behavior, stale resourceVersion recovery, and clean
termination on pod completion or deletion.
@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Jan 14, 2026
@SameerMesiah97 SameerMesiah97 changed the title Ensure AsyncKubernetesHook.watch_pod_events streams events until pod termination Ensure AsyncKubernetesHook.watch_pod_events streams events until pod completion/termination Jan 14, 2026
@SameerMesiah97 (Contributor, Author)

Requesting review for this.

@SameerMesiah97 (Contributor, Author)

@jscheffl

Perhaps you could help with review?

@jscheffl (Contributor)

@SameerMesiah97 I can, but please be aware that I do Airflow reviews and contributions as a side job, so please have some patience with reviews.

@jscheffl (Contributor) left a comment

From pure code reading, the PR looks okay to me. I am not sure, though: can you tell me why you made the changes to config loading in parallel with the changes to watching? It feels like there might be a reason I do not see, and this might be better as a separate PR.

I am not an expert on the K8s event watch; this feature was introduced by @wolfdn. Can you help me with an opinion on this? Was there a reason that you did not make a watch loop for the events?

@SameerMesiah97 (Contributor, Author) commented Jan 24, 2026

From pure code reading, the PR looks okay to me. I am not sure, though: can you tell me why you made the changes to config loading in parallel with the changes to watching? It feels like there might be a reason I do not see, and this might be better as a separate PR.

The modified implementation of watch_pod_events now calls _load_config multiple times through get_conn. It was indeed possible for the new implementation to work without modifying _load_config; in fact, this is what I initially implemented. But then I observed that the task logs were being polluted with config-loading output interleaved with the event messages from the target pod. Since this is an observability function, I added a small config-loading guard to ensure that the config is loaded once per hook instance and to avoid repeated log spam when the watch reconnects. That's why the changes ended up in the same PR.

I am not an expert on the K8s event watch; this feature was introduced by @wolfdn. Can you help me with an opinion on this? Was there a reason that you did not make a watch loop for the events?

I would love to hear feedback from @wolfdn on this as well.

@wolfdn (Contributor) commented Jan 26, 2026

@SameerMesiah97 Looks good to me! There was no specific reason to not create a watch loop for the events - such a loop is indeed necessary to pick up the watch connection again after it has timed out / terminated for some other reason.
Thanks for fixing this!

@jscheffl jscheffl merged commit 52867d9 into apache:main Jan 26, 2026
106 checks passed
shreyas-dev pushed a commit to shreyas-dev/airflow that referenced this pull request Jan 29, 2026
…nts when (apache#60532)

