Contributor

@m1racoli m1racoli commented Oct 21, 2025

While KubernetesPodTrigger is polling the KPO pod, a large poll_interval (e.g. 900 seconds) can mean that the completed pod has already been cleaned up between polls. This causes the following chain of events:

  1. 404 pod not found error inside KubernetesPodTrigger
  2. KubernetesPodOperator.trigger_reentry hits another 404 pod not found in self.hook.get_pod
  3. KubernetesPodOperator._clean is called as part of the finally block
  4. KubernetesPodOperator.pod_manager.await_pod_completion fails to handle self.pod being None and raises AttributeError: 'NoneType' object has no attribute 'metadata'
  5. the stack trace of the original error from step 1 is never properly printed
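
Steps 2 through 4 can be reproduced without Airflow. The following is only a minimal sketch: `PodNotFound`, `get_pod`, `clean`, `reentry`, and `run` are hypothetical stand-ins and not the provider's actual code. It shows how cleanup inside a `finally` block turns a 404 into an `AttributeError` when the pod reference is still `None`.

```python
class PodNotFound(Exception):
    """Hypothetical stand-in for a 404 error from the Kubernetes API."""


def get_pod(name):
    # Simulates the pod having been cleaned up between two polls.
    raise PodNotFound(f"pod {name!r} not found (404)")


def clean(pod):
    # Stand-in for cleanup code that assumes `pod` is always set.
    return pod.metadata  # raises AttributeError when pod is None


def reentry(name):
    pod = None
    try:
        pod = get_pod(name)  # raises, so `pod` stays None
    finally:
        clean(pod)  # raises a second error while the first is in flight


def run(name):
    # Return the exception that finally surfaces, for inspection.
    try:
        reentry(name)
    except Exception as exc:
        return exc
    return None
```

In this sketch the surfaced exception is the `AttributeError`; the 404 is only reachable through its `__context__`. The error raised inside the trigger itself (step 1) lives in a separate process and is not modelled here.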

We improve this situation with the following adjustments:

  • log the original exception with stack trace in the trigger for better visibility of the original error
  • log the actual poll interval being used when starting the trigger
  • return from KubernetesPodOperator._clean early if self.pod is None
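
The three adjustments can be sketched with the standard `logging` module alone. This is an illustrative approximation under stated assumptions, not the merged change: `run_trigger`, `poll_once`, `clean`, the logger name, and the event dictionary shape are all hypothetical.

```python
import logging

log = logging.getLogger("pod_trigger_sketch")  # hypothetical logger name


class PodNotFound(Exception):
    """Hypothetical stand-in for a 404 error from the Kubernetes API."""


def poll_once():
    # Simulates polling a pod that was already cleaned up.
    raise PodNotFound("pod not found (404)")


def run_trigger(poll_interval):
    # Log the interval actually in use, with plain %-style formatting
    # rather than structured key/value logging.
    log.info("Checking pod every %s seconds", poll_interval)
    try:
        poll_once()
    except Exception as exc:
        # log.exception records the message together with the stack trace,
        # so the original error is visible in the trigger's own log.
        log.exception("Unexpected error while waiting for pod")
        return {"status": "error", "message": str(exc)}
    return {"status": "success"}


def clean(pod):
    # Guard: nothing to clean up when the pod could not be fetched.
    if pod is None:
        return None
    return pod.metadata
```

With the guard in place, `clean(None)` returns quietly instead of raising, so a missing pod no longer produces a secondary `AttributeError` during cleanup.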

Logging the exception in the trigger makes the error's stack trace easier to review in the logs and does not depend on `trigger_reentry` to do so.

Logging the poll interval on trigger start makes it easier to know which value for poll_interval is actually being used.

The pod may no longer exist when `trigger_reentry` runs. In that case `self.hook.get_pod` raises a 404 API exception, which then leads to a call to `self._clean`, which assumed `self.pod` was not None.

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Oct 21, 2025
@ashb ashb merged commit 54fa258 into apache:main Oct 28, 2025
91 checks passed
Lzzz666 pushed a commit to Lzzz666/airflow that referenced this pull request Oct 30, 2025
…che#56976)

* refactor: log exception in KubernetesPodTrigger

This makes the error's stacktrace easier to review in the logs and does
not depend on `trigger_reentry` to do so.

* refactor: log poll_interval on trigger start

This makes it easier to know which value for poll_interval is actually
being used.

* fix: handle missing pod when trying to cleanup after trigger reentry

It can happen that the pod doesn't exist anymore when running `trigger_reentry`.

In that case `self.hook.get_pod` will cause a 404 API exception, which
then will end up calling `self._clean` which assumed `self.pod` not to be None.

* fix: don't use structured logging when logging poll interval

This doesn't work in Airflow 2.