@peloyeje peloyeje commented Jul 19, 2024

Fix based on a real-world issue seen in production, where a failing pod does not make the associated task fail.

Running theory: when a pod fails while in self.pod_manager.fetch_container_logs, the running property of the returned pod_log_status object is False, so we skip the deferrable call and jump directly to self._clean.
The issue is that the event object is never refreshed and still carries the running status, so we hit this code path:

if event["status"] == "running":
    return

and the task instance returns without error.

Proposed fix: call defer regardless of the pod status after fetching logs, so that the failed status is picked up during the next trigger run.
This adds a bit of delay to pod completion detection, but it is simple/stupid :)
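
To make the theory above concrete, here is a minimal toy model of the control flow. It is a sketch under stated assumptions, not the provider's actual code: `Deferred`, `TaskFailed`, `PodLogStatus`, `clean`, `reentry`, and the `always_defer` flag are all hypothetical stand-ins, and the real operator does considerably more. It only shows how a stale `event` combined with `running == False` can lead to a silent return, and how always deferring avoids it.

```python
from dataclasses import dataclass
from typing import Optional


class Deferred(Exception):
    """Hypothetical stand-in for handing control back to the trigger."""


class TaskFailed(Exception):
    """Hypothetical stand-in for the task failing."""


@dataclass
class PodLogStatus:
    """Toy version of the object returned after fetching logs."""
    running: bool
    last_log_time: Optional[str] = None


def clean(event: dict) -> None:
    # Mirrors the early return quoted in the description:
    # a "running" event skips all failure handling.
    if event["status"] == "running":
        return
    if event["status"] == "failed":
        raise TaskFailed("pod failed")


def reentry(event: dict, log_status: PodLogStatus, always_defer: bool) -> str:
    """Toy re-entry step; `always_defer` toggles the proposed fix."""
    if event["status"] == "running":
        # Before the fix: defer only while the container is still running.
        # With the fix: defer unconditionally so the next trigger run
        # delivers a fresh event.
        if always_defer or log_status.running:
            raise Deferred(log_status.last_log_time)
    clean(event)
    return "finished without error"


if __name__ == "__main__":
    stale_event = {"status": "running"}            # never refreshed
    stopped = PodLogStatus(running=False, last_log_time="t0")

    # Without the fix, the failed pod goes unnoticed.
    print(reentry(stale_event, stopped, always_defer=False))

    # With the fix, the step defers, and the next run sees the failure.
    try:
        reentry(stale_event, stopped, always_defer=True)
    except Deferred:
        try:
            reentry({"status": "failed"}, stopped, always_defer=True)
        except TaskFailed as exc:
            print(f"task failed as expected: {exc}")
```

The trade-off is the one mentioned above: one extra trigger round trip before completion is detected, in exchange for never acting on a stale event.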



@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Jul 19, 2024
- if pod_log_status.running:
-     self.log.info("Container still running; deferring again.")
-     self.invoke_defer_method(pod_log_status.last_log_time)
+ self.invoke_defer_method(pod_log_status.last_log_time)

Can you add a unit test to cover this change?

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 24, 2024
@github-actions github-actions bot closed this Oct 1, 2024
@peloyeje peloyeje deleted the fix/success-when-pod-fails-while-fetching-logs branch October 12, 2024 11:49