fix: always defer once more after log fetching to ensure pod completion is handled #40891

peloyeje · 2024-07-19T14:26:05Z

Fix based on a real-world issue seen in production, where a failing pod does not make the associated task fail.

Running theory: when a pod fails while in self.pod_manager.fetch_container_logs, the running property of the returned pod_log_status object is False, hence we skip the deferrable call and jump directly to self._clean.
But the issue is that the event object is never refreshed and still carries the running status, hence hitting this code path:

airflow/airflow/providers/cncf/kubernetes/operators/pod.py

Lines 793 to 794 in 4cbfcd7

    
            if event["status"] == "running":
                return

and making the task instance return without error.
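The theory above can be illustrated with a minimal, self-contained sketch. It is not the provider's code: `Deferred`, `PodLogStatus`, `clean`, `resume_before_fix`, and the `pod_phase` argument are simplified, hypothetical stand-ins that only model the control flow described here (a stale `running` event combined with a log fetch that reports the container has stopped).

```python
from dataclasses import dataclass
from typing import Optional


class Deferred(Exception):
    """Stand-in for the signal a deferrable operator raises to hand control back to its trigger."""


@dataclass
class PodLogStatus:
    """Simplified stand-in for the object returned by the log fetch."""
    running: bool
    last_log_time: Optional[str] = None


def clean(event, pod_phase):
    # Models the early return quoted above: the stale "running" event
    # means the pod's final state is never inspected.
    if event["status"] == "running":
        return
    if pod_phase == "Failed":
        raise RuntimeError("pod failed")


def resume_before_fix(event, pod_log_status, pod_phase):
    try:
        if event["status"] == "running":
            # Only defer again if the container was still running
            # when log fetching ended.
            if pod_log_status.running:
                raise Deferred()
    finally:
        clean(event, pod_phase)
    return "success"


# The pod failed during log fetching, yet the task is reported as successful.
result = resume_before_fix({"status": "running"}, PodLogStatus(running=False), "Failed")
print(result)
```

In this model the failure is lost because nothing refreshes `event` between the log fetch and the cleanup step.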

Proposed fix: call defer after fetching logs regardless of the pod status, so that the failed status is picked up during the next trigger run.
It adds a bit of delay to pod completion detection, but is simple/stupid :)
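A rough sketch of why the unconditional defer helps, again using hypothetical stand-ins (`Deferred`, `clean`, `resume_after_fix`, `run_task`) rather than the real operator and trigger: each deferral gives the trigger a chance to emit a fresh event, and that fresh event is what the cleanup step finally sees.

```python
class Deferred(Exception):
    """Stand-in for the signal a deferrable operator raises to hand control back to its trigger."""


def clean(event):
    # Same early return as the snippet quoted in the description.
    if event["status"] == "running":
        return
    if event["status"] == "failed":
        raise RuntimeError("pod failed")


def resume_after_fix(event):
    try:
        if event["status"] == "running":
            # Logs would be fetched here; the result's `running` flag is
            # deliberately not consulted before deferring again.
            raise Deferred()
    finally:
        clean(event)
    return "success"


def run_task(trigger_events):
    """Feed successive trigger events to the resume step until it stops deferring."""
    for event in trigger_events:
        try:
            return resume_after_fix(event)
        except Deferred:
            continue
    raise AssertionError("ran out of trigger events while still deferred")


# The first event is stale ("running"); the extra deferral lets the
# next event carry the real outcome.
print(run_task([{"status": "running"}, {"status": "success"}]))
```

The cost is one extra trigger round trip per task, which matches the delay mentioned above.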


eladkal · 2024-08-08T07:38:37Z

airflow/providers/cncf/kubernetes/operators/pod.py

-                    if pod_log_status.running:
-                        self.log.info("Container still running; deferring again.")
-                        self.invoke_defer_method(pod_log_status.last_log_time)
+                    self.invoke_defer_method(pod_log_status.last_log_time)


Can you add a unit test to cover this change?
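One possible shape for such a test, as a sketch only: `handle_running_event` is a hypothetical stand-in for the patched branch, and the operator is a plain `MagicMock`, so this does not exercise the real operator. The attribute names `pod_manager.fetch_container_logs`, `invoke_defer_method`, and `last_log_time` are taken from the diff above; calling `fetch_container_logs` without arguments is a simplification.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def handle_running_event(op):
    """Hypothetical stand-in for the patched branch: fetch logs, then always defer."""
    pod_log_status = op.pod_manager.fetch_container_logs()
    op.invoke_defer_method(pod_log_status.last_log_time)


def test_defers_even_when_container_stopped():
    op = MagicMock()
    # The container stopped while logs were being fetched.
    op.pod_manager.fetch_container_logs.return_value = SimpleNamespace(
        running=False, last_log_time="2024-07-19T14:26:05Z"
    )

    handle_running_event(op)

    # Deferral must happen regardless of `running`.
    op.invoke_defer_method.assert_called_once_with("2024-07-19T14:26:05Z")


test_defers_even_when_container_stopped()
```

A real test would presumably patch the operator's own methods and drive its resume path with a `running` event instead of calling a stand-in.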

github-actions · 2024-09-24T18:39:56Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

fix: always defer once more after log fetching to ensure pod completi…

4cbfcd7

…on is handled

peloyeje requested review from hussein-awala and jedcunningham as code owners July 19, 2024 14:26

boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Jul 19, 2024

Merge branch 'main' into fix/success-when-pod-fails-while-fetching-logs

74a5fc5

eladkal reviewed Aug 8, 2024


github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 24, 2024

github-actions bot closed this Oct 1, 2024

romsharon98 mentioned this pull request Oct 8, 2024

Fix mark as success when pod fails while fetching log #42815

Merged

peloyeje deleted the fix/success-when-pod-fails-while-fetching-logs branch October 12, 2024 11:49
