fix: pod finalizer removal and odd pod status #14088
Conversation
Signed-off-by: Alan Clucas <alan@clucas.org>
/retest
isubasinghe left a comment
This makes sense. I'm not entirely confident every edge case is accounted for, but let's ship it for now.
@shuangkun and @jswxstw, I'd like to hear your thoughts on these changes if you have time.
switch determinePodCleanupAction(selector, pod.Labels, strategy, workflowPhase, pod.Status.Phase, pod.Finalizers) {
case deletePod:
	woc.controller.queuePodForCleanupAfter(pod.Namespace, pod.Name, deletePod, delay)
case removeFinalizer:
Reasonable, but do we need to add
wfc.queuePodForCleanup(p.Namespace, p.Name, removeFinalizer)
here?
Yeah, oops. I had thought the final parameter was the action. Will fix.
Nice catch @shuangkun, I missed this in the review.
Maybe change this section to this:
action := determinePodCleanupAction(selector, pod.Labels, strategy, workflowPhase, pod.Status.Phase, pod.Finalizers)
if action == deletePod {
	woc.controller.queuePodForCleanupAfter(pod.Namespace, pod.Name, action, delay)
} else {
	woc.controller.queuePodForCleanup(pod.Namespace, pod.Name, action)
}
Indeed, this issue (kubernetes/kubernetes#98718) in older versions of Kubernetes can also cause inconsistencies between pod status and container status, which can lead to the workflow getting stuck.
Replaced by #14129
Motivation
Finalizers
If pod finalizers are in use they should not prevent pod deletion after the pod is complete.
For example: if you have a podGC.strategy of OnPodSuccess with a deleteDelayDuration set, and you delete the owning Workflow during the deleteDelayDuration, the pod will remain until deleteDelayDuration expires. If the workflow-controller is restarted during this window the pod is orphaned, with the finalizer still in place.
blockOwnerDeletion in the ownerReference of a pod does not prevent the owner (Workflow) from being deleted in all circumstances.
Wait Running whilst Pod Failed
It is possible for a node to disappear from a cluster without warning. In this case the Pod's ContainerStatus could remain Running (because the container never transitioned to any further state), whilst the Pod's own Status is Error. We have seen this in real clusters, but it is rare.
This PR attempts to recognise this case and set the Workflow Node status accordingly.
Modifications
When a pod has a finalizer on it and the workflow node running on it is Fulfilled, we don't need the pod any more, so always remove our finalizer if present. This allows the workflow to be deleted independently, and ownerReference deletion to propagate and delete the pod. It also takes care of some race conditions, and of the case where the only reference to a completed pod is in the delayed cleanup queue, which is not persistent across restarts.
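As a rough illustration of the idea (not the PR's actual code), here is a minimal client-go sketch; the podStatusFinalizer constant and the removeOurFinalizer helper are assumptions, and the real controller queues a removeFinalizer cleanup action rather than updating the pod inline:
package podcleanup

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Assumed finalizer name, for illustration only.
const podStatusFinalizer = "workflows.argoproj.io/status"

// removeOurFinalizer strips our finalizer from a completed pod so that
// ownerReference garbage collection can still delete the pod even if the
// owning Workflow is deleted first (hypothetical helper).
func removeOurFinalizer(ctx context.Context, kube kubernetes.Interface, pod *corev1.Pod) error {
	kept := make([]string, 0, len(pod.Finalizers))
	for _, f := range pod.Finalizers {
		if f != podStatusFinalizer {
			kept = append(kept, f)
		}
	}
	if len(kept) == len(pod.Finalizers) {
		return nil // our finalizer was not present; nothing to do
	}
	updated := pod.DeepCopy()
	updated.Finalizers = kept
	_, err := kube.CoreV1().Pods(updated.Namespace).Update(ctx, updated, metav1.UpdateOptions{})
	return err
}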
When a pod's status is Failed, always mark the workflow nodes on it as Failed. Previously you could get "leaving phase un-changed: wait container is not yet terminated" log messages, and there was no path out of this state. This change allows a path out of that state, adds a unit test to show it works, and also acknowledges this state when reconciling ContainerSets.
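A minimal, self-contained sketch of that rule, with a hypothetical NodePhase type and assessPodPhase helper standing in for the controller's real node-assessment logic:
package podcleanup

import (
	corev1 "k8s.io/api/core/v1"
)

// Hypothetical stand-in for the controller's node phase type.
type NodePhase string

const (
	NodeRunning   NodePhase = "Running"
	NodeSucceeded NodePhase = "Succeeded"
	NodeFailed    NodePhase = "Failed"
)

// assessPodPhase derives the workflow node phase from the pod. When a cluster
// node vanishes, pod.Status.Phase can be Failed while the wait container's
// status never progresses past running; trusting the pod-level phase here
// provides the exit path this PR adds.
func assessPodPhase(pod *corev1.Pod, waitContainerTerminated bool) NodePhase {
	if pod.Status.Phase == corev1.PodFailed {
		// Do not wait for the wait container: its status may never be updated.
		return NodeFailed
	}
	if !waitContainerTerminated {
		// Normal case: leave the phase unchanged until the wait container terminates.
		return NodeRunning
	}
	if pod.Status.Phase == corev1.PodSucceeded {
		return NodeSucceeded
	}
	return NodeFailed
}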
Verification
Added unit tests
Ran this in production with pod finalizers on and a PodGC strategy enabled. Without this change (vanilla 3.6.2) this would result in pods stuck in Terminating on a reasonably regular basis with the finalizer still on them. This has not happened with this change.