fix: Clean up pods of fulfilled nodes when manual retry. Fixes #12028 #12105
base: main
Conversation
argoproj#12028 Signed-off-by: oninowang <oninowang@tencent.com>
```diff
@@ -29,6 +29,7 @@ rules:
   - list
   - watch
   - delete
+  - patch
```
this is a pretty minor permission given the Server already has `delete`, but I was thinking this logic may make more sense on the Controller actually.
When the Controller detects a retry, it can check the Workflow's child Pods.
Need to think a bit more about how that would work, but that would preserve the existing separation of duties between Server and Controller, where the Server is just a simple intermediary for users that can be bypassed with correct RBAC.
The Server primarily reads and listens to changes, and its modifications are limited to signaling the Controller to perform an action (via a label, for instance). The Controller logic is actually responsible for performing the actions themselves (such as retrying).
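The label-signaling pattern described in that comment could be sketched as follows. This is a hypothetical illustration, not code from this PR: the label key `workflows.argoproj.io/retry-requested` is invented for the example. The Server would only apply a JSON merge patch like this to the Workflow object, and the Controller, watching Workflows, would notice the label and perform the actual Pod work.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildLabelPatch builds a JSON merge patch that sets a single label on a
// Kubernetes object. In the pattern described above, the Server sends a patch
// like this to the Workflow, and the Controller reacts to the label change.
func buildLabelPatch(key, value string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": map[string]string{key: value},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// Hypothetical label name used purely for illustration.
	p, err := buildLabelPatch("workflows.argoproj.io/retry-requested", "true")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(p))
}
```

With this split, the Server only ever needs `patch` on Workflows, never on Pods, which keeps the RBAC separation intact.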
Hmmm, I see that `podsToDelete` already broke the separation a bit... we honestly may want to refactor that too...
@terrytangyuan do you have any thoughts on this? Specifically, the current behavior with the Server having Pod modification logic actually invalidates your answer in #12027 (comment). I think it ideally should behave the way you described there (and how I described above), if possible.
You are right. RBAC modifications are required for my answer in #12027 (comment).
@terrytangyuan I was actually looking for your thoughts on the approach. I think we should refactor this logic so that the Server performs nothing but a label modification, and the Controller then actually deletes, resets, etc. child Pods as needed (as that is the Controller's responsibility, not the Server's).
@terrytangyuan I agree with @agilgur5. The Server should not update or delete anything other than Argo Workflow CRDs. If a user directly updates the Workflow spec, this will not work; the Controller should handle cleaning up the Workflow's Pods based on the GC strategy.
I remember @ishitasequeira proposed deleting all of a Workflow's Pods label-based (e.g. by workflow name) once the Workflow completes.
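The label-based cleanup idea mentioned above could look roughly like this. It is a sketch, not Argo's implementation: `deleteCollectionFn` stands in for a client-go `DeleteCollection`-style call, while `workflows.argoproj.io/workflow` is the label Argo actually places on the Pods it creates.

```go
package main

import "fmt"

// deleteCollectionFn abstracts a bulk-delete API call (e.g. client-go's
// DeleteCollection). It is a stand-in for illustration purposes.
type deleteCollectionFn func(namespace, labelSelector string) error

// cleanupWorkflowPods deletes all of a workflow's pods in one call using the
// workflow-name label, rather than tracking individual pod names.
func cleanupWorkflowPods(del deleteCollectionFn, namespace, workflowName string) error {
	// Argo labels every pod it creates with
	// workflows.argoproj.io/workflow=<name>, so one selector covers them all.
	selector := fmt.Sprintf("workflows.argoproj.io/workflow=%s", workflowName)
	return del(namespace, selector)
}

func main() {
	fake := func(ns, sel string) error {
		fmt.Printf("DELETE pods in %s matching %s\n", ns, sel)
		return nil
	}
	_ = cleanupWorkflowPods(fake, "argo", "my-wf")
}
```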
Julie agreed in #12419 (comment) as well, so I filed #12538 as a tracking issue for this refactor
SGTM
```diff
@@ -299,6 +300,20 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req
 		}
 	}

+	for _, podName := range podsToReset {
```
I was gonna say that this may make sense to place into a helper function, but I see that #7988 (comment) explicitly moved the `kubeClient` logic out of the helper function (when retry was introduced for archived workflows in #7988).
So this matches the existing behavior with `podsToDelete` above. If we refactored these into the Controller (per the above comment), this would go away though 🤔
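For context, the inlined loop amounts to something like the sketch below. Identifiers other than `podsToReset` and the `workflows.argoproj.io/completed` label are invented here; in the real code the Server calls `kubeClient` directly rather than going through an abstraction like `podPatcher`.

```go
package main

import "fmt"

// podPatcher abstracts the kubeClient call made inside the loop; the real
// code would call something like kubeClient.CoreV1().Pods(ns).Patch(...).
type podPatcher func(namespace, podName string, patch []byte) error

// resetPods relabels each fulfilled node's pod with completed=false so that
// PodGC reconsiders it after the retry, mirroring the loop over podsToReset.
func resetPods(patch podPatcher, namespace string, podsToReset []string) error {
	labelPatch := []byte(`{"metadata":{"labels":{"workflows.argoproj.io/completed":"false"}}}`)
	for _, podName := range podsToReset {
		if err := patch(namespace, podName, labelPatch); err != nil {
			return fmt.Errorf("failed to patch pod %s: %w", podName, err)
		}
	}
	return nil
}

func main() {
	var patched []string
	fake := func(ns, name string, _ []byte) error {
		patched = append(patched, ns+"/"+name)
		return nil
	}
	if err := resetPods(fake, "argo", []string{"step-a", "step-b"}); err != nil {
		panic(err)
	}
	fmt.Println(patched) // [argo/step-a argo/step-b]
}
```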
Thanks for taking the time to investigate and fix this! Per the in-line comments, we may want to refactor some existing logic that this PR highlights as potentially having broken the Controller/Server separation of responsibilities. I think the current code more or less makes sense within that existing context (the re-processing checks in the Controller are a bit confusing, though that's partially due to the existing logic; retries are also one of the most complex parts of the codebase), but we may want to change that existing code.
Fixes #12028
Motivation
PodGC strategy: OnWorkflowSuccess.
Pods of the previously successful steps were not cleaned up after a manual workflow retry.
Modifications
Pods of fulfilled nodes will be relabeled completed=false when the workflow is manually retried.
Verification
Workflow Demo