fix: Clean up pods of fulfilled nodes when manual retry. Fixes #12028 #12105
base: main
Conversation
argoproj#12028 Signed-off-by: oninowang <oninowang@tencent.com>
```diff
@@ -29,6 +29,7 @@ rules:
   - list
   - watch
   - delete
+  - patch
```
this is a pretty minor permission given the Server already has `delete`, but I was thinking this logic may make more sense on the Controller actually.
When the Controller detects a retry, it can check the Workflow's child Pods.
Need to think a bit more about how that would work, but that would preserve the existing separation of duties between Server and Controller, where the Server is just a simple intermediary for users that can be bypassed with correct RBAC.
The Server primarily reads and listens to changes, and its modifications are limited to signaling the Controller to perform an action (via a label, for instance). The Controller logic is actually responsible for performing the actions themselves (such as retrying).
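The label-signaling pattern described in that comment could be sketched as follows. This is a hypothetical illustration, not code from this PR: the label key `workflows.argoproj.io/retry-requested` is invented for the example. The Server would only apply a JSON merge patch like this to the Workflow object, and the Controller, watching Workflows, would notice the label and perform the actual Pod work.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildLabelPatch builds a JSON merge patch that sets a single label on a
// Kubernetes object. In the pattern described above, the Server sends a patch
// like this to the Workflow, and the Controller reacts to the label change.
func buildLabelPatch(key, value string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": map[string]string{key: value},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// Hypothetical label name used purely for illustration.
	p, err := buildLabelPatch("workflows.argoproj.io/retry-requested", "true")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(p))
}
```

With this split, the Server only ever needs `patch` on Workflows, never on Pods, which keeps the RBAC separation intact.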
Hmmm, I see that `podsToDelete` already broke the separation a bit... we honestly may want to refactor that too...
@terrytangyuan do you have any thoughts on this? Specifically, the current behavior with the Server having Pod modification logic actually invalidates your answer in #12027 (comment). I think it ideally should behave the way you described there (and how I described above), if possible.
You are right. RBAC modifications are required for my answer in #12027 (comment).
@terrytangyuan I was actually looking for your thoughts on the approach. I think we should refactor this logic so that the Server performs nothing but a label modification, and the Controller then actually deletes, resets, etc. child Pods as needed (as that is the Controller's responsibility, not the Server's).
@terrytangyuan I agree with @agilgur5. The Server should not update or delete anything other than Argo Workflow CRDs. If a user directly updates the Workflow spec, this will not work; the Controller should handle cleaning up the Workflow's Pods based on the GC strategy.
I remember @ishitasequeira proposed deleting all of a Workflow's Pods label-based (e.g. by workflow name) once the Workflow completes.
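The label-based cleanup idea mentioned above could look roughly like this. It is a sketch, not Argo's implementation: `deleteCollectionFn` stands in for a client-go `DeleteCollection`-style call, while `workflows.argoproj.io/workflow` is the label Argo actually places on the Pods it creates.

```go
package main

import "fmt"

// deleteCollectionFn abstracts a bulk-delete API call (e.g. client-go's
// DeleteCollection). It is a stand-in for illustration purposes.
type deleteCollectionFn func(namespace, labelSelector string) error

// cleanupWorkflowPods deletes all of a workflow's pods in one call using the
// workflow-name label, rather than tracking individual pod names.
func cleanupWorkflowPods(del deleteCollectionFn, namespace, workflowName string) error {
	// Argo labels every pod it creates with
	// workflows.argoproj.io/workflow=<name>, so one selector covers them all.
	selector := fmt.Sprintf("workflows.argoproj.io/workflow=%s", workflowName)
	return del(namespace, selector)
}

func main() {
	fake := func(ns, sel string) error {
		fmt.Printf("DELETE pods in %s matching %s\n", ns, sel)
		return nil
	}
	_ = cleanupWorkflowPods(fake, "argo", "my-wf")
}
```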
Julie agreed in #12419 (comment) as well, so I filed #12538 as a tracking issue for this refactor
SGTM
```diff
@@ -299,6 +300,20 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req
 		}
 	}

+	for _, podName := range podsToReset {
```
I was gonna say that this may make sense to place into a helper function, but I see that #7988 (comment) explicitly moved the `kubeClient` logic out of the helper function (when retry was introduced for archived workflows in #7988).
So this matches the existing behavior with `podsToDelete` above. If we refactored these into the Controller (per the above comment), this would go away though 🤔
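For context, the inlined loop amounts to something like the sketch below. Identifiers other than `podsToReset` and the `workflows.argoproj.io/completed` label are invented here; in the real code the Server calls `kubeClient` directly rather than going through an abstraction like `podPatcher`.

```go
package main

import "fmt"

// podPatcher abstracts the kubeClient call made inside the loop; the real
// code would call something like kubeClient.CoreV1().Pods(ns).Patch(...).
type podPatcher func(namespace, podName string, patch []byte) error

// resetPods relabels each fulfilled node's pod with completed=false so that
// PodGC reconsiders it after the retry, mirroring the loop over podsToReset.
func resetPods(patch podPatcher, namespace string, podsToReset []string) error {
	labelPatch := []byte(`{"metadata":{"labels":{"workflows.argoproj.io/completed":"false"}}}`)
	for _, podName := range podsToReset {
		if err := patch(namespace, podName, labelPatch); err != nil {
			return fmt.Errorf("failed to patch pod %s: %w", podName, err)
		}
	}
	return nil
}

func main() {
	var patched []string
	fake := func(ns, name string, _ []byte) error {
		patched = append(patched, ns+"/"+name)
		return nil
	}
	if err := resetPods(fake, "argo", []string{"step-a", "step-b"}); err != nil {
		panic(err)
	}
	fmt.Println(patched) // [argo/step-a argo/step-b]
}
```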
Thanks for taking the time to investigate and fix this! Per the in-line comments, we may want to refactor some existing logic that this PR highlights as potentially having broken the Controller/Server separation of responsibilities. I think the current code more or less makes sense within that existing context (the re-processing checks in the Controller are a bit confusing, though that's partially due to the existing logic; retries are also one of the most complex parts of the codebase), but we may want to change that existing code.
Fixes #12028
Motivation
PodGC strategy: OnWorkflowSuccess.
Pods of the previously successful steps were not cleaned up after a manual workflow retry.
Modifications
Pods of fulfilled nodes will be relabeled completed=false when the workflow is manually retried.
Verification
Workflow Demo