-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Clean up pods of fulfilled nodes when manual retry. Fixes #12028 #12105
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,6 +29,7 @@ rules: | |
- list | ||
- watch | ||
- delete | ||
- patch | ||
- apiGroups: | ||
- "" | ||
resources: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,6 +29,7 @@ rules: | |
- list | ||
- watch | ||
- delete | ||
- patch | ||
- apiGroups: | ||
- "" | ||
resources: | ||
|
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,6 +17,7 @@ import ( | |
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
"k8s.io/apimachinery/pkg/labels" | ||
"k8s.io/apimachinery/pkg/selection" | ||
"k8s.io/apimachinery/pkg/types" | ||
|
||
"github.com/argoproj/argo-workflows/v3/persist/sqldb" | ||
workflowarchivepkg "github.com/argoproj/argo-workflows/v3/pkg/apiclient/workflowarchive" | ||
|
@@ -286,7 +287,7 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req | |
_, err = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Get(ctx, wf.Name, metav1.GetOptions{}) | ||
if apierr.IsNotFound(err) { | ||
|
||
wf, podsToDelete, err := util.FormulateRetryWorkflow(ctx, wf, req.RestartSuccessful, req.NodeFieldSelector, req.Parameters) | ||
wf, podsToDelete, podsToReset, err := util.FormulateRetryWorkflow(ctx, wf, req.RestartSuccessful, req.NodeFieldSelector, req.Parameters) | ||
if err != nil { | ||
return nil, sutils.ToStatusError(err, codes.Internal) | ||
} | ||
|
@@ -299,6 +300,20 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req | |
} | ||
} | ||
|
||
for _, podName := range podsToReset { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was gonna say that this may make sense to place into a helper function, but I see that #7988 (comment) explicitly moved the So this matches existing behavior with |
||
log.WithFields(log.Fields{"podReset": podName}).Info("Resetting pod") | ||
_, err := kubeClient.CoreV1().Pods(wf.Namespace).Patch( | ||
ctx, | ||
podName, | ||
types.MergePatchType, | ||
[]byte(`{"metadata": {"labels": {"workflows.argoproj.io/completed": "false"}}}`), | ||
metav1.PatchOptions{}, | ||
) | ||
if err != nil && !apierr.IsNotFound(err) { | ||
return nil, sutils.ToStatusError(err, codes.Internal) | ||
} | ||
} | ||
|
||
wf.ObjectMeta.ResourceVersion = "" | ||
wf.ObjectMeta.UID = "" | ||
result, err := wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Create(ctx, wf, metav1.CreateOptions{}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a pretty minor permission given the Server already has
delete
, but I was thinking this logic may make more sense on the Controller actually.When the Controller detects a retry, it can check the Workflow's child Pods.
Need to think a bit more about how that would work, but that would preserve the existing separation of duties between Server and Controller, where the Server is just a simple intermediary for users that can be bypassed with correct RBAC.
The Server primarily reads and listens to changes, and its modifications are limited to signaling the Controller to perform an action (via a label for instance). But the Controller logic is actually responsible to perform the actions themselves (such as retrying)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm I see that
podsToDelete
already broke the separation a bit... we honestly may want to refactor that too...@terrytangyuan do you have any thoughts on this? Specifically, the current behavior with the Server having Pod modification logic actually invalidates your answer in #12027 (comment). I think it ideally should behave the way you described there (and how I described above), if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. RBAC modifications is required for my answer in #12027 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@terrytangyuan I was actually looking for your thoughts on the approach. I think we should refactor this logic to not have the Server perform anything but a label modification, and then the Controller should actually delete, reset, etc child Pods as needed (as that is the Controller's responsibility, not the Server's)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@terrytangyuan I agree with @agilgur5. Server should not update or delete other than
Argo workflow
CRDs. If some user directly updates the WF spec then this will not work. Controller should handle the scenario to clean the workflow pods based on GCStrategy.I remember @ishitasequeira was proposed to delete all workflow pods "label based" like
workflowname
deletion if the workflow complete.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Julie agreed in #12419 (comment) as well, so I filed #12538 as a tracking issue for this refactor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM