feat(executor): Allow transient error when saving resource parameters #5180
Can you point me to the tests that verify this?
@alexec Yep, I'm waiting for #5166 to merge, and I'll probably do something similar to test an example. The transient error part may be non-trivial to test, though. This pattern is common throughout the codebase, but no e2e test exists yet unless I'm missing something. Any suggestions would be appreciated.
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
@@ -3,6 +3,9 @@ set -eu -o pipefail

./dist/argo delete -l workflows.argoproj.io/test

# Grant admin privileges for the default service account so we could test the examples that submit k8s resources.
kubectl create rolebinding default-admin --clusterrole=admin --serviceaccount=argo:default -n argo
@alexec This is only a temporary workaround so that CI has sufficient permissions to test this example. Is there a suggested or preferred approach for this going forward?
if exErr, ok := err.(*exec.ExitError); ok {
log.Errorf("`%s` stderr:\n%s", cmd.Args, string(exErr.Stderr))
var output string
err := waitutil.Backoff(ExecutorRetry, func() (bool, error) { |
@alexec This pattern is common throughout the codebase. waitutil.Backoff and argoerr.IsTransientErr are well covered by unit tests. However, there's no e2e test yet unless I'm missing something. The test in examples/k8s-jobs.yaml exercises part of this code path but not the transient error part. Is this sufficient? Any suggestions would be appreciated.
Low priority as we only had this problem 3 times. Closing for now.
We usually have long running tasks such as ML model training as resource type jobs (e.g. Kubeflow training CRDs) and we don't want the executor to fail due to transient errors at the very last phase when saving any of the parameters.
Signed-off-by: terrytangyuan terrytangyuan@gmail.com