Skip to content

Commit

Permalink
feat: Make retryPolicy saner in the presence of an expression (argopr…
Browse files Browse the repository at this point in the history
…oj#11005)

Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Saravanan Balasubramanian <33908564+sarabala1979@users.noreply.github.com>
  • Loading branch information
Joibel and sarabala1979 authored May 22, 2023
1 parent 1f6d1ba commit 310bb5a
Show file tree
Hide file tree
Showing 4 changed files with 510 additions and 115 deletions.
37 changes: 34 additions & 3 deletions docs/retries.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,16 +22,45 @@ spec:
args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
```
The `retryPolicy` and `expression` are re-evaluated after each attempt. For example, if you set `retryPolicy: OnFailure` and your first attempt produces a failure then a retry will be attempted. If the second attempt produces an error, then another attempt will not be made.

## Retry policies

Use `retryPolicy` to choose which failures to retry:
Use `retryPolicy` to choose which failure types to retry:

- `Always`: Retry all failed steps
- `OnFailure`: Retry steps whose main container is marked as failed in Kubernetes (this is the default)
- `OnFailure`: Retry steps whose main container is marked as failed in Kubernetes
- `OnError`: Retry steps that encounter Argo controller errors, or whose init or wait containers fail
- `OnTransientError`: Retry steps that encounter errors [defined as transient](https://github.com/argoproj/argo-workflows/blob/master/util/errors/errors.go), or errors matching the `TRANSIENT_ERROR_PATTERN` [environment variable](https://argoproj.github.io/argo-workflows/environment-variables/). Available in version 3.0 and later.

For example:
The `retryPolicy` applies even if you also specify an `expression`, but in version 3.5 or later the default policy means the expression makes the decision unless you explicitly specify a policy.

The default `retryPolicy` is `OnFailure`, except in version 3.5 or later when an expression is also supplied, when it is `Always`. This may be easier to understand in this diagram.

```mermaid
flowchart LR
start([Will a retry be attempted])
start --> policy
policy(Policy Specified?)
policy-->|No|expressionNoPolicy
policy-->|Yes|policyGiven
policyGiven(Expression Specified?)
policyGiven-->|No|policyGivenApplies
policyGiven-->|Yes|policyAndExpression
policyGivenApplies(Supplied Policy)
policyAndExpression(Supplied Policy AND Expression)
expressionNoPolicy(Expression specified?)
expressionNoPolicy-->|No|onfailureNoExpr
expressionNoPolicy-->|Yes|version
onfailureNoExpr[OnFailure]
onfailure[OnFailure AND Expression]
version(Workflows version)
version-->|3.4 or ealier|onfailure
always[Only Expression matters]
version-->|3.5 or later|always
```

An example retry strategy:

```yaml
apiVersion: argoproj.io/v1alpha1
Expand Down Expand Up @@ -67,6 +96,8 @@ access to the following variables:

If `expression` evaluates to false, the step will not be retried.

The `expression` result will be logical *and* with the `retryPolicy`. Both must be true to retry.

See [example](https://raw.githubusercontent.com/argoproj/argo-workflows/master/examples/retry-conditional.yaml) for usage.

## Back-Off
Expand Down
16 changes: 16 additions & 0 deletions pkg/apis/workflow/v1alpha1/workflow_types.go
Original file line number Diff line number Diff line change
Expand Up @@ -1941,6 +1941,22 @@ type RetryStrategy struct {
Expression string `json:"expression,omitempty" protobuf:"bytes,5,opt,name=expression"`
}

// RetryPolicyActual gets the active retry policy for a strategy.
// If the policy is explicit, use that.
// If an expression is given, use a policy of Always so the
// expression is all that controls the retry for 'least surprise'.
// Otherwise, if neither is given, default to retry OnFailure.
func (s RetryStrategy) RetryPolicyActual() RetryPolicy {
if s.RetryPolicy != "" {
return s.RetryPolicy
}
if s.Expression == "" {
return RetryPolicyOnFailure
} else {
return RetryPolicyAlways
}
}

// The amount of requested resource * the duration that request was used.
// This is represented as duration in seconds, so can be converted to and from
// duration (with loss of precision).
Expand Down
23 changes: 16 additions & 7 deletions workflow/controller/operator.go
Original file line number Diff line number Diff line change
Expand Up @@ -1010,7 +1010,7 @@ func (woc *wfOperationCtx) processNodeRetries(node *wfv1.NodeStatus, retryStrate

var retryOnFailed bool
var retryOnError bool
switch retryStrategy.RetryPolicy {
switch retryStrategy.RetryPolicyActual() {
case wfv1.RetryPolicyAlways:
retryOnFailed = true
retryOnError = true
Expand All @@ -1022,12 +1022,13 @@ func (woc *wfOperationCtx) processNodeRetries(node *wfv1.NodeStatus, retryStrate
retryOnFailed = true
retryOnError = true
}
case wfv1.RetryPolicyOnFailure, "":
case wfv1.RetryPolicyOnFailure:
retryOnFailed = true
retryOnError = false
default:
return nil, false, fmt.Errorf("%s is not a valid RetryPolicy", retryStrategy.RetryPolicy)
return nil, false, fmt.Errorf("%s is not a valid RetryPolicy", retryStrategy.RetryPolicyActual())
}
woc.log.Infof("Retry Policy: %s (onFailed: %v, onError %v)", retryStrategy.RetryPolicyActual(), retryOnFailed, retryOnError)

if (lastChildNode.Phase == wfv1.NodeFailed && !retryOnFailed) || (lastChildNode.Phase == wfv1.NodeError && !retryOnError) {
woc.log.Infof("Node not set to be retried after status: %s", lastChildNode.Phase)
Expand Down Expand Up @@ -2022,10 +2023,18 @@ func (woc *wfOperationCtx) executeTemplate(ctx context.Context, nodeName string,
// If retry policy is not set, or if it is not set to Always or OnError, we won't attempt to retry an errored container
// and we return instead.
retryStrategy := woc.retryStrategy(processedTmpl)
if retryStrategy == nil ||
(retryStrategy.RetryPolicy != wfv1.RetryPolicyAlways &&
retryStrategy.RetryPolicy != wfv1.RetryPolicyOnError &&
retryStrategy.RetryPolicy != wfv1.RetryPolicyOnTransientError) {
release := false
if retryStrategy == nil {
release = true
} else {
retryPolicy := retryStrategy.RetryPolicyActual()
if retryPolicy != wfv1.RetryPolicyAlways &&
retryPolicy != wfv1.RetryPolicyOnError &&
retryPolicy != wfv1.RetryPolicyOnTransientError {
release = true
}
}
if release {
woc.controller.syncManager.Release(woc.wf, node.ID, processedTmpl.Synchronization)
return node, err
}
Expand Down
Loading

0 comments on commit 310bb5a

Please sign in to comment.