
Feature: Configurable Failure Policy API #262

Closed
@danielvegamyhre

Description

What would you like to be added:
Right now, the JobSet FailurePolicy only allows specifying the max number of JobSet restarts to attempt before marking the JobSet as failed.

Users have requested that we also allow them to configure the failure conditions under which the JobSet will be restarted, or under which the JobSet should be failed immediately without going through the allowed number of restarts (see the "Example user stories" section below for why this is needed).

As an initial thought, we could define a field in the JobSet FailurePolicy that lets the user specify that the JobSet should respect the Job's podFailurePolicy (i.e., if a Job failed because a podFailurePolicy configured it to fail immediately without restart under certain conditions, then the JobSet should respect this and not restart either).

For example, in this list of common exit codes used by containers, we can see that container exit code 143 is used for containers killed by graceful termination (SIGTERM), which is the signal sent during maintenance events (via graceful node shutdown) as well as during workload preemption.

A user could configure a podFailurePolicy on their job to fail the job immediately on any exit code except 143. If the JobSet respected this, it would restart if the child job was killed by a SIGTERM (maintenance event), but would not restart if there was an application code error.

Why is this needed:

Example use cases / user story:

As a user, when I enqueue and run many JobSet workloads using Kueue, I'd like to be able to specify that the JobSet failure policy should restart the JobSet if the containers exited with a SIGTERM exit code (e.g., maintenance event or preemption), but NOT restart it if a container exited with an application code error. This avoids consuming valuable resources repeatedly attempting to run pods that are doomed to fail while other workloads sit idle in Kueue awaiting free resources to run.

Example API:

```go
type FailurePolicy struct {
	// FollowPodFailurePolicy, if set, will cause the JobSet to fail immediately
	// without restart if any of its failed child jobs failed due to matching a podFailurePolicy.
	FollowPodFailurePolicy *bool `json:"followPodFailurePolicy,omitempty"`

	// MaxRestarts defines the limit on the number of JobSet restarts.
	// A restart is achieved by recreating all active child jobs.
	MaxRestarts int32 `json:"maxRestarts,omitempty"`
}
```
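For illustration, here is a sketch of how the two proposed fields might be set together on a JobSet manifest (the metadata name, replicated job name, and image are placeholders; field placement assumes the FailurePolicy struct above):

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-run   # placeholder name
spec:
  failurePolicy:
    followPodFailurePolicy: true
    maxRestarts: 3
  replicatedJobs:
  - name: workers      # placeholder name
    template:
      spec:
        backoffLimit: 0
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: user-container
              operator: NotIn
              values: [143] # SIGTERM
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: user-container
              image: example/training-image  # placeholder image
```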

Example pod failure policy which the user could use in combination with .spec.failurePolicy.followPodFailurePolicy = true to allow JobSets to be restarted if the jobs were killed due to maintenance events, but not if the jobs failed due to application code errors or other bugs:

```yaml
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: user-container
        operator: NotIn
        values: [143] # SIGTERM
```

Labels: kind/feature