
Feature: Configurable Failure Policy API #262

Closed
@danielvegamyhre

Description

What would you like to be added:
Right now, the JobSet FailurePolicy only allows specifying the max number of JobSet restarts to attempt before marking the JobSet as failed.

Users have requested that we also allow them to configure the failure conditions under which the JobSet will be restarted, or under which the JobSet should be failed immediately without going through the allowed number of restarts (see the "Example user stories" section below for why this is needed).

As an initial thought, we could define a field in the JobSet FailurePolicy that lets the user specify that the JobSet should respect the Job's podFailurePolicy (i.e., if a Job failed because a podFailurePolicy configured it to fail immediately without restart under certain conditions, then the JobSet should respect this and not restart either).

For example, in this list of common exit codes used by containers, we can see that container exit code 143 is used for containers killed by graceful termination (SIGTERM), which is the signal sent during maintenance events (via graceful node shutdown) as well as during workload preemption.

A user could configure a podFailurePolicy on their job to fail the job immediately on any exit code except 143. If the JobSet respected this, it would restart if the child job was killed by a SIGTERM (maintenance event), but would not restart if there was an application code error.

Why is this needed:

Example use cases / user story:

As a user, when I enqueue and run many JobSet workloads using Kueue, I'd like to be able to specify that the JobSet failure policy should restart the JobSet if the containers exited with a SIGTERM exit code (e.g., maintenance event or preemption), but NOT restart it if a container exited with an application code error. This avoids consuming valuable resources repeatedly attempting to run pods that are doomed to fail while other workloads sit idle in Kueue awaiting free resources to run.

Example API:

```go
type FailurePolicy struct {
	// FollowPodFailurePolicy, if set, will cause the JobSet to fail immediately
	// without restart if any of its failed child jobs failed due to matching a podFailurePolicy.
	FollowPodFailurePolicy *bool `json:"followPodFailurePolicy,omitempty"`

	// MaxRestarts defines the limit on the number of JobSet restarts.
	// A restart is achieved by recreating all active child jobs.
	MaxRestarts int32 `json:"maxRestarts,omitempty"`
}
```
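For illustration, here is a sketch of how the two proposed fields might be set together on a JobSet manifest (the metadata name, replicated job name, and image are placeholders; field placement assumes the FailurePolicy struct above):

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-run   # placeholder name
spec:
  failurePolicy:
    followPodFailurePolicy: true
    maxRestarts: 3
  replicatedJobs:
  - name: workers      # placeholder name
    template:
      spec:
        backoffLimit: 0
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: user-container
              operator: NotIn
              values: [143] # SIGTERM
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: user-container
              image: example/training-image  # placeholder image
```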

Example pod failure policy which the user could use in combination with .spec.failurePolicy.followPodFailurePolicy = true to allow JobSets to be restarted if the jobs were killed due to maintenance events, but not if the jobs failed due to application code errors or other bugs:

```yaml
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: user-container
        operator: NotIn
        values: [143] # SIGTERM
```

Labels: kind/feature