-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP 262: Configurable Failure Policy API #381
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: danielvegamyhre The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't read the API yet but seeing a lot of GKE specifics in the examples.
Chatting with @danielvegamyhre in slack - I want to share our discussion here (and perhaps move up a level to a higher level picture): When thinking about success/failure of JobSet, we have several levels that this might operate on:
I do agree that just the global level is not ideal (I think this is a main thrust of this KEP) Likely the original implication of JobSet was to make it easy to wrap jobs together, and assume they fail together, but the use case has expanded beyond that to not want that anymore. As an example, I often use JobSet to run an indexedJob (some HPC app running in parallel) and then I have services that run alongside it, nicely on the same headless service provided by JobSet. If the service fails, I would just want the service to restart, and I wouldn't want my main compute/simulation/ML job to be failed. Within that, I might have logic on the indexed job that says something about what happens if a single index fails. So the "fail the entire JobSet if one component fails" design is not good for that use case. Then the question becomes where to add the logic - does it belong with Job or replicatedJob? I'm assuming we don't want to manage conflicts (the last case) for now. For having the logic on replicatedJob, that is theoretically adding additional context to a Job, so I kind of like that, but if we want the failure/success to work for Job outside of JobSet, then the first bullet makes most sense (and then replicated Job inherits) but that might be a harder change if it needs to go into Kubernetes proper. If we aren't able to OR it's much harder to make a change to there, then likely you'd put this logic on replicatedJob for now. So TLDR: we definitely have use cases for this. The challenging part, for me, is in terms of design - where to allow representing the policies, and how that is managed given conflict. |
Also just a high level observation - I understand ML (and ML jobs) are leading the industry space now, but I think it's dangerous to develop Kubernetes components that are optimized primarily for that. I think the vision here needs to be flexible that the future could (likely will be) different. |
Thanks @vsoch, I see in #381 (comment) that HPC were discussed, do you have feedback on whether or not the current proposal addresses them? Can you help suggest user stories to add to the KEP to evaluate if the proposed approach handles them and if not, discuss adjustments to the API? |
One additional idea I have is to add a If unset, it would default to type FailurePolicy struct {
// MaxRestarts defines the limit on the number of JobSet restarts.
// A restart is achieved by recreating all active child jobs.
MaxRestarts int32 `json:"maxRestarts,omitempty"`
// List of failure policy rules for this JobSet.
// For a given Job failure, the rules will be evaluated in order,
// and only the first matching rule will be executed.
Rules []FailurePolicyRule `json:"rules,omitempty"`
// The default action that is executed for any Job failure that does
// not match any of the failure policy rules.
// If unset, this defaults to "RestartJobSet"
DefaultAction FailurePolicyAction `json:"defaultAction,omitempty"`
} What do you think? |
I wouldn't add such thing. |
Hmm, can you elaborate on this a bit @alculquicondor? To me that seems like a more complicated way of achieving the same thing. |
Perhaps the last rule has empty OnConditionReason, so it matches any Failed condition. It's easier to document because rules apply in order. |
Re DefaultAction: I would keep the api simple. The default action is RestartJobSet, and if the user wants to change that, they can add a catch all rule at the end. |
/label tide/merge-method-squash |
/lgtm holding just in case, feel free to remove the hold Thanks Daniel! |
/hold cancel Thanks for the feedback everyone! |
// fails due to a reason listed in OnJobFailureReasons. | ||
type FailurePolicyRule struct { | ||
// The action to take if the rule is matched. | ||
// +kubebuilder:validation:Enum:=FailJobSet;RestartJobSetAndIgnoreMaxRestarts;FailJob;RestartJob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note for implementation: remove FailJob;RestartJob from kubebuilder marker. We'll add them back after upstream changes are in place to support these use cases.
Fixes #262