KEP 262: Configurable Failure Policy API #381

danielvegamyhre · 2024-01-19T18:57:19Z

Fixes #262

k8s-ci-robot · 2024-01-19T18:57:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [danielvegamyhre]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kannon92

Haven't read the API yet but seeing a lot of GKE specifics in the examples.

keps/262-ConfigurableFailurePolicy/README.md

keps/262-ConfigurableFailurePolicy/kep.yaml

examples/simple/configurable-failure-policy.yaml

vsoch · 2024-01-24T21:48:40Z

Chatting with @danielvegamyhre in Slack. I want to share our discussion here (and perhaps move up to a higher-level picture):

When thinking about success/failure of JobSet, we have several levels that this might operate on:

- Lowest level only: the failure/success policy is defined at the level of the Job (an indexed Job, for example) and honored there. The replicatedJob is just a shell that holds the Job and honors those policies, and the result trickles up from replicatedJob to JobSet.
- "Global": if anything in the JobSet fails, everything has to start again. This doesn't seem ideal to me, but it was certainly an OK starting approach before the full set of JobSet use cases was known.
- "Middle level": if something fails in a replicatedJob, take this action.
- "Multiple level": any of the above, with the need to resolve conflicting preferences (probably not ideal).

I do agree that just the global level is not ideal (I think this is a main thrust of this KEP). Likely the original intent of JobSet was to make it easy to wrap jobs together and assume they fail together, but the use cases have expanded beyond that. As an example, I often use JobSet to run an indexedJob (some HPC app running in parallel) with services that run alongside it, conveniently on the same headless service provided by JobSet. If a service fails, I would want just that service to restart, and I wouldn't want my main compute/simulation/ML job to be failed. Within that, I might have logic on the indexed job that says what happens if a single index fails.

So the "fail the entire JobSet if one component fails" design is not good for that use case. Then the question becomes where to add the logic - does it belong with Job or replicatedJob? I'm assuming we don't want to manage conflicts (the last case) for now.

Putting the logic on replicatedJob theoretically adds context to a Job, so I kind of like that. But if we want the failure/success policy to work for a Job outside of JobSet, then the first bullet makes the most sense (and replicatedJob inherits it), though that might be a harder change if it needs to go into Kubernetes proper. If we can't make that change there, or it's much harder to do so, then likely you'd put this logic on replicatedJob for now.

So TLDR: we definitely have use cases for this. The challenging part, for me, is the design: where to allow the policies to be represented, and how conflicts between them are managed.

vsoch · 2024-01-24T21:57:01Z

Also, just a high-level observation: I understand ML (and ML jobs) are leading the industry space now, but I think it's dangerous to develop Kubernetes components that are optimized primarily for that. I think the vision here needs to be flexible enough to allow that the future could be (and likely will be) different.

ahg-g · 2024-01-24T22:55:12Z

Thanks @vsoch. I see in #381 (comment) that HPC use cases were discussed. Do you have feedback on whether the current proposal addresses them? Can you help suggest user stories to add to the KEP so we can evaluate whether the proposed approach handles them and, if not, discuss adjustments to the API?

danielvegamyhre · 2024-02-09T21:23:16Z

One additional idea I have is to add a DefaultAction field to the failure policy, to make the default behavior explicit instead of implicit. It would make the default restart/failure recovery behavior clear in the API itself, so the user wouldn't need to read the docs or comments to understand what happens when no rule matches.

If unset, it would default to RestartJobSet so our current default behavior is preserved.

type FailurePolicy struct {
  // MaxRestarts defines the limit on the number of JobSet restarts.
  // A restart is achieved by recreating all active child jobs.
  MaxRestarts int32 `json:"maxRestarts,omitempty"`
  // List of failure policy rules for this JobSet.
  // For a given Job failure, the rules will be evaluated in order,
  // and only the first matching rule will be executed.
  Rules []FailurePolicyRule `json:"rules,omitempty"`
  // The default action that is executed for any Job failure that does
  // not match any of the failure policy rules.
  // If unset, this defaults to "RestartJobSet"
  DefaultAction FailurePolicyAction `json:"defaultAction,omitempty"`
}

What do you think?
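
To make the proposed semantics concrete, here is a minimal Go sketch of how a `DefaultAction` fallback could be resolved. It is not the JobSet controller's code: the `resolveAction` helper, the `Action` field name, and the simplified `FailurePolicyRule` shape are assumptions made for illustration. Only `FailurePolicy`, `DefaultAction`, `OnJobFailureReasons`, and the action names come from this thread.

```go
package main

import "fmt"

// FailurePolicyAction names the action taken when a child Job fails.
type FailurePolicyAction string

const (
	FailJobSet    FailurePolicyAction = "FailJobSet"
	RestartJobSet FailurePolicyAction = "RestartJobSet"
)

// FailurePolicyRule is a simplified stand-in for the KEP's rule type.
// The Action field name is an assumption for this sketch.
type FailurePolicyRule struct {
	Action              FailurePolicyAction
	OnJobFailureReasons []string
}

// FailurePolicy mirrors the struct proposed above, without JSON tags.
type FailurePolicy struct {
	MaxRestarts   int32
	Rules         []FailurePolicyRule
	DefaultAction FailurePolicyAction
}

// resolveAction is a hypothetical helper. It evaluates rules in order and
// returns the action of the first rule that lists the failure reason.
// If no rule matches, it falls back to DefaultAction, which itself
// defaults to RestartJobSet when unset.
func resolveAction(p FailurePolicy, reason string) FailurePolicyAction {
	for _, rule := range p.Rules {
		for _, r := range rule.OnJobFailureReasons {
			if r == reason {
				return rule.Action
			}
		}
	}
	if p.DefaultAction == "" {
		return RestartJobSet
	}
	return p.DefaultAction
}

func main() {
	policy := FailurePolicy{
		MaxRestarts: 3,
		Rules: []FailurePolicyRule{
			{Action: FailJobSet, OnJobFailureReasons: []string{"PodFailurePolicy"}},
		},
	}
	fmt.Println(resolveAction(policy, "PodFailurePolicy"))     // first rule matches
	fmt.Println(resolveAction(policy, "BackoffLimitExceeded")) // falls back to the default
}
```

With `DefaultAction` unset, the second call resolves to `RestartJobSet`, which preserves the current behavior.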

alculquicondor · 2024-02-09T21:38:11Z

I wouldn't add such a thing.
It might make more sense for the user to add a Rule at the end that matches anything.

danielvegamyhre · 2024-02-09T21:48:34Z

> I wouldn't add such a thing. It might make more sense for the user to add a Rule at the end that matches anything.

Hmm, can you elaborate on this a bit @alculquicondor? To me that seems like a more complicated way of achieving the same thing.

alculquicondor · 2024-02-09T21:58:44Z

Perhaps the last rule has empty OnConditionReason, so it matches any Failed condition.

It's easier to document because rules apply in order.

ahg-g · 2024-02-09T23:03:35Z

Re DefaultAction: I would keep the API simple. The default action is RestartJobSet, and if the user wants to change that, they can add a catch-all rule at the end.
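
The catch-all alternative can be sketched in Go as follows. This is an illustrative sketch, not the implemented API: it assumes a rule with an empty `OnJobFailureReasons` list matches any failure (the thread calls the field `OnConditionReason` in one place), and the `matches` and `resolveAction` helpers and the `Action` field name are hypothetical.

```go
package main

import "fmt"

// FailurePolicyAction names the action taken when a child Job fails.
type FailurePolicyAction string

const (
	FailJobSet    FailurePolicyAction = "FailJobSet"
	RestartJobSet FailurePolicyAction = "RestartJobSet"
)

// FailurePolicyRule is a simplified stand-in for the KEP's rule type.
type FailurePolicyRule struct {
	Action              FailurePolicyAction
	OnJobFailureReasons []string
}

// matches reports whether the rule applies to the given failure reason.
// Assumption: an empty reason list means "match any failure".
func (r FailurePolicyRule) matches(reason string) bool {
	if len(r.OnJobFailureReasons) == 0 {
		return true
	}
	for _, candidate := range r.OnJobFailureReasons {
		if candidate == reason {
			return true
		}
	}
	return false
}

// resolveAction evaluates rules in order; only the first match is used.
// With no match at all, the implicit default is RestartJobSet.
func resolveAction(rules []FailurePolicyRule, reason string) FailurePolicyAction {
	for _, rule := range rules {
		if rule.matches(reason) {
			return rule.Action
		}
	}
	return RestartJobSet
}

func main() {
	rules := []FailurePolicyRule{
		{Action: RestartJobSet, OnJobFailureReasons: []string{"BackoffLimitExceeded"}},
		{Action: FailJobSet}, // catch-all: no reasons listed, so it matches anything
	}
	fmt.Println(resolveAction(rules, "BackoffLimitExceeded")) // first rule wins
	fmt.Println(resolveAction(rules, "DeadlineExceeded"))     // caught by the last rule
}
```

Because rules apply in order, the catch-all must come last; placed earlier, it would shadow every rule after it.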

ahg-g · 2024-02-09T23:31:53Z

/label tide/merge-method-squash

ahg-g · 2024-02-09T23:39:13Z

/lgtm
/hold

holding just in case, feel free to remove the hold

Thanks Daniel!

danielvegamyhre · 2024-02-10T00:01:40Z

/hold cancel

Thanks for the feedback everyone!

danielvegamyhre · 2024-03-15T19:57:49Z

keps/262-ConfigurableFailurePolicy/README.md

+// fails due to a reason listed in OnJobFailureReasons.
+type FailurePolicyRule struct {
+  // The action to take if the rule is matched.
+  // +kubebuilder:validation:Enum:=FailJobSet;RestartJobSetAndIgnoreMaxRestarts;FailJob;RestartJob


Note for implementation: remove FailJob;RestartJob from the kubebuilder marker. We'll add them back after upstream changes are in place to support these use cases.
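
As a sketch of what that note implies, the rule type could look like this once the two values are dropped from the marker. The `Action` field name, its JSON tag, and the rest of the struct are assumptions, since the diff excerpt above stops at the marker line.

```go
package main

import "fmt"

// FailurePolicyAction names the action taken when a rule matches.
type FailurePolicyAction string

const (
	FailJobSet                        FailurePolicyAction = "FailJobSet"
	RestartJobSetAndIgnoreMaxRestarts FailurePolicyAction = "RestartJobSetAndIgnoreMaxRestarts"
)

// FailurePolicyRule is trimmed to the one field relevant here.
type FailurePolicyRule struct {
	// The action to take if the rule is matched.
	// FailJob and RestartJob are left out of the enum until upstream support lands.
	// +kubebuilder:validation:Enum:=FailJobSet;RestartJobSetAndIgnoreMaxRestarts
	Action FailurePolicyAction `json:"action"` // field name and tag assumed
}

func main() {
	rule := FailurePolicyRule{Action: FailJobSet}
	fmt.Println(rule.Action)
}
```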

k8s-ci-robot requested review from ahg-g and kannon92 January 19, 2024 18:57

k8s-ci-robot added the cncf-cla: yes, approved, and size/L labels Jan 19, 2024