Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP 262: Configurable Failure Policy API #381

Merged
merged 16 commits into from
Feb 10, 2024

Conversation

danielvegamyhre
Copy link
Contributor

Fixes #262

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 19, 2024
Copy link
Contributor

@kannon92 kannon92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't read the API yet but seeing a lot of GKE specifics in the examples.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 19, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 19, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 24, 2024
@vsoch
Copy link
Contributor

vsoch commented Jan 24, 2024

Chatting with @danielvegamyhre in slack - I want to share our discussion here (and perhaps move up a level to a higher level picture):

When thinking about success/failure of JobSet, we have several levels that this might operate on:

  • lowest level only we have the failure / success policy defined on the level of Job (and indexed job, for example) so then that is honored, and replicatedJob is just a shell to put that in (and honor those policies), and that trickles up to replicatedJob->JobSet.
  • "global": if something in the JobSet fails, we have to start everything again (this doesn't seem ideal to me) but certainly was an OK starting approach, before the full set of use cases was known for JobSet.
  • "middle level": if something fails in a replicatedJob, take this action.
  • "multiple level": any of the above, with needing to resolve conflicting preferences (probably not ideal)

I do agree that just the global level is not ideal (I think this is a main thrust of this KEP) Likely the original implication of JobSet was to make it easy to wrap jobs together, and assume they fail together, but the use case has expanded beyond that to not want that anymore. As an example, I often use JobSet to run an indexedJob (some HPC app running in parallel) and then I have services that run alongside it, nicely on the same headless service provided by JobSet. If the service fails, I would just want the service to restart, and I wouldn't want my main compute/simulation/ML job to be failed. Within that, I might have logic on the indexed job that says something about what happens if a single index fails.

So the "fail the entire JobSet if one component fails" design is not good for that use case. Then the question becomes where to add the logic - does it belong with Job or replicatedJob? I'm assuming we don't want to manage conflicts (the last case) for now.

For having the logic on replicatedJob, that is theoretically adding additional context to a Job, so I kind of like that, but if we want the failure/success to work for Job outside of JobSet, then the first bullet makes most sense (and then replicated Job inherits) but that might be a harder change if it needs to go into Kubernetes proper. If we aren't able to OR it's much harder to make a change to there, then likely you'd put this logic on replicatedJob for now.

So TLDR: we definitely have use cases for this. The challenging part, for me, is in terms of design - where to allow representing the policies, and how that is managed given conflict.

@vsoch
Copy link
Contributor

vsoch commented Jan 24, 2024

Also just a high level observation - I understand ML (and ML jobs) are leading the industry space now, but I think it's dangerous to develop Kubernetes components that are optimized primarily for that. I think the vision here needs to be flexible that the future could (likely will be) different.

@ahg-g
Copy link
Contributor

ahg-g commented Jan 24, 2024

Thanks @vsoch, I see in #381 (comment) that HPC were discussed, do you have feedback on whether or not the current proposal addresses them? Can you help suggest user stories to add to the KEP to evaluate if the proposed approach handles them and if not, discuss adjustments to the API?

@danielvegamyhre
Copy link
Contributor Author

danielvegamyhre commented Feb 9, 2024

One additional idea I have is to add a DefaultAction field to the failure policy, to make the default behavior explicit instead of implicit. I think it would make the default restart/failure recovery behavior more explicitly clear in the API, rather than the user needing to read the docs/comments to understand what happens if they don't set one.

If unset, it would default to RestartJobSet so our current default behavior is preserved.

type FailurePolicy struct {
  // MaxRestarts defines the limit on the number of JobSet restarts.
  // A restart is achieved by recreating all active child jobs.
  MaxRestarts int32 `json:"maxRestarts,omitempty"`
  // List of failure policy rules for this JobSet.
  // For a given Job failure, the rules will be evaluated in order,
  // and only the first matching rule will be executed.
  Rules []FailurePolicyRule `json:"rules,omitempty"`
  // The default action that is executed for any Job failure that does
  // not match any of the failure policy rules.
  // If unset, this defaults to "RestartJobSet"
  DefaultAction FailurePolicyAction `json:"defaultAction,omitempty"`
}

What do you think?

@alculquicondor
Copy link

I wouldn't add such thing.
It might make more sense for the user to add a Rule at the end that kind of matches anything

@danielvegamyhre
Copy link
Contributor Author

danielvegamyhre commented Feb 9, 2024

I wouldn't add such thing. It might make more sense for the user to add a Rule at the end that kind of matches anything

Hmm, can you elaborate on this a bit @alculquicondor? To me that seems like a more complicated way of achieving the same thing.

@alculquicondor
Copy link

Perhaps the last rule has empty OnConditionReason, so it matches any Failed condition.

It's easier to document because rules apply in order.

@ahg-g
Copy link
Contributor

ahg-g commented Feb 9, 2024

Re DefaultAction: I would keep the api simple. The default action is RestartJobSet, and if the user wants to change that, they can add a catch all rule at the end.

@ahg-g
Copy link
Contributor

ahg-g commented Feb 9, 2024

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Feb 9, 2024
@ahg-g
Copy link
Contributor

ahg-g commented Feb 9, 2024

/lgtm
/hold

holding just in case, feel free to remove the hold

Thanks Daniel!

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 9, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2024
@danielvegamyhre
Copy link
Contributor Author

/hold cancel

Thanks for the feedback everyone!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit 224bc02 into kubernetes-sigs:main Feb 10, 2024
9 checks passed
// fails due to a reason listed in OnJobFailureReasons.
type FailurePolicyRule struct {
// The action to take if the rule is matched.
// +kubebuilder:validation:Enum:=FailJobSet;RestartJobSetAndIgnoreMaxRestarts;FailJob;RestartJob
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note for implementation: remove FailJob;RestartJob from kubebuilder marker. We'll add them back after upstream changes are in place to support these use cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: Configurable Failure Policy API
7 participants