add another user story for the simplest use case
danielvegamyhre committed Feb 6, 2024
1 parent 9671531 commit ac89cc7
Showing 2 changed files with 63 additions and 10 deletions.
@@ -85,9 +85,13 @@ tags, and then generate with `hack/update-toc.sh`.
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Defaulting](#defaulting)
  - [Validation](#validation)
  - [Business logic](#business-logic)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
      - [Unit tests](#unit-tests)
@@ -199,6 +203,53 @@ bogged down.
-->

#### Story 1
As a user, I am using a JobSet to manage a group of jobs, and I want to be able to decide whether to fail the
JobSet or not, based on the exact container exit code that caused a child job failure.

**Example JobSet for this use case**:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: fail-jobset-example
spec:
  failurePolicy:
    rules:
    # If Job fails due to a pod failing with exit code 2, fail the JobSet immediately, without attempting any restarts.
    - action: FailJobSet
      targetReplicatedJobs:
      - workers
      onJobFailureReasons:
      - ExitCode2 # Matches Reason defined in .spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[0].setConditionReason
    maxRestarts: 10
  replicatedJobs:
  - name: workers
    replicas: 10
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        # If a pod fails with exit code 2, fail the job with the user-defined reason.
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: main
              operator: In
              values: [2]
            setConditionReason: "ExitCode2" # Matches Reason defined in .spec.failurePolicy.rules[0].onJobFailureReasons[0]
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: python:3.10
              command: ["..."]
```
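
The two-stage matching in this story can be sketched roughly as follows. This is an illustrative model in plain Python, not the Job or JobSet controller code: the class and function names are hypothetical, only `onExitCodes` rules are modeled, and it assumes that the first matching rule wins and that an empty target or reason list matches everything.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PodFailurePolicyRule:
    """Hypothetical stand-in for one podFailurePolicy rule (onExitCodes only)."""
    action: str
    container_name: Optional[str]
    operator: str  # "In" or "NotIn"
    values: List[int]
    set_condition_reason: Optional[str] = None


@dataclass
class FailurePolicyRule:
    """Hypothetical stand-in for one .spec.failurePolicy.rules entry."""
    action: str
    target_replicated_jobs: List[str] = field(default_factory=list)
    on_job_failure_reasons: List[str] = field(default_factory=list)


def job_failure_reason(rules, container, exit_code):
    """Stage 1: reason the Job would fail with, per the first matching rule.

    Assumes exit_code is the non-zero exit code of a failed container.
    """
    for rule in rules:
        if rule.container_name not in (None, container):
            continue
        in_values = exit_code in rule.values
        matched = in_values if rule.operator == "In" else not in_values
        if matched:
            # Only FailJob rules produce a user-defined failure reason.
            return rule.set_condition_reason if rule.action == "FailJob" else None
    return None


def select_action(rules, replicated_job, failure_reason):
    """Stage 2: JobSet action for a failed child Job, or None if no rule matches."""
    for rule in rules:
        targets_ok = (not rule.target_replicated_jobs
                      or replicated_job in rule.target_replicated_jobs)
        reasons_ok = (not rule.on_job_failure_reasons
                      or failure_reason in rule.on_job_failure_reasons)
        if targets_ok and reasons_ok:
            return rule.action
    return None


pod_rules = [PodFailurePolicyRule("FailJob", "main", "In", [2], "ExitCode2")]
jobset_rules = [FailurePolicyRule("FailJobSet", ["workers"], ["ExitCode2"])]

reason = job_failure_reason(pod_rules, "main", 2)
print(reason, select_action(jobset_rules, "workers", reason))  # ExitCode2 FailJobSet
```

What happens when no JobSet rule matches (for example, restarting up to `maxRestarts`) is left out of the sketch on purpose.
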
#### Story 2
As a user, I am using a JobSet to manage a group of jobs, each running an HPC simulation.
Each job runs a simulation with different random initial parameters. When a simulation ends, the
@@ -214,9 +265,7 @@ a failed state.
When a Job fails due to a pod failing with exit code 3, I want my job management software to restart the Job.
**Example JobSet with a Pod Failure Policy configuration for this use case**:

Note: spec.replicatedJobs.template.spec.podFailurePolicy
**Example JobSet for this use case**:
```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
@@ -233,13 +282,13 @@ spec:
      targetReplicatedJobs:
      - simulations
      onJobFailureReasons:
      - ExitCode2 # Matches Reason defined in .spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[0].reason
      - ExitCode2 # Matches Reason defined in .spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[0].setConditionReason
    # If Job fails due to a pod failing with exit code 3, restart that Job.
    - action: RestartJob
      targetReplicatedJobs:
      - simulations
      onJobFailureReasons:
      - ExitCode3 # Matches Reason defined in .spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[1].reason
      - ExitCode3 # Matches Reason defined in .spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[1].setConditionReason
    maxRestarts: 10
  replicatedJobs:
  - name: simulations
@@ -257,13 +306,13 @@ spec:
              containerName: main
              operator: In
              values: [2]
            reason: "ExitCode2" # Matches Reason defined in .spec.failurePolicy.rules[0].onJobFailureReasons[0]
            setConditionReason: "ExitCode2" # Matches Reason defined in .spec.failurePolicy.rules[0].onJobFailureReasons[0]
          - action: FailJob
            onExitCodes:
              containerName: main
              operator: In
              values: [3]
            reason: "ExitCode3" # Matches Reason defined in .spec.failurePolicy.rules[1].onJobFailureReasons[0]
            setConditionReason: "ExitCode3" # Matches Reason defined in .spec.failurePolicy.rules[1].onJobFailureReasons[0]
        template:
          spec:
            restartPolicy: Never
@@ -350,7 +399,7 @@ This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- `k8s.io/kubernetes/pkg/controller/job`: `02/05/2024` - `<test coverage>`
- `k8s.io/kubernetes/pkg/controller/job`: `02/05/2024` - `91.5%`

##### Integration tests

@@ -387,6 +436,11 @@ We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->

<!-- - <test>: <link to test coverage> -->
We will add a test case similar to the integration test case:

- When the feature flag is enabled and a Job's PodFailurePolicy triggers a Job failure because of a
matching PodFailurePolicyRule with the `SetConditionReason` field defined, check that the reason on the
`JobFailed` condition is the user-specified `SetConditionReason` value.
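
The assertion at the heart of such a test might look roughly like the sketch below. It is plain Python for illustration only; the helper name `failed_condition_reason` is hypothetical, and the status dictionary is a hand-written stand-in rather than output captured from a cluster. It relies on the `JobFailed` condition appearing in `.status.conditions` with type `Failed`.

```python
from typing import Optional


def failed_condition_reason(job_status: dict) -> Optional[str]:
    """Return the reason of the Job's `Failed` condition, if that condition is True."""
    for cond in job_status.get("conditions", []):
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return cond.get("reason")
    return None


# Hand-written stand-in for the .status of a Job failed by a rule with
# setConditionReason: "ExitCode2"; a real test would read it from the API server.
status = {
    "conditions": [
        {"type": "Failed", "status": "True", "reason": "ExitCode2"},
    ]
}

assert failed_condition_reason(status) == "ExitCode2"
```
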

### Graduation Criteria

@@ -1,4 +1,4 @@
title: KEP Template
title: Configurable Job failure reason for PodFailurePolicyRule
kep-number: 4443
authors:
- "@danielvegamyhre"
@@ -10,7 +10,6 @@ reviewers:
- "@kannon92"
approvers:
- "@alculquicondor"
- "@msau42"

see-also:
- "https://github.com/kubernetes-sigs/jobset/pull/381"