Commit
fix typos
mimowo committed Jan 23, 2023
1 parent c9ecade commit 6aa4aa7
Showing 1 changed file with 15 additions and 15 deletions.
30 changes: 15 additions & 15 deletions keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
@@ -230,7 +230,7 @@ thousands of nodes requires usage of pod restart policies in order
to account for infrastructure failures.

Currently, kubernetes Job API offers a way to account for infrastructure
-failures by setting `.backoffLimit > 0`. However, this mechanism intructs the
+failures by setting `.backoffLimit > 0`. However, this mechanism instructs the
job controller to restart all failed pods - regardless of the root cause
of the failures. Thus, in some scenarios this leads to unnecessary
restarts of many pods, resulting in a waste of time and computational
@@ -354,7 +354,7 @@ As a machine learning researcher, I run jobs comprising thousands
of long-running pods on a cluster comprising thousands of nodes. The jobs often
run at night or over weekend without any human monitoring. In order to account
for random infrastructure failures we define `.backoffLimit: 6` for the job.
-However, a signifficant portion of the failures happen due to bugs in code.
+However, a significant portion of the failures happen due to bugs in code.
Moreover, the failures may happen late during the program execution time. In
such case, restarting such a pod results in wasting a lot of computational time.
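
To make this user story concrete, below is a minimal sketch of a Job manifest combining `backoffLimit` with the `podFailurePolicy` API proposed in this KEP. The Job name, container name, image, and the use of exit code `42` are illustrative assumptions, not values taken from the KEP text:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job                 # illustrative name
spec:
  backoffLimit: 6                    # still tolerate a few infrastructure-related retries
  podFailurePolicy:
    rules:
    # Fail the whole Job on the first failure whose exit code signals a
    # software bug, instead of spending the backoff budget on retries.
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never           # the KEP discusses limiting exit-code rules to this policy
      containers:
      - name: main
        image: registry.example.com/training:latest   # placeholder image
```

With a rule like this, the first bug-induced failure fails the whole Job, while pods terminated by infrastructure disruptions can still be retried up to the `backoffLimit`.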

@@ -729,7 +729,7 @@ in different messages for pods.
- Reproduction: Run kube-controller-manager with disabled taint-manager (with the
flag `--enable-taint-manager=false`). Then, run a job with a long-running pod and
disconnect the node
-- Comments: handled by node lifcycle controller in: `controller/nodelifecycle/node_lifecycle_controller.go`.
+- Comments: handled by node lifecycle controller in: `controller/nodelifecycle/node_lifecycle_controller.go`.
However, the pod phase remains `Running`.
- Pod status:
- status: Unknown
@@ -863,7 +863,7 @@ dies) between appending a pod condition and deleting the pod.
In particular, scheduler can possibly decide to preempt
a different pod the next time (or none). This would leave a pod with a
condition that it was preempted, when it actually wasn't. This in turn
-could lead to inproper handling of the pod by the job controller.
+could lead to improper handling of the pod by the job controller.

As a solution we implement a worker, part of the disruption
controller, which clears the pod condition added if `DeletionTimestamp` is
@@ -1218,7 +1218,7 @@ the pod failure does not match any of the specified rules, then default
handling of failed pods applies.

If we limit this feature to use `onExitCodes` only when `restartPolicy=Never`
-(see: [limitting this feature](#limitting-this-feature)), then the rules using
+(see: [limiting this feature](#limitting-this-feature)), then the rules using
`onExitCodes` are evaluated only against the exit codes in the `state` field
(under `terminated.exitCode`) of `pod.status.containerStatuses` and
`pod.status.initContainerStatuses`. We may also need to check for the exit codes
@@ -1279,9 +1279,9 @@ the following scenarios will be covered with unit tests:
- handling of a pod failure, in accordance with the specified `spec.podFailurePolicy`,
when the failure is associated with
- a failed container with non-zero exit code,
-- a dedicated Pod condition indicating termmination originated by a kubernetes component
+- a dedicated Pod condition indicating termination originated by a kubernetes component
- adding of the `DisruptionTarget` by Kubelet in case of:
-- eviciton due to graceful node shutdown
+- eviction due to graceful node shutdown
- eviction due to node pressure
<!--
Additionally, for Alpha try to enumerate the core package you will be touching
@@ -1313,7 +1313,7 @@ The following scenarios will be covered with integration tests:
- pod failure is caused by a failed container with a non-zero exit code

More integration tests might be added to ensure good code coverage based on the
-actual implemention.
+actual implementation.

<!--
This question should be filled when targeting a release.
@@ -1406,8 +1406,8 @@ Below are some examples to consider, in addition to the aforementioned [maturity
- Simplify the code in job controller responsible for detection of failed pods
based on the fix for pods stuck in the running phase (see: [Marking pods as Failed](marking-pods-as-failed)).
- Discuss within the community (involving CNCF Technical Advisory Group for
-Runtime, SIG-node, container runtime implementations) the standarization of
-the CRI API to communicate an OOM kill occurrence by contatiner runtime to
+Runtime, SIG-node, container runtime implementations) the standardization of
+the CRI API to communicate an OOM kill occurrence by container runtime to
Kubelet. In particular, suggest that the API should allow to convey the reason
for OOM killer being invoked (to distinguish if the container was killed due
to exceeding its limits or due to system running low on memory).
@@ -1437,7 +1437,7 @@ N/A
An upgrade to a version which supports this feature should not require any
additional configuration changes. In order to use this feature after an upgrade
users will need to configure their Jobs by specifying `spec.podFailurePolicy`. The
-only noticeable difference in behaviour, without specifying `spec.podFailurePolicy`,
+only noticeable difference in behavior, without specifying `spec.podFailurePolicy`,
is that Pods terminated by kubernetes components will have an additional
condition appended to `status.conditions`.

@@ -1652,7 +1652,7 @@ Manual test performed to simulate the upgrade->downgrade->upgrade scenario:
- Scenario 2:
- Create a job with long-running containers and `backoffLimit=0`.
- Verify that the job continues after the node is uncordoned
-1. Disable the feature gates. Verify that the above scenarios result in default behaviour:
+1. Disable the feature gates. Verify that the above scenarios result in default behavior:
- In scenario 1: the job restarts pods failed with exit code `42`
- In scenario 2: the job is failed due to exceeding the `backoffLimit` as the failed pod failed during the draining
1. Re-enable the feature gates
@@ -1951,7 +1951,7 @@ techniques apply):
is an increase of the Job controller processing time.
- Inspect the Job controller's `job_pods_finished_total` metric for the
to check if the numbers of pod failures handled by specific actions (counted
-by the `failure_policy_action` label) agree with the expetations.
+by the `failure_policy_action` label) agree with the expectations.
For example, if a user configures job failure policy with `Ignore` action for
the `DisruptionTarget` condition, then a node drain is expected to increase
the metric for `failure_policy_action=Ignore`.
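
As a sketch of that example (assumed field values, not copied from the KEP), a policy such as the following would be expected to increment the `failure_policy_action=Ignore` count whenever pods are terminated by a node drain:

```yaml
podFailurePolicy:
  rules:
  # Do not count failures caused by disruptions (e.g. a node drain)
  # against the Job's backoffLimit.
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
      status: "True"
```

The per-action counts from the metric can then be compared with the number of pods expected to be affected by the drain.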
@@ -1961,7 +1961,7 @@

- 2022-06-23: Initial KEP merged
- 2022-07-12: Preparatory PR "Refactor gc_controller to do not use the deletePod stub" merged
-- 2022-07-14: Preparatory PR "efactor taint_manager to do not use getPod and getNode stubs" merged
+- 2022-07-14: Preparatory PR "Refactor taint_manager to do not use getPod and getNode stubs" merged
- 2022-07-20: Preparatory PR "Add integration test for podgc" merged
- 2022-07-28: KEP updates merged
- 2022-08-01: Additional KEP updates merged
@@ -1970,7 +1970,7 @@
- 2022-08-04: PR "Support handling of pod failures with respect to the configured rules" merged
- 2022-09-09: Bugfix PR for test "Fix the TestRoundTripTypes by adding default to the fuzzer" merged
- 2022-09-26: Prepared PR for KEP Beta update. Summary of the changes:
-- propsal to extend kubelet to add the following pod conditions when evicting a pod (see [Design details](#design-details)):
+- proposal to extend kubelet to add the following pod conditions when evicting a pod (see [Design details](#design-details)):
- DisruptionTarget for evictions due to graceful node shutdown, admission errors, node pressure or Pod admission errors
- ResourceExhausted for evictions due to OOM killer and exceeding Pod's ephemeral-storage limits
- extended the review of pod eviction scenarios by kubelet-initiated pod evictions:
