-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update for second Beta with GA criteria for "KEP-3329: Retriable and non-retriable Pod failures for Jobs" #3757
Update for second Beta with GA criteria for "KEP-3329: Retriable and non-retriable Pod failures for Jobs" #3757
Conversation
mimowo
commented
Jan 19, 2023
•
edited
Loading
edited
- One-line PR description: Update graduation criteria for GA
- Issue link: Retriable and non-retriable Pod failures for Jobs #3329
- Main changes:
- proposal to modify Kubelet to transition into Failed scheduled Pending pods which are Terminating (and with finalizer)
- refreshed GA graduation criteria
b9b7791
to
6aa4aa7
Compare
ad51a0a
to
d9271eb
Compare
078daec
to
0736384
Compare
d71f1e8
to
c3df52a
Compare
@bobbypage @SergeyKanzhelev I've updated the PR, addressing the comments. PTAL and let me know if there is something more requiring an update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks good from a PRR perspective, except we have added one question to the PRR this cycle:
https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md?plain=1#L723
Can you answer that please?
keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
Outdated
Show resolved
Hide resolved
keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
Outdated
Show resolved
Hide resolved
This scenario is a little bit different, the config is correct, but the | ||
pod is deleted by a user while in the `Pending` phase. In that case, the | ||
pods transition into the `Running` phase and fail soon after. With the proposed | ||
change the transition will happen earlier, thus saving resources. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'm not super clear why this scenario will result in phase=Failed being applied, but not the previous scenario. Not needed for the KEP doc itself, but I think we should understand what is going on here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this scenario the pod actually transitions to the Running
phase, but the container is killed, respending the grace period. If the container completes with exitCode 0
within the graceful period it actually ends the entire pod in Succeed state, otherwise it ends with exitCode 137
.
EDIT: updated the scenario text to mention the relationship with the graceful period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha, thanks for clarifying this in the doc.
I have one more comment about the terminology used in the KEP (terminating vs deleting), but LGTM on the content, thank you for all of the updates and clarifying the scenarios. |
250f603
to
570d0f3
Compare
570d0f3
to
439c237
Compare
Done, PTAL. The feature only introduces an additional API PATCH call, what is answered in the other point. |
aca8704
to
0527245
Compare
0527245
to
8ce4cde
Compare
Ok, PRR looks good and is approved (no prow command needed this time, but want to make it explicit for the enhancements team). |
Thanks for all the updates and clarifications in the KEP! /lgtm |
Thanks @bobbypage I also reviewed it. /lgtm |
@bobbypage @dchen1107 thank you for completing the review! |