-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4781: Fix inconsistent container start and ready state after kubelet restart #4784
base: master
Are you sure you want to change the base?
Conversation
@pololowww: GitHub didn't allow me to request PR reviews from the following users: chenk008. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Welcome @pololowww! |
Hi @pololowww. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/cc @bart0sh @SergeyKanzhelev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @matthyx
Thank you for your PR. Please sign the CLA to proceed further, thanks. |
``` | ||
##### Changes in probe | ||
|
||
We will update the UpdatePodStatus function to: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you can also properly set started
for a container (ref kubernetes/kubernetes#115553)
We will update the UpdatePodStatus function to: | ||
|
||
1. Track containers with readiness probes. | ||
2. Preserve the previous ready state for containers with readiness probes that haven't run yet after kubelet restart. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how you plan to persist ready states
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from API server - read the state of Pods as an initial state for probe managers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the ready state will be achieved from pod.Status.Conditions.
I have modified the details of my KEP to support inheritance of the Start and Ready states, and reconsidered the scenarios where the kubelet startup time exceeds the period. Updated in the new commit already. |
## Drawbacks | ||
|
||
<!-- | ||
Why should this KEP _not_ be implemented? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hypothetically, if an edge case occurs where a pod that was ready transitions to an unready state when kubelet restarts, and we choose to treat it as still ready, the service's endpoints would still include this pod. This could lead to network traffic being directed to a pod that has encountered an issue.
This KEP is not without risks. We should articulate its drawbacks here and explain why, despite these drawbacks, we are still choosing to proceed with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestions:-). Risks or drawbacks are already mentioned in my new commit. We also plan to trigger the probe immediately to reduce the risks.
BTW this is my first to write a KEP and i am not sure how detailed it should be.
|
||
##### Changes in kubelet_pods.go | ||
|
||
We will be updating the `convertToAPIContainerStatuses` function, which is responsible for calculating the API-recognizable v1.ContainerStatus based on the old and current container states. Our modification will involve preserving the Started status from the oldStatus when the container's start time is earlier than the kubelet's start time. This will facilitate the subsequent `UpdatePodStatus` to inherit the Started state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we obtain the startup time of kubelet, could you briefly explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the startup time of Kubelet probeManager is more accurate(worker.probeManager.start
). Modified already in the new commit:)
extending the production code to implement this enhancement. | ||
--> | ||
|
||
- `<package>`: `<date>` - `<test coverage>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should add your test plan. I assume you've already implemented a POC, so this should be easy for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still applies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, this feature will definitely need strong test coverage and may not be straightforward to test. we should start planning that early.
/cc |
/cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am super excited to see this finally being addressed. THANK YOU!
|
||
##### Changes in kubelet_pods.go | ||
|
||
We will be updating the `convertToAPIContainerStatuses` function, which is responsible for calculating the API-recognizable v1.ContainerStatus based on the old and current container states. Our modification will involve preserving the Started status from the oldStatus when the container's start time is earlier than the kubelet's start time. This will facilitate the subsequent `UpdatePodStatus` to inherit the Started state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little confused about the use of time as an indicator here, but I admit this is not an area of the code I am familiar with.
The way I think about this:
When kubelet starts up, it reads the list of intended pods (and their last known state) from apiserver and it reads the list of actual pods from the CRI.
If it finds a pod in the API that is not running, it starts the pod (yes, I know there's nuance here, especially around state) and updates status.
If it finds a running pod that the API does not have (yes, include mirror pods in that) it kills the pod and cleans up.
If it finds a running pod which corresponds to a pod in the API, it should try to resume the state-machine where the last state indicates. If it was Ready, it stays ready until the probes fail. If it was Unready, it stays unready until the probes succeed. This is the main issue at hand in this KEP, right?
Where does comparing timestamps come in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your understanding is correct. The comparison of timestamps occurs before checking all containers. It mainly serves two purposes:
- to narrow down the comparison scope
- to avoid cases where the previous container has been restarted. If the container's start time is later than the Kubelet, we will no longer retain the previous state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still unclear to me, and I want to make sure the spec is as clear as possible.
- to narrow down the comparison scope
What does this mean?
- to avoid cases where the previous container has been restarted. If the container's start time is later than the Kubelet, we will no longer retain the previous state.
Let me see if I can construct the case you're worried about, please tell me if I am missing it.
t0: container is running and Ready
t1: kubelet goes down
t2: container crashes
t3: CRI impl restarts the container (which is, practically, NotReady)
t4: kubelet comes back up and finds the container running
t5: kubelet does not yet have probe results, so it leaves the container as Ready
t6: eventually the probe shows the container as NotReady
Is that right? If so, I am really not very worried about it.
Referencing myself (sorry) in kubernetes/kubernetes#100277 (comment) : In a case like this the pod stayed in service a little longer than it should have. This is not a huge deal and happens any time there is a "surprise" change (e.g. the node crashed).
Additionally, the pod start time isn't the only relevant indicator. That timeline could just as easily be:
t0: container is running and Ready
t1: kubelet goes down
t2: container becomes unready (probes would fail, but app does not crash)
t3: kubelet comes back up and finds the container running
t4: kubelet does not yet have probe results, so it leaves the container as Ready
t5: eventually the probe shows the container as NotReady
There's no timestamp you can compare (I think?) that detects that situation, and the whole reason I think we need to change this behavior is because "assuming the worst" (setting it to NotReady) is way more expensive and problematic than "wait and see".
Upon a kubelet restart, we should probably prioritize "run probes ASAP".
TL;DR - I think that kubelet should only set the status when it actually HAS information, not as a side-effect of anything else. If you don't KNOW a pod is unready (by probes or some other explicit mechanism) then you shouldn't set it unready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for not replying recently :( When updating the pod status, we iterate through all the containers within the pod to identify the containers whose states need to be inherited. Therefore, directly skipping containers with a later start time can narrow down the comparison scope.
Regarding the first scenario you described and your considerations, I think your understanding is reasonable. It's indeed not a big issue if the pod's state is retained slightly longer.
In the second scenario, we plan to immediately trigger the probes to let them determine the final state of the pod.
In summary, I think the comparison of the start times can be kept for the purpose of narrowing down the scope. Perhaps I shouldn't have emphasized this point in the KEP. What do you think?
/ok-to-test |
Can you also add a prod-readiness file in https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-node named |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: pololowww The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@pololowww did you mean to close this PR? |
@SergeyKanzhelev No, i just wanted to close a comment and maybe made some mistakes... Can we reopen it? I need to merge this PR actually. |
@pololowww Could you please fix the failing test? You just need to run make update-toc locally and submit the updated content. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[PRR Shadow]
extending the production code to implement this enhancement. | ||
--> | ||
|
||
- `<package>`: `<date>` - `<test coverage>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, this feature will definitely need strong test coverage and may not be straightforward to test. we should start planning that early.
- [ ] Feature gate (also fill in values in `kep.yaml`) | ||
- Feature gate name: `PodUnreadyOnKubeletRestart` | ||
- Components depending on the feature gate: `kubelet` | ||
- [ ] Other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete?
|
||
Due to the use of a feature gate, the feature can be disabled by setting the gate to false. | ||
|
||
###### What happens if we reenable the feature if it was previously rolled back? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answer this one?
### Rollout, Upgrade and Rollback Planning | ||
|
||
<!-- | ||
This section must be completed when targeting beta to a release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it must be completed only by Beta, I'd encourage thinking about beta/GA requirements sooner than later.
[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/ | ||
--> | ||
|
||
- [ ] Feature gate (also fill in values in `kep.yaml`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- [ ] Feature gate (also fill in values in `kep.yaml`) | |
- [x] Feature gate (also fill in values in `kep.yaml`) |
feature gate after having objects written with the new field) are also critical. | ||
You can take a look at one potential example of such test in: | ||
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282 | ||
--> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add something here. At the very minimum, for this feature, we should include tests that verify the behavior with the feature turned off.
Quick reminder that enhancement freeze is today and there is outstanding Production Readiness Review feedback that would need to be addressed for this to be approved. |
Hey @pololowww, this is a really excellent first time KEP and a lot of folks are really excited about it. I saw you haven't been around for a few months, but we'd love to help you drive this through to merge. Is this still active on your side / would you like some support in closing this out? |
/cc @thockin