
KEP-4781: Fix inconsistent container start and ready state after kubelet restart #4784

Open · wants to merge 4 commits into `master`
Conversation

@pololowww commented Aug 9, 2024

  • One-line PR description: Add a new KEP to fix the inconsistent container start and ready state after a kubelet restart

@k8s-ci-robot (Contributor):

@pololowww: GitHub didn't allow me to request PR reviews from the following users: chenk008.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

  • One-line PR description: Adding new KEP to fix inconsistent container ready state after kubelet restart

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

linux-foundation-easycla bot commented Aug 9, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Aug 9, 2024
@k8s-ci-robot (Contributor):

Welcome @pololowww!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 9, 2024
@k8s-ci-robot (Contributor):

Hi @pololowww. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 9, 2024
@pacoxu (Member) commented Aug 9, 2024:

/cc @bart0sh @SergeyKanzhelev
/assign @mrunalp

@pacoxu (Member) left a comment:

/cc @matthyx

@bart0sh (Contributor) commented Aug 9, 2024:

Thank you for your PR. Please sign the CLA to proceed further.

> ##### Changes in probe
>
> We will update the `UpdatePodStatus` function to:

Contributor:

Maybe you can also properly set `started` for a container (ref. kubernetes/kubernetes#115553).

> We will update the `UpdatePodStatus` function to:
>
> 1. Track containers with readiness probes.
> 2. Preserve the previous ready state for containers with readiness probes that haven't run yet after kubelet restart.
Contributor:

I wonder how you plan to persist the ready states.

Member:

From the API server: read the state of the Pods as the initial state for the probe managers.

Author:

Yes, the ready state will be retrieved from `pod.Status.Conditions`.

@pololowww (Author):

I have modified the details of my KEP to support inheriting the Start and Ready states, and reconsidered the scenarios where the kubelet startup time exceeds the period. This is already updated in the new commit.
Generally, when containerStartTime is before the kubelet's start time, we default to inheriting the previous container Start status, and the Ready status is retrieved from the API server. If there are any unreasonable aspects to consider, please let me know :-)

@pololowww pololowww changed the title KEP-4781: Fix inconsistent container ready state after kubelet restart KEP-4781: Fix inconsistent container start and ready state after kubelet restart Aug 14, 2024
keps/sig-node/4781-kublet-restart-pod-status/README.md (outdated, resolved)
> ## Drawbacks
>
> <!--
> Why should this KEP _not_ be implemented?

Contributor:

Hypothetically, if an edge case occurs where a pod that was ready transitions to an unready state when kubelet restarts, and we choose to treat it as still ready, the service's endpoints would still include this pod. This could lead to network traffic being directed to a pod that has encountered an issue.

This KEP is not without risks. We should articulate its drawbacks here and explain why, despite these drawbacks, we are still choosing to proceed with it.

Author:

Thanks for your suggestions :-) The risks and drawbacks are already mentioned in my new commit. We also plan to trigger the probe immediately to reduce the risks.
By the way, this is my first time writing a KEP and I am not sure how detailed it should be.


> ##### Changes in kubelet_pods.go
>
> We will update the `convertToAPIContainerStatuses` function, which is responsible for calculating the API-recognizable `v1.ContainerStatus` from the old and current container states. Our modification will preserve the Started status from the oldStatus when the container's start time is earlier than the kubelet's start time. This allows the subsequent `UpdatePodStatus` to inherit the Started state.
Contributor:

How can we obtain the startup time of the kubelet? Could you briefly explain?

@pololowww (Author), Aug 15, 2024:

Actually, the startup time of the kubelet's probeManager is more accurate (`worker.probeManager.start`). This is already modified in the new commit :)

> extending the production code to implement this enhancement.
> -->
>
> - `<package>`: `<date>` - `<test coverage>`
Contributor:

You should add your test plan. I assume you've already implemented a POC, so this should be easy for you.

Contributor:

This still applies.

Member:

+1, this feature will definitely need strong test coverage and may not be straightforward to test. We should start planning that early.

@ffromani (Contributor):

/cc

@songminglong:
/cc

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Aug 23, 2024
@thockin (Member) left a comment:

I am super excited to see this finally being addressed. THANK YOU!


> ##### Changes in kubelet_pods.go
>
> We will update the `convertToAPIContainerStatuses` function, which is responsible for calculating the API-recognizable `v1.ContainerStatus` from the old and current container states. Our modification will preserve the Started status from the oldStatus when the container's start time is earlier than the kubelet's start time. This allows the subsequent `UpdatePodStatus` to inherit the Started state.
Member:

I am a little confused about the use of time as an indicator here, but I admit this is not an area of the code I am familiar with.

The way I think about this:

When kubelet starts up, it reads the list of intended pods (and their last known state) from apiserver and it reads the list of actual pods from the CRI.

If it finds a pod in the API that is not running, it starts the pod (yes, I know there's nuance here, especially around state) and updates status.

If it finds a running pod that the API does not have (yes, include mirror pods in that) it kills the pod and cleans up.

If it finds a running pod which corresponds to a pod in the API, it should try to resume the state-machine where the last state indicates. If it was Ready, it stays ready until the probes fail. If it was Unready, it stays unready until the probes succeed. This is the main issue at hand in this KEP, right?

Where does comparing timestamps come in?

Author:

Your understanding is correct. The comparison of timestamps occurs before checking all containers. It mainly serves two purposes:

  1. to narrow down the comparison scope
  2. to avoid cases where the previous container has been restarted. If the container's start time is later than the kubelet's start time, we will no longer retain the previous state.

Member:

This is still unclear to me, and I want to make sure the spec is as clear as possible.

  1. to narrow down the comparison scope

What does this mean?

  2. to avoid cases where the previous container has been restarted. If the container's start time is later than the kubelet's start time, we will no longer retain the previous state.

Let me see if I can construct the case you're worried about, please tell me if I am missing it.

t0: container is running and Ready
t1: kubelet goes down
t2: container crashes
t3: CRI impl restarts the container (which is, practically, NotReady)
t4: kubelet comes back up and finds the container running
t5: kubelet does not yet have probe results, so it leaves the container as Ready
t6: eventually the probe shows the container as NotReady

Is that right? If so, I am really not very worried about it.

Referencing myself (sorry) in kubernetes/kubernetes#100277 (comment) : In a case like this the pod stayed in service a little longer than it should have. This is not a huge deal and happens any time there is a "surprise" change (e.g. the node crashed).

Additionally, the pod start time isn't the only relevant indicator. That timeline could just as easily be:

t0: container is running and Ready
t1: kubelet goes down
t2: container becomes unready (probes would fail, but app does not crash)
t3: kubelet comes back up and finds the container running
t4: kubelet does not yet have probe results, so it leaves the container as Ready
t5: eventually the probe shows the container as NotReady

There's no timestamp you can compare (I think?) that detects that situation, and the whole reason I think we need to change this behavior is because "assuming the worst" (setting it to NotReady) is way more expensive and problematic than "wait and see".

Upon a kubelet restart, we should probably prioritize "run probes ASAP".

TL;DR - I think that kubelet should only set the status when it actually HAS information, not as a side-effect of anything else. If you don't KNOW a pod is unready (by probes or some other explicit mechanism) then you shouldn't set it unready.

Author:

Sorry for not replying recently :( When updating the pod status, we iterate through all the containers within the pod to identify the containers whose states need to be inherited. Therefore, directly skipping containers with a later start time narrows down the comparison scope.
Regarding the first scenario you described and your considerations, I think your understanding is reasonable. It's indeed not a big issue if the pod's state is retained slightly longer.
In the second scenario, we plan to immediately trigger the probes and let them determine the final state of the pod.

In summary, I think the comparison of the start times can be kept for the purpose of narrowing down the scope. Perhaps I shouldn't have emphasized this point in the KEP. What do you think?

@pacoxu (Member) commented Sep 20, 2024:

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 20, 2024
@haircommander (Contributor):

Can you also add a prod-readiness file in https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-node named 4781.yaml and choose a PRR reviewer to do PRR review (similar to https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/sig-node/127.yaml, but you only need alpha). Don't worry too much about which reviewer you choose, as they'll internally load balance.

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pololowww
Once this PR has been reviewed and has the lgtm label, please ask for approval from mrunalp and additionally assign soltysh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pololowww pololowww closed this Oct 4, 2024
@SergeyKanzhelev (Member):

@pololowww did you mean to close this PR?

@pololowww (Author):

> @pololowww did you mean to close this PR?

@SergeyKanzhelev No, I just wanted to close a comment and may have made a mistake... Can we reopen it? I actually need to merge this PR.

@pololowww pololowww reopened this Oct 5, 2024
@HirazawaUi (Contributor):

@pololowww Could you please fix the failing test? You just need to run `make update-toc` locally and submit the updated content.

@BenTheElder (Member) left a comment:

[PRR Shadow]

> extending the production code to implement this enhancement.
> -->
>
> - `<package>`: `<date>` - `<test coverage>`
Member:

+1, this feature will definitely need strong test coverage and may not be straightforward to test. We should start planning that early.

> - [ ] Feature gate (also fill in values in `kep.yaml`)
>   - Feature gate name: `PodUnreadyOnKubeletRestart`
>   - Components depending on the feature gate: `kubelet`
> - [ ] Other
Member:

delete?


> Due to the use of a feature gate, the feature can be disabled by setting the gate to false.
>
> ###### What happens if we reenable the feature if it was previously rolled back?
Member:

Answer this one?

> ### Rollout, Upgrade and Rollback Planning
>
> <!--
> This section must be completed when targeting beta to a release.
Member:

While it must be completed only by Beta, I'd encourage thinking about beta/GA requirements sooner than later.

> [existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
> -->
>
> - [ ] Feature gate (also fill in values in `kep.yaml`)
Contributor:

Suggested change (mark the feature gate checkbox as checked):

> - [x] Feature gate (also fill in values in `kep.yaml`)

> feature gate after having objects written with the new field) are also critical.
> You can take a look at one potential example of such test in:
> https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
> -->
Contributor:

Let's add something here. At the very minimum, for this feature, we should include tests that verify the behavior with the feature turned off.

@jpbetz (Contributor) commented Oct 10, 2024:

Quick reminder that enhancement freeze is today and there is outstanding Production Readiness Review feedback that would need to be addressed for this to be approved.

Labels:
- cncf-cla: yes (the PR's author has signed the CNCF CLA)
- kind/kep (categorizes KEP tracking issues and PRs modifying the KEP directory)
- ok-to-test (a non-member PR verified by an org member as safe to test)
- sig/apps (relevant to SIG Apps)
- sig/node (relevant to SIG Node)
- size/XL (changes 500-999 lines, ignoring generated files)
Projects: In Progress