KEP-3085: Add KEP for Pod network ready condition #3087
Conversation
I think this would be useful since it can be hard to determine why pods are not running if there are errors during `CreatePodSandbox` - kubernetes/kubernetes#104635
/assign
Some more info here: kubernetes/kubernetes#105984
It sounds reasonable and useful; a few minor nitpicks, though.
/assign @derekwaynecarr
/cc
If we changed this condition to `PodHasNetwork`, would that satisfy your use case?
The attributes implied by a "sandbox" are vague, but saying the pod has an IP address is a clearer condition for future evolution.
In the stop-pod case, it would map cleanly, as the pod no longer has a network.
Thoughts?
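For concreteness, the suggested condition might appear in pod status roughly as follows. This is only a sketch: the condition name follows the suggestion above, and the field values are illustrative, not taken from the KEP.

```yaml
status:
  conditions:
  - type: PodHasNetwork                        # suggested name; illustrative
    status: "True"
    lastProbeTime: null
    lastTransitionTime: "2022-07-05T10:22:04Z" # illustrative timestamp
```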
> operators (especially of multi-tenant clusters) who are responsible for
> configuration and operational aspects of the various components that play a role
> in pod sandbox creation: CSI plugins, CRI runtime and associated runtime
> handlers, CNI plugins, etc. The duration between `lastTransitionTime` field of
I am trying to get clarity on when this condition is set to true, to call out what is or is not missed.
When a kubelet sees a pod that must be started:
1. It creates the pod-level cgroup.
2. It creates the `etc-hosts` file for DNS configuration.
3. It creates the data directories on the local host (emptyDir, etc.).
4. It waits for volumes to attach and mount for the associated pod.
5. It fetches the pull secret associated with the pod (if any), used to pull its container images.
6. It requests the CRI runtime to create the pod sandbox.
7. It requests the CRI to pull a container image if not already present.
8. It creates pod containers according to the pod spec.
9. It starts pod containers according to the pod spec.
If I understand the proposal, this condition would go TRUE once step 6 is satisfied.
This means any SLI/SLO held by a cluster administrator is subject to issues caused by the pod author/deployer:
- A referenced secret or configmap is not available.
- A referenced pull secret is not available.
Would you want to message those conditions differently?
> around pod initialization to their customers who launch workloads on the
> cluster.
>
> Custom pod controllers/operators can use a dedicated condition indicating
I think the only thing you can pro-actively do is delete a pod that is not having its Sandbox go ready, but the underlying reason for why it did not go ready may vary by reason (secret does not exist, configmap does not exist, pull secret is invalid, volumes cannot be attached). Let me know if I am misunderstanding.
Deleting the pod and re-creating it is absolutely necessary if a pod is not coming up - there is no avoiding that. What I was trying to allude to here is: for controllers whose custom resources specify other dependent resources (like VolumeClaimTemplates to generate a PVC that the pod would mount), the extra condition can serve as a signal for whether to also delete the dependent resources (like the PVCs that the failing pod mounts) and re-create them, versus just re-creating the pod that is failing to come up. If the new condition is reporting `true` for failed pods, pod re-creation is all that needs to be attempted. If the new condition is reporting `false`, there is most likely a deeper underlying problem - for example, PVs (bound to the PVCs spun up based on templates) are not attaching/mounting, and the controller may want to recreate those as well to get the pod to come up successfully.
The new condition enables controllers to make finer distinctions about the optimal strategy for retrying when a pod fails to come up.
> controller can leave PVCs intact and only recreate pods if sandbox creation
> completes successfully but the pod's containers fail to become ready.
>
> When a pod's sandbox no longer exists, the `status` of `SandboxReady` condition
When a pod is stopped (it has reached a terminal phase) but is not yet deleted, the sandbox is still present and past container logs can be accessed. In this state, is `SandboxReady` true or false?
As long as the sandbox is present, the new condition should report true.
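So a terminal-but-not-yet-deleted pod whose sandbox is still around might report a status roughly like this (a sketch; the condition name matches this draft of the KEP, which the PR later retitles to `PodHasNetwork`, and the values are illustrative):

```yaml
status:
  phase: Succeeded        # terminal phase; pod not yet deleted
  conditions:
  - type: SandboxReady    # name in this draft; later renamed PodHasNetwork
    status: "True"        # sandbox still present, so the condition stays true
```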
We might also want to define how to represent this condition if using a CNI that implements NetworkPolicy. That might be better left to a separate condition, though, to avoid holding up this enhancement?
Setting up the network concludes the pod "sandbox" creation process coordinated between Kubelet => CRI runtime => CNI plugins. Thus, in the context of this KEP, the new condition marks the completion of that overall process.
If we want more granular network-related conditions (which is not within the scope of this KEP), individual CNI plugins (with an API server client) can mark any necessary conditions on the pod using extended conditions.
Note that until all CNI plugins in the configured chain succeed, the CRI runtime won't return an "in-process"/intermediate outcome to Kubelet. Either all networking configuration and IP allocation succeeded (i.e. networking is fully up for the pod), or network configuration failed and the entire CRI sandbox creation + network configuration has to be retried.
@derekwaynecarr I have updated the KEP with the suggested `PodHasNetwork` condition.
/retitle KEP-3085: Add KEP for Pod network ready condition
Hope that's OK.
@ddebroy consider omitting “draft” from the PR description at this point.
Signed-off-by: Deep Debroy <ddebroy@gmail.com>
PRR looks fine; I will wait until there is SIG approval though, because I am root in this repo.
> becomes especially relevant in multi-tenant clusters where individual tenants
> own the pod specs (including the set of init containers) while the cluster
> administrators are in charge of storage plugins, networking plugins and
> container runtime handlers.
Thanks for pointing out the mismatching interpretations of the `Initialized` condition between pods with init containers and without. But the issue isn't obvious to users, since we hide the pod sandbox creation completely in the implementation. To the end user, it reads more like: all init containers have run successfully, so now it is time to start the app containers. With the newly proposed pod condition `PodHasNetwork`, shouldn't the issue surface to end users now?
With init containers: `PodHasNetwork` -> `Initialized` -> ... -> `Ready`
Without init containers: `Initialized` -> `PodHasNetwork` -> ... -> `Ready`
Can we consolidate the above flows? We could move `Initialized` after `PodHasNetwork` to indicate that Kubelet can now start any app containers.
> But the issue isn't obvious to the users since we hide the pod sandbox creation completely in the implementation. To the end user, it is more like all init containers are successfully run, now it is the time to start the app containers. With the newly proposed PodCondition: PodHasNetwork, shouldn't the issue surface to the end users now?

The inconsistency already surfaces today when CSI/CRI/CNI is unable to mount volumes or launch the pod sandbox upon encountering errors: without init containers, the pod status reports the `Initialized` condition to be true, but Kubelet also reports error events like `FailedCreatePodSandBox`, `FailedMount`, etc. against the pod (which surface as part of `kubectl describe`, etc.).

> Can we consolidate above flows? We can move Initialized after PodHasNetwork to indicate Kubelet now can start any app containers.

We can certainly consider that. I will add that to the KEP post-merge of this PR.
We had a discussion on this in sig-node on July 5th. The suggestion was to carry on with the `PodHasNetwork` condition as described in this KEP, since the ordering of pod condition transitions should not be important. We can address the situation with the `Initialized` condition for pods without init containers (mentioned above) in a separate KEP, as that feels like an independent problem.
> What is out of scope for this KEP? Listing non-goals helps to focus discussion
> and make progress.
> -->
> - Modify the meaning of the existing `Initialized` condition
Please see my above comment.
Added some more comments, but overall the KEP is well written and lgtm. To unblock progress on such a useful feature, I will approve this to meet the deadline.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dchen1107, ddebroy, ehashman, johnbelamaric, qiutongs
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/unhold |
@ddebroy - Nice feature - I just added two minor comments. If you could adjust them in a follow-up PR, that would be great.
> automations, so be extremely careful here.
> -->
>
> No changes to any default behavior should result from enabling the feature.
I don't fully agree with this - by default, each pod will get a new condition set.
So if someone is doing equality comparisons, they will start seeing changes.
> [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
> -->
>
> No
It may potentially increase pod-startup-time (which includes reporting the state) if kubelet is starved on QPS limits.
[This is fine - I'm just pointing this out for completeness...]
One thing to note is that the Kubelet Status Manager caches all status updates, queuing and sending them to the API server in an async fashion. So the updates to the pod conditions (the existing ones as well as the new one added in this KEP) do not happen synchronously with the actual pod creation and initialization activities. Given this, do you think the above may still be a concern @wojtek-t ?
Yeah - I knew that. But still - if the conditions are not changing fast enough and individual phases take a bit of time, we may need to make more API calls. And in that case, it may increase pod startup time if kubelet is CPU-starved.
As I mentioned - I don't treat it as a concern or anything blocking - I just wanted it added for completeness.
KEP to surface a `PodHasNetwork` condition in pods to indicate when pod sandbox creation and network configuration of the pod through the CRI runtime is complete.