WIP: DNM: Hack on kep 3085 #3501
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: thockin. The full list of commands accepted by this bot can be found here.
controller for managing pods that refer to PVCs associated with node local
storage (e.g. Rook-Ceph) may decide to recreate PVCs (based on a specified PVC
template in the custom resource the controller is managing) if the sandbox
creation is repeatedly failing to complete, indicated by the new `PodHasNetwork`
Seems like we should have a status for the sandbox overall instead of specifically "has network" ?
I see #3087 (review), but it still seems like the motivation is knowing if the sandbox is ready, and because the sandbox is somewhat nebulous, I think it's wrong to couple directly to "has network" when we're really looking for "CRI finished sandbox setup"?
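(For concreteness, a controller consuming the proposed condition would do something along these lines. This is only a sketch against the existing core/v1 types; the `podHasNetworkCondition` constant and the `sandboxReady` helper are made-up names, and the condition type string is the one the KEP proposes.)

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podHasNetworkCondition is the condition type proposed by the KEP.
const podHasNetworkCondition corev1.PodConditionType = "PodHasNetwork"

// sandboxReady reports whether sandbox creation has completed for the pod,
// as signalled by the PodHasNetwork condition having status True.
func sandboxReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == podHasNetworkCondition {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

func main() {
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			Conditions: []corev1.PodCondition{
				{Type: podHasNetworkCondition, Status: corev1.ConditionTrue},
			},
		},
	}
	fmt.Println(sandboxReady(pod)) // prints: true
}
```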
around pod sandbox creation latency, surfacing the `PodHasNetwork` condition in
pod status will allow a metric collection service to directly access the
relevant data without requiring the ability to collect and parse OpenTelemetry
traces. As mentioned in the User Stories, popular community managed services
But if the user is looking for more granular data on timings, surely a common tracing standard is the answer?
How granular would we eventually go with conditions vs tracing?
`PodHasNetwork` condition reporting `true`) but the pod's containers fail to
become ready. Further details of this are covered in a [user story](#story-2-consuming-podhasnetwork-condition-in-a-controller) below.

When a pod's sandbox no longer exists, the `status` of the `PodHasNetwork` condition
When I see `PodHasNetwork`, my first thought was that this is an "is the network working right now?" type indicator. Should this be named something like `PodNetworkConfigured`?
runtime and associated runtime handlers, CNI plugins, etc. The duration between
`lastTransitionTime` field of the `PodHasNetwork` condition (with `status` set
to `true` for a pod for the first time) and the existing `PodScheduled`
condition will allow metrics collection services to compute total latency of all
I'm concerned about using transition times of different conditions to measure duration of operations. For phases that must be performed in series and that have a guaranteed stable ordering, this would seem okay, but it seems like network configuration could happen in parallel with other steps, maybe as an optimization in the future. Also, in the future, if there needs to be some step after network it looks like it would mess up metrics collection (i.e. if X happens after network configuration, you could end up measuring the latency of network configuration + latency of X).
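(For illustration, the SLI the quoted text describes would be computed roughly as follows. This sketch reuses the imports and the `podHasNetworkCondition` constant from the earlier snippet plus the standard `time` package; `sandboxSetupLatency` is a made-up helper, and, per the concern above, the number is only meaningful while network setup remains the final, serially ordered step of sandbox creation.)

```go
// sandboxSetupLatency returns the duration between the PodScheduled and
// PodHasNetwork conditions first reporting True, and false if either
// condition is missing or not yet True.
func sandboxSetupLatency(pod *corev1.Pod) (time.Duration, bool) {
	var scheduled, hasNetwork *corev1.PodCondition
	for i := range pod.Status.Conditions {
		c := &pod.Status.Conditions[i]
		switch c.Type {
		case corev1.PodScheduled:
			scheduled = c
		case podHasNetworkCondition:
			hasNetwork = c
		}
	}
	if scheduled == nil || hasNetwork == nil ||
		scheduled.Status != corev1.ConditionTrue ||
		hasNetwork.Status != corev1.ConditionTrue {
		return 0, false
	}
	return hasNetwork.LastTransitionTime.Sub(scheduled.LastTransitionTime.Time), true
}
```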
This also prevents other controllers from acting on sandbox status as
mentioned in User Stories section.

### Report sandbox creation stages using Kubelet tracing
Would something along these lines be viable? If we continued to add monitoring of pod sandbox latencies for other things over time, how far can we go? If we want more granular monitoring, would we be okay with having 10 or even 50 conditions? My hunch is that some type of tracing approach, or some other way to export metrics data scales much better.
runtime and associated runtime handlers, CNI plugins, etc. The duration between
`lastTransitionTime` field of the `PodHasNetwork` condition (with `status` set
to `true` for a pod for the first time) and the existing `PodScheduled`
condition will allow metrics collection services to compute total latency of all
Are we proposing any in-tree metrics collection? Or is the intention to make it possible for 3rd-party metrics collection of network configuration? (I'm assuming the latter, but wanted to double check.)
Hi, I am sorry I didn't see this one before it went in for alpha. I have some real reservations about this at multiple layers.
First, this makes an assumption that I don't think necessarily holds - that network is the last step in setup. You're calling it "has network" but you MEAN "sandbox is done". Why are we mis-naming it?
Second, it's possible that we may yet support multi-network pods in Kube. There's a discussion going in sig-net. If/when that lands, does this now mean "has all networks" or "has one network"?
Third, I am not a fan of Conditions, in general, and this feels like a bit of a force-fit. All through reading I kept asking "why not use a new field which exists FOR THIS PURPOSE?" Conditions are supposed to be orthogonal - not related to each other. Describing them in a sequence means you have defined a state machine, and that makes a TERRIBLE API.
I really like the main idea (or at least what I interpret as the main idea) - publish information about how long pod startup took and which components ate how much of that time. This feels like a partial answer at best.
Stepping back, we have finalizers and we have readiness gates, which are kind-of opposite ends, but they both cover time while the pod is running. This KEP is covering time before it is running, but the pattern - "a set of arbitrary check-boxes need to be checked" - seems to hold. Can we / should we embrace that?
For example, imagine we kept a list of "steps" with timestamps. Similar to conditions but actually designed for this, no baggage. kubelet could look at the pod at admission and build up a list of the steps it knows (which volumes, which networks, etc). It could record the start and stop times (or just deltas) for each step. This does not assume linearity or concurrency - that's up to kubelet to manage. It's also extensible - so new components could add steps without breaking the published state machine (which is what this really becomes, as-is).
Lastly - I am still confused about "Initialized". Does that mean "ready to run non-init-containers"? It sounds like it is muddy. Would be nice to have a diagram showing the actual timeline, e.g.
|--<scheduled>-|-<admitted>-|-<setup begun>-|-<components...>-|-<setup complete>-|-<initcontainers begun>-|-<initcontainers done>-|-<running>-|-<deleted>-|-<finalizers>--|
So I do not think this KEP should proceed beyond alpha as it is written; at least I feel like I need to understand a whole lot better why these tradeoffs are correct. I am happy to get on a meeting some time (kubecon?) and discuss.
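(To make the "list of steps with timestamps" idea from the comment above concrete, a hypothetical status field could look roughly like this. None of these types exist in the Kubernetes API; all names are illustrative, and the usual `metav1` import from `k8s.io/apimachinery/pkg/apis/meta/v1` is assumed.)

```go
// PodSetupStep is a purely hypothetical record of one piece of pre-start
// work (volume mount, network attachment, image pull, ...).
type PodSetupStep struct {
	// Name identifies the step, e.g. "volume:data" or "network:default".
	// Components performing the work would choose their own names.
	Name string `json:"name"`
	// StartedAt and FinishedAt bracket the step. Steps may overlap, so no
	// ordering or linearity is implied by the list.
	StartedAt  metav1.Time  `json:"startedAt,omitempty"`
	FinishedAt *metav1.Time `json:"finishedAt,omitempty"`
}

// PodSetupStatus would hold the steps kubelet discovered at admission and
// recorded as it performed them.
type PodSetupStatus struct {
	Steps []PodSetupStep `json:"steps,omitempty"`
}
```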
container runtime handlers.

Conclusion of the creation of the pod sandbox is marked by the presence of a
sandbox with networking configured. A new dedicated condition marking this -
Why is "has network" necessarily the last step? Is that part of a contract I do not know about?
What I mean is: It seems like theres a list (or graph?) of things that have to happen as a pod starts. That list may not be knowable a priori (e.g. a "special" RuntimeClass might maybe have extra steps that a "regular one wouldn't") so I am immediately wary of this sort of statement.
`lastTransitionTime` field of the `PodHasNetwork` condition (with `status` set
to `true` for a pod for the first time) and the existing `PodScheduled`
condition will allow metrics collection services to compute total latency of all
the components involved in pod sandbox creation as an SLI. Cluster operators can
I really like the idea of creating a way to measure the contribution of each "step" to the total startup time of a pod. That said, are all of the steps guaranteed to be linear? Or can they happen in parallel (e.g. I can pull an image while I mount volumes and configure network all at the same time)? How do we represent that?
Timestamps around completion of pod sandbox creation may be surfaced as a
dedicated field in the pod status rather than a pod condition. However, since
the successful creation of pod sandbox is essentially a "milestone" in the life
of a pod (similar to Scheduled, Ready, etc), pod conditions are the ideal place
I have argued in the past that "Ready" is a terrible use of conditions, so I have a problem with using "we did that" as an excuse to do it again. Past mistakes are still mistakes.
Each infrastructural plugin that Kubelet calls out to (in the process of setting
up a pod sandbox) can mark start and completion timestamps on the pod as
conditions. This approach would be similar to how readiness gates work today.
I'm not quite buying this - if kubelet is the internal "switchboard" for running messages between plugins, that feels like a purely internal thing - don't make these plugins THEMSELVES update the "pre-readiness gates", make kubelet do it.
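(For comparison, this is roughly the shape readiness gates have today; a setup-time analogue would presumably look similar, with kubelet rather than each plugin patching the corresponding conditions into pod status, as the comment above suggests. The sketch reuses the `corev1` import from the first snippet, and the condition type string is made up.)

```go
// examplePodWithReadinessGate shows today's readiness-gate shape: the pod
// spec names an extra condition type, and something outside the container
// runtime patches the matching condition into pod status.
func examplePodWithReadinessGate() *corev1.Pod {
	return &corev1.Pod{
		Spec: corev1.PodSpec{
			ReadinessGates: []corev1.PodReadinessGate{
				// "example.com/network-configured" is an illustrative,
				// made-up condition type.
				{ConditionType: corev1.PodConditionType("example.com/network-configured")},
			},
		},
	}
}
```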
@thockin: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
FTR: I am not expecting dramatic action on 1.26, given timing, but this CAN NOT proceed to beta without discussion.
ack, @thockin. Appreciate the review and the thoughts. I will DM to find a time to meet (virtually or during Kubecon). I really like the idea of
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
@ddebroy what did we decide to do about this?
I didn't get to review 3085 and there is no good way to add comments to a merged KEP, so I am trying this approach.