
WIP: DNM: Hack on kep 3085 #3501

Closed · wants to merge 2 commits

Conversation

@thockin (Member) commented Sep 9, 2022:

I didn't get to review 3085 and there is no good way to add comments to a merged KEP, so I am trying this approach.

@k8s-ci-robot added the do-not-merge/work-in-progress and cncf-cla: yes labels Sep 9, 2022

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: thockin
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval by writing /assign @derekwaynecarr in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the kind/kep label Sep 9, 2022
@k8s-ci-robot added the sig/node label Sep 9, 2022
@thockin marked this pull request as draft September 9, 2022 20:35
@k8s-ci-robot added the size/XXL label Sep 9, 2022
> controller for managing pods that refer to PVCs associated with node local
> storage (e.g. Rook-Ceph) may decide to recreate PVCs (based on a specified PVC
> template in the custom resource the controller is managing) if the sandbox
> creation is repeatedly failing to complete, indicated by the new `PodHasNetwork`
Member:

Seems like we should have a status for the sandbox overall instead of specifically "has network"?

I see #3087 (review), but it still seems like the motivation is knowing whether the sandbox is ready, and because the sandbox is somewhat nebulous I think it's wrong to couple directly to "has network" when we're really looking for "CRI finished sandbox setup"?

> around pod sandbox creation latency, surfacing the `PodHasNetwork` condition in
> pod status will allow a metric collection service to directly access the
> relevant data without requiring the ability to collect and parse OpenTelemetry
> traces. As mentioned in the User Stories, popular community managed services
Member:

But if the user is looking for more granular data on timings, surely a common tracing standard is the answer?

How granular would we eventually go with conditions vs tracing?
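
For concreteness, the "direct access" the quoted KEP text describes might look like the following client-go sketch. This is a minimal sketch, not an implementation from the KEP: the namespace and pod name are placeholders, and only the `PodHasNetwork` condition string comes from the proposal.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Standard in-cluster client-go wiring for a metrics collection service.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// "default"/"my-pod" are placeholders for whatever pod is being watched.
	pod, err := cs.CoreV1().Pods("default").Get(context.TODO(), "my-pod", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, c := range pod.Status.Conditions {
		// "PodHasNetwork" is the condition type proposed by KEP-3085.
		if string(c.Type) == "PodHasNetwork" && c.Status == corev1.ConditionTrue {
			fmt.Println("sandbox network configured at", c.LastTransitionTime.Time)
		}
	}
}
```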

> `PodHasNetwork` condition reporting `true`) but the pod's containers fail to
> become ready. Further details of this are covered in a [user story](#story-2-consuming-podhasnetwork-condition-in-a-controller) below.
>
> When a pod's sandbox no longer exists, the `status` of `PodHasNetwork` condition
Contributor:

When I saw PodHasNetwork, my first thought was that it is an "is the network working right now?" type of indicator. Should this be named something like PodNetworkConfigured?

> runtime and associated runtime handlers, CNI plugins, etc. The duration between
> `lastTransitionTime` field of the `PodHasNetwork` condition (with `status` set
> to `true` for a pod for the first time) and the existing `PodScheduled`
> condition will allow metrics collection services to compute total latency of all
Contributor:

I'm concerned about using transition times of different conditions to measure duration of operations. For phases that must be performed in series and that have a guaranteed stable ordering, this would seem okay, but it seems like network configuration could happen in parallel with other steps, maybe as an optimization in the future. Also, in the future, if there needs to be some step after network it looks like it would mess up metrics collection (i.e. if X happens after network configuration, you could end up measuring the latency of network configuration + latency of X).
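
To make the measurement under discussion concrete, here is a minimal sketch of the subtraction the quoted text relies on, assuming both conditions are present and `true`. As the comment above notes, the result is only a faithful "sandbox setup" latency if every step between the two transitions runs strictly in series.

```go
package podmetrics

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sandboxSetupLatency returns PodHasNetwork.lastTransitionTime minus
// PodScheduled.lastTransitionTime, the SLI described in the quoted excerpt.
// "PodHasNetwork" is the condition type proposed by KEP-3085.
func sandboxSetupLatency(pod *corev1.Pod) (time.Duration, bool) {
	var scheduled, hasNetwork *metav1.Time
	for i := range pod.Status.Conditions {
		c := &pod.Status.Conditions[i]
		if c.Status != corev1.ConditionTrue {
			continue
		}
		switch c.Type {
		case corev1.PodScheduled:
			scheduled = &c.LastTransitionTime
		case corev1.PodConditionType("PodHasNetwork"):
			hasNetwork = &c.LastTransitionTime
		}
	}
	if scheduled == nil || hasNetwork == nil {
		return 0, false // one of the two milestones has not been reported yet
	}
	return hasNetwork.Time.Sub(scheduled.Time), true
}
```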

> This also prevents other controllers from acting on sandbox status as
> mentioned in User Stories section.
>
> ### Report sandbox creation stages using Kubelet tracing
Contributor:

Would something along these lines be viable? If we continued to add monitoring of pod sandbox latencies for other things over time, how far can we go? If we want more granular monitoring, would we be okay with having 10 or even 50 conditions? My hunch is that some type of tracing approach, or some other way to export metrics data scales much better.
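
As a sketch of the tracing direction this comment suggests, per-stage spans scale to arbitrarily many steps without adding API fields. All span names here are invented for illustration; kubelet's actual tracing instrumentation may differ.

```go
package kubeletsketch

import (
	"context"

	"go.opentelemetry.io/otel"
)

// runPodSandbox shows the shape of per-stage tracing: every sandbox setup
// step becomes a child span instead of a new pod condition.
func runPodSandbox(ctx context.Context) {
	tracer := otel.Tracer("kubelet")

	ctx, span := tracer.Start(ctx, "RunPodSandbox") // hypothetical span name
	defer span.End()

	_, netSpan := tracer.Start(ctx, "SetUpPodNetwork") // hypothetical stage
	// ... invoke the CNI plugin here ...
	netSpan.End()

	_, imgSpan := tracer.Start(ctx, "PullSandboxImage") // hypothetical stage
	// ... pull the pause image here ...
	imgSpan.End()
}
```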

> runtime and associated runtime handlers, CNI plugins, etc. The duration between
> `lastTransitionTime` field of the `PodHasNetwork` condition (with `status` set
> to `true` for a pod for the first time) and the existing `PodScheduled`
> condition will allow metrics collection services to compute total latency of all
@jpbetz (Contributor) commented Sep 9, 2022:

Are we proposing any in-tree metrics collection? Or is the intention to make it possible for 3rd-party metrics collection of network configuration? (I'm assuming the latter, but wanted to double check.)

@thockin (Member, Author) left a comment:

Hi, I am sorry I didn't see this one before it went in for alpha. I have some real reservations about this at multiple layers.

First, this makes an assumption that I don't think necessarily holds - that network is the last step in setup. You're calling it "has network" but you MEAN "sandbox is done". Why are we mis-naming it?

Second, it's possible that we may yet support multi-network pods in Kube. There's a discussion going in sig-net. If/when that lands, does this now mean "has all networks" or "has one network"?

Third, I am not a fan of Conditions in general, and this feels like a bit of a force-fit. All through reading I kept asking "why not use a new field which exists FOR THIS PURPOSE?" Conditions are supposed to be orthogonal - not related to each other. Describing them in a sequence means you have defined a state-machine, and that makes a TERRIBLE API.

I really like the main idea (or at least what I interpret as the main idea) - publish information about how long pod startup took and which components ate how much of that time. This feels like a partial answer at best.

Stepping back, we have finalizers and we have readiness gates, which are kind-of opposite ends, but they both cover time while the pod is running. This KEP is covering time before it is running, but the pattern - "a set of arbitrary check-boxes needs to be checked" - seems to hold. Can we / should we embrace that?

For example, imagine we kept a list of "steps" with timestamps. Similar to conditions but actually designed for this, no baggage. kubelet could look at the pod at admission and build up a list of the steps it knows (which volumes, which networks, etc). It could record the start and stop times (or just deltas) for each step. This does not assume linearity or concurrency - that's up to kubelet to manage. It's also extensible - so new components could add steps without breaking the published state-machine (which is what this really becomes, as-is). A rough rendering of the idea is sketched below.
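
A rough Go rendering of that "list of steps" idea might look like the following. All names here are hypothetical; nothing in this sketch is an existing Kubernetes API.

```go
package podsetup

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PodSetupStep records one setup step kubelet performed (or is performing)
// for a pod. Steps are not assumed to run in series: consumers derive
// ordering, overlap, and per-step latency from the timestamps alone.
type PodSetupStep struct {
	Name      string       // e.g. "volume:pvc-foo", "network:default", "image:pause"
	StartTime metav1.Time  // when kubelet began the step
	EndTime   *metav1.Time // nil while the step is still in progress
}

// PodSetupStatus is the list kubelet would build at admission and fill in
// as components finish; new components add steps without extending a
// published state-machine.
type PodSetupStatus struct {
	Steps []PodSetupStep
}
```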

Lastly - I am still confused about "Initialized". Does that mean "ready to run non-init-containers"? It sounds like it is muddy. Would be nice to have a diagram showing the actual timeline, e.g.

|--<scheduled>-|-<admitted>-|-<setup begun>-|-<components...>-|-<setup complete>-|-<initcontainers begun>-|-<initcontainers done>-|-<running>-|-<deleted>-|-<finalizers>--|

So I do not think this KEP should proceed beyond alpha as it is written; at least, I feel like I need to understand a whole lot better why these tradeoffs are correct. I am happy to get on a meeting some time (kubecon?) and discuss.

> container runtime handlers.
>
> Conclusion of the creation of the pod sandbox is marked by the presence of a
> sandbox with networking configured. A new dedicated condition marking this -
Member Author:

Why is "has network" necessarily the last step? Is that part of a contract I do not know about?

What I mean is: it seems like there's a list (or graph?) of things that have to happen as a pod starts. That list may not be knowable a priori (e.g. a "special" RuntimeClass might have extra steps that a "regular" one wouldn't), so I am immediately wary of this sort of statement.

> `lastTransitionTime` field of the `PodHasNetwork` condition (with `status` set
> to `true` for a pod for the first time) and the existing `PodScheduled`
> condition will allow metrics collection services to compute total latency of all
> the components involved in pod sandbox creation as an SLI. Cluster operators can
Member Author:

I really like the idea of creating a way to measure the contribution of each "step" to the total startup time of a pod. That said, are all of the steps guaranteed to be linear? Or can they happen in parallel (e.g. I can pull an image while I mount volumes and configure network all at the same time)? How do we represent that?

> Timestamps around completion of pod sandbox creation may be surfaced as a
> dedicated field in the pod status rather than a pod condition. However, since
> the successful creation of pod sandbox is essentially a "milestone" in the life
> of a pod (similar to Scheduled, Ready, etc), pod conditions are the ideal place
Member Author:

I have argued in the past that "Ready" is a terrible use of conditions, so I have a problem with using "we did that" as an excuse to do it again. Past mistakes are still mistakes.


> Each infrastructural plugin that Kubelet calls out to (in the process of setting
> up a pod sandbox) can mark start and completion timestamps on the pod as
> conditions. This approach would be similar to how readiness gates work today.
Member Author:

I'm not quite buying this - if kubelet is the internal "switchboard" for running messages between plugins, that feels like a purely internal thing - don't make these plugins THEMSELVES update the "pre-readiness gates", make kubelet do it.
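
A sketch of that "switchboard" shape, with every name hypothetical: plugins hand timings to kubelet over an internal interface, and only kubelet writes pod status.

```go
package switchboard

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SetupTiming is what a plugin reports back to kubelet; plugins never
// touch the API server themselves.
type SetupTiming struct {
	Plugin string // e.g. "cni", "csi:pvc-foo" (illustrative)
	Start  metav1.Time
	End    metav1.Time
}

// recordTiming is the single kubelet-side place that mutates pod status.
// Recording the timing as a condition mirrors the alternative quoted above;
// the condition type shown is invented for illustration.
func recordTiming(pod *corev1.Pod, t SetupTiming) {
	pod.Status.Conditions = append(pod.Status.Conditions, corev1.PodCondition{
		Type:               corev1.PodConditionType("setup.example.k8s.io/" + t.Plugin),
		Status:             corev1.ConditionTrue,
		LastTransitionTime: t.End,
	})
}
```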

@k8s-ci-robot added the needs-rebase label Sep 23, 2022

@k8s-ci-robot (Contributor):

@thockin: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@thockin self-assigned this Sep 29, 2022

@thockin (Member, Author) commented Sep 29, 2022:

FTR: I am not expecting dramatic action on 1.26, given timing, but this CAN NOT proceed to beta without discussion.

@ddebroy (Member) commented Sep 29, 2022:

ack, @thockin. Appreciate the review and the thoughts. I will DM to find a time to meet (virtually or during Kubecon).

I really like the idea of a list of "steps" with timestamps. Similar to conditions but actually designed for this, no baggage. We indeed tried to force-fit our requirement into pod conditions, mainly based on the existing Initialized pod condition, but decided not to alter that for compatibility reasons.

@k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 28, 2022
@thockin removed the lifecycle/stale label Jan 14, 2023
@thockin (Member, Author) commented Jan 14, 2023:

@ddebroy what did we decide to do about this?

@thockin closed this Jan 29, 2023