KEP-1040: Start drafting GA graduation criteria for API Priority and Fairness #3155
Conversation
/cc @wojtek-t
> - Satisfaction with LIST and WATCH support
> - APF allows us to disable client-side rate limiting (or we know the reason why not)
> - Satisfaction that the API is sufficient to support borrowing between priority levels
I would say maybe not necessarily "sufficient", but rather that we are convinced we can extend it in a backward-compatible way (i.e. we will not have to change detail fields, validation, defaulting, etc. for that purpose).
---
done
(branch updated from 0ccf823 to 669f968)
FYI, here is the description of how to add a field to an API object without bumping the version: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#new-field-in-existing-api-version
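A minimal sketch of what that guidance boils down to, using a hypothetical `WidgetSpec`/`NewKnob` pair invented for illustration:

```go
// Hypothetical type illustrating the linked guidance: a field added to an
// existing, already-released API version must be optional, carry the
// `omitempty` JSON tag, and treat absence as "the old behavior", so that
// objects written before the field existed remain valid and unchanged.
type WidgetSpec struct {
	Replicas int32 `json:"replicas"`

	// NewKnob is the newly added field (name invented for this sketch).
	// A nil value must mean "behave exactly as before the field existed".
	// +optional
	NewKnob *int32 `json:"newKnob,omitempty"`
}
```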
> GA:
>
> - Satisfaction with LIST and WATCH support
> - APF allows us to disable client-side rate limiting (or we know the reason why not)
If we cannot disable client-side rate limiting, why would we consider the feature complete?
---
In my opinion there are two aspects:
- whether, without client-side rate-limiting, the kube-apiserver (and etcd) fall over
- whether, without client-side rate-limiting, other parts of the system fall over because they cannot keep up with the load and accumulate a huge backlog

Let's take a specific example. Say the endpoints controller can keep up with a throughput of 100 pods/s in large enough clusters. Now, if we remove client-side rate-limiting completely, the scheduler would be able to schedule, say, 500 pods/s, and the endpoints controller would accumulate a backlog - so network programming latency would go extremely high.
From that perspective, I think APF does its job (we would be able to get rid of client-side rate-limiting - there is nothing more to do there), but I would still be reluctant to actually remove it from all components, to avoid hurting other aspects of the system. The improvements needed would be in completely different components, though, so they shouldn't block APF graduation.
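For concreteness, "disabling client-side rate limiting" in a client-go based component amounts to something like this sketch (the negative-QPS convention is client-go's documented way to turn its limiter off; the in-cluster config is just an assumption for the example):

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newUnthrottledClient builds a clientset with no client-side rate
// limiting, leaving server-side APF as the only throttle.
func newUnthrottledClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig() // assumption: component runs in-cluster
	if err != nil {
		return nil, err
	}
	// A negative QPS disables client-go's internal rate limiter
	// (as long as cfg.RateLimiter is left nil).
	cfg.QPS = -1
	return kubernetes.NewForConfig(cfg)
}
```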
---
That's a compelling reason. I'd like for that explanation or something similar to be expressed here. Priority and fairness, with client-side rate limiting disabled, must be sufficient for the kube-apiserver to survive. If other components require additional client-side rate limiting, that will not stop our GA, but the kube-apiserver must survive without it.
---
@MikeSpreitzer - can you please extend the text to incorporate it?
---
Interesting distinction here. In short, this harkens back to the point I have been making in other contexts: Kubernetes can be seen as two layers. The lower layer is an extensible API service, and the higher layer is built on that API service and consists of the resource definitions and controllers for managing containerized workload. APF is focused on the lower layer, but Kubernetes will not be safe from overloads until the higher layer also has protections.
---
I agree with the theory, but I'm skeptical that we actually have a slow controller like this.
I'm also skeptical that this is how people would choose to solve a problem -- I've heard of tons of latency problems, and I can't recall ever hearing someone suggest slowing down everything else in the system so that the slow thing can keep up.
And if you really want to slow down, e.g., the scheduler to keep endpoints from looking bad, APF does let you do that...
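To make that last point concrete: one could steer the scheduler's traffic into a deliberately small priority level via the flowcontrol API. The object names and share/queue numbers below are invented for the sketch; the types and constants come from k8s.io/api/flowcontrol/v1beta2:

```go
package main

import (
	flowcontrol "k8s.io/api/flowcontrol/v1beta2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// A deliberately small priority level; name and numbers are hypothetical.
var slowLane = flowcontrol.PriorityLevelConfiguration{
	ObjectMeta: metav1.ObjectMeta{Name: "slow-lane"},
	Spec: flowcontrol.PriorityLevelConfigurationSpec{
		Type: flowcontrol.PriorityLevelEnablementLimited,
		Limited: &flowcontrol.LimitedPriorityLevelConfiguration{
			AssuredConcurrencyShares: 5, // few shares = little concurrency
			LimitResponse: flowcontrol.LimitResponse{
				Type: flowcontrol.LimitResponseTypeQueue,
				Queuing: &flowcontrol.QueuingConfiguration{
					Queues:           8,
					HandSize:         4,
					QueueLengthLimit: 50,
				},
			},
		},
	},
}

// A FlowSchema routing the scheduler's requests into that level.
var schedulerToSlowLane = flowcontrol.FlowSchema{
	ObjectMeta: metav1.ObjectMeta{Name: "scheduler-slow-lane"},
	Spec: flowcontrol.FlowSchemaSpec{
		MatchingPrecedence:         500,
		PriorityLevelConfiguration: flowcontrol.PriorityLevelConfigurationReference{Name: "slow-lane"},
		Rules: []flowcontrol.PolicyRulesWithSubjects{{
			Subjects: []flowcontrol.Subject{{
				Kind: flowcontrol.SubjectKindUser,
				User: &flowcontrol.UserSubject{Name: "system:kube-scheduler"},
			}},
			ResourceRules: []flowcontrol.ResourcePolicyRule{{
				Verbs:        []string{flowcontrol.VerbAll},
				APIGroups:    []string{flowcontrol.APIGroupAll},
				Resources:    []string{flowcontrol.ResourceAll},
				ClusterScope: true,
				Namespaces:   []string{flowcontrol.NamespaceEvery},
			}},
		}},
	},
}
```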
---
Longer-term I agree with that.
Shorter-term I actually disagree.
Slow networking controllers (that simply can't keep up with other controllers) can actually cause an outage for your applications.
E.g. imagine that we can do a rolling upgrade super fast, but networking doesn't keep up, and when the last old pod is deleted the new ones are not yet added to the LB mechanism. That means a complete outage of your service.
So while I'm definitely all for disabling client-side rate limits eventually, I'm lacking a lot of confidence here, and I would be opposed to doing that before we prove it works. At the same time, blocking P&F graduation on that doesn't seem desirable.
---
Wasn't there a pod readiness thing done a while ago to address that load balancer scenario?
---
That doesn't necessarily mean everyone is using it. Recommended patterns are not necessarily quickly adopted by many users.
> - Satisfaction with LIST and WATCH support
> - APF allows us to disable client-side rate limiting (or we know the reason why not)
> - Satisfaction that the API can be extended in a backward-compatible way to support borrowing between priority levels
I'm not clear on why we would consider the feature GA without borrowing being implemented.
---
We are all using P&F in production now, and we've all seen cases where it actually prevented the system from falling over. So we have proof that it already provides significant value even without borrowing.
In my personal opinion, borrowing is an extension/feature on top of basic P&F, and (while I fully agree that we should start working on it now-ish) I don't think we should block the GA of the feature as a whole on it.
WDYT?
---
Without borrowing, we have reservations about really compressing the number of concurrent requests to a value small enough to keep clusters near the edge. I'd like to be able to shrink the number of concurrent requests, but that has significant negative impacts without borrowing unused priority.
cc @tkashem
---
I fully agree with the above. And I 100% agree we should do that.
My only concern is why this should be a GA release blocker. All of us are effectively using P&F in production, and I think at this point no-one will switch back to max-in-flight, as it's already visibly worse.
So the way I'm looking at it is "P&F is missing an important extension", not "P&F is not GA-quality".
---
We discussed it in the APF meeting today and decided to update the criteria to say that borrowing is working. One of the leading considerations is the expectation that working borrowing will lead to significant changes to configuration. This would be disruptive enough to deserve a corresponding version change from beta to GA.
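For context, the shape that borrowing eventually took at the API level (in the later flowcontrol v1beta3 types, not part of this PR's diff) is roughly the following paraphrased sketch:

```go
// Paraphrased from k8s.io/api/flowcontrol/v1beta3; comments abbreviated.
// These are the knobs that make borrowing between priority levels work.
type LimitedPriorityLevelConfiguration struct {
	// NominalConcurrencyShares sets this level's nominal share of the
	// server's total concurrency limit.
	NominalConcurrencyShares int32 `json:"nominalConcurrencyShares"`

	// LendablePercent is the portion of the nominal concurrency that
	// other priority levels may borrow from this one.
	// +optional
	LendablePercent *int32 `json:"lendablePercent,omitempty"`

	// BorrowingLimitPercent caps how much this level may borrow from
	// others, as a percentage of its nominal concurrency.
	// +optional
	BorrowingLimitPercent *int32 `json:"borrowingLimitPercent,omitempty"`
}
```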
/lgtm
> GA:
>
> - Satisfaction with LIST and WATCH support.
> - API annotations properly support strategic merge patch.
What does this mean?
---
This means we botched the field tags when we first wrote them.
---
Oh, you literally mean the tags on fields? OK
---
(I don't think this is a GA requirement personally; it's more a tactical need until we get SSA turned on in integration tests)
---
Compare with https://github.com/kubernetes/api/blob/v0.23.5/flowcontrol/v1beta2/types.go#L353-L356, which has types that do it right.
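For readers who don't have the tags memorized, here is a generic sketch of the pattern being pointed at (the type name is made up; the markers and struct tags are the real apimachinery ones):

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ExampleStatus is hypothetical; only the tag pattern matters. The
// patchStrategy/patchMergeKey struct tags, together with the matching
// +listType and +listMapKey markers, tell strategic merge patch and
// server-side apply to merge list entries by their `type` key instead
// of replacing the whole list on update.
type ExampleStatus struct {
	// +listType=map
	// +listMapKey=type
	// +patchMergeKey=type
	// +patchStrategy=merge
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
}
```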
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, MikeSpreitzer

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Start drafting GA graduation criteria for API Priority and Fairness
Issue #1040