KEP-5598: Extend opportunistic batching with rescoring#6039
Conversation
|
Hi @romanbaron. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
singh1203
left a comment
There was a problem hiding this comment.
Hey @romanbaron, thanks for including me. I went through the full diff carefully. Overall, the rescoring design is clean and well written. I just had one question, and thanks for answering it. 🙇
Overall, it looks good to me.
afbce01 to
38281e1
Compare
eda4b71 to
5734340
Compare
e9c449d to
703870e
Compare
703870e to
8119e15
Compare
|
The PR is now rebased on top of the refactored KEP and only includes Rescore-related changes. |
|
/ok-to-test |
Co-authored-by: Toru Komatsu <k0ma@utam0k.jp>
7d19290 to
f4cd3a0
Compare
macsko
left a comment
There was a problem hiding this comment.
The KEP looks good overall, thanks for doing this!
|
I've read through it, and it looks good. This is a great update! Thanks! |
|
/assign |
dom4ha
left a comment
There was a problem hiding this comment.
The approach with rescoring LGTM. I think we could optimize it further with caching cycleState as well, but I'd treat it as optional implementation detail, not as the core element of the solution, since it assumes all plugins implement it.
Leaving the final approval on @sanposhiho
|
|
||
| 1. Node A is re-inserted into the cached sorted list. Since it is still feasible, it can host | ||
| additional pods. | ||
| 2. `Score` is called for each scoring plugin against node A using the current `CycleState`. The |
There was a problem hiding this comment.
I think we could cache the cycleState as well to avoid rerunning PreFilter. All we'd need to do is to run PreFilterExtension.AddPod on a pod assumed in the previous cycle.
There was a problem hiding this comment.
I think not all plugins that have the PreFilter implement the PreFilterExtension.AddPod. AddPod is expected to be used in the preemption, so plugins that know the preemption doesn't matter for them, can skip defining AddPod/RemovePod. At some point we could try to cache the cycleState where possible, but I wouldn't consider it for now
There was a problem hiding this comment.
Yeah, I already mentioned it as optional, so it's not a blocking comment for this KEP.
Anyway, we could make PreFilterExtension implementation either required in Signature or at least detect whether it's implemented before we can use it.
It should give the Opportunistic Batching a significant boost as PreFilter scales with a cluster size, but we treat it immutable anyway within opportunistic batching cache. So I think it's worth to consider it not in this KEP update, but in the future ones.
wojtek-t
left a comment
There was a problem hiding this comment.
The PRR part (unsurprisingly) still looks fine.
| rescoring. | ||
| 4. **Last chosen node feasibility check:** Filter plugins are run against the node chosen in the | ||
| previous cycle. Two outcomes are possible: | ||
| - **Infeasible:** The node is full (one-pod-per-node case). The node is discarded and batch |
There was a problem hiding this comment.
nit: it doesn't necessary mean the node is full or really be pod-per-node - it may e.g. be result of using node-port where scheduling more pods from our batch is impossible
|
|
||
| 1. Node A is re-inserted into the cached sorted list. Since it is still feasible, it can host | ||
| additional pods. | ||
| 2. `Score` is called for each scoring plugin against node A using the current `CycleState`. The |
There was a problem hiding this comment.
The way I read it is that we recompute PreFilter state as in each new cycle. I had a proposal to reuse (cache) the cycleState of the previous cycle in the comment #6039 (comment), but only as a future extension
|
|
||
| ## Design Details | ||
|
|
||
| ### Pod signature |
There was a problem hiding this comment.
we must make sure unsignable is returned when cross-node scoring scheduling is used on pods (preferred pod affinity, preferred topology spread, etc)
|
|
||
| 1. Node A is re-inserted into the cached sorted list. Since it is still feasible, it can host | ||
| additional pods. | ||
| 2. `Score` is called for each scoring plugin against node A using the current `CycleState`. The |
There was a problem hiding this comment.
The way I read it is that we recompute PreFilter state as in each new cycle. I had a proposal to reuse (cache) the cycleState of the previous cycle in the comment #6039 (comment), but only as a future extension
|
|
||
| 1. Node A is re-inserted into the cached sorted list. Since it is still feasible, it can host | ||
| additional pods. | ||
| 2. `Score` is called for each scoring plugin against node A using the current `CycleState`. The |
There was a problem hiding this comment.
Scoreis called for each scoring plugin against node A using the currentCycleState.
But, CycleState doesn't have any state calculated at PreScore at this point?
There was a problem hiding this comment.
A bit of more summery of our internal discussion as a record (not only about this prescore stuff)
Running PreScore of all plugins is not very cheap, and we definitely need to discuss how we can make it efficient in the next release cycle. Very likely, we need to reuse the cyclestate from the last pod somehow.
Also, another discussion was whether we want to rerun Score() for all cached nodes. The current KEP says we just run Score() for the node that the last pod selected. But, that means we don't support the support of cross-node scoring. The current opportunistic batching doesn't support the default scheduling config because the default scheduling config has a preferred topology spread, which is a cross-node scoring. So, actually supporting such cross-node scoring is important to make this feature more widely usable.
We have two options to discuss in the next release cycle-
- If we drop the support of cross-node scoring for now, we can implement
PreScoreExtension.AddPodto modify the cyclestate of the last pod. Then, the flow will be PreScoreExtension.AddPod(last pod) -> Score(selected node) -> NormalizeScore(all cached nodes) - If we want to support cross-node scoring, there are two options further
a. PreScoreExtension.AddPod(last pod) -> Score(all cached nodes) -> NormalizeScore(all cached nodes)
b. implement a new ext pointRescore(), which will do (a) essentially but more efficiently.
We don't decide it now, but will discuss in the next release KEP cycle
|
/approve @romanbaron is on vacation and cannot modify the KEP for this comment- Approving the PR because all the leads are on the same page. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: romanbaron, sanposhiho, singh1203 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Add rescoring to handle multi-pod-per-node workloads: when the last chosen node remains feasible, rescore it in-place and continue batching rather than flushing the cache.
AI tooling was used to assist in preparing this PR. All changes have been reviewed and verified by the author.