Skip to content

KEP-5598: Extend opportunistic batching with rescoring#6039

Merged
k8s-ci-robot merged 5 commits into
kubernetes:masterfrom
romanbaron:opportunistic-batching-rescore
Jun 17, 2026
Merged

KEP-5598: Extend opportunistic batching with rescoring#6039
k8s-ci-robot merged 5 commits into
kubernetes:masterfrom
romanbaron:opportunistic-batching-rescore

Conversation

@romanbaron

Copy link
Copy Markdown
Contributor
  • One-line PR description:
    Add rescoring to handle multi-pod-per-node workloads: when the last chosen node remains feasible, rescore it in-place and continue batching rather than flushing the cache.
  • Other comments:
    AI tooling was used to assist in preparing this PR. All changes have been reviewed and verified by the author.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 29, 2026
@k8s-ci-robot k8s-ci-robot requested a review from dom4ha April 29, 2026 08:45
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Apr 29, 2026
@k8s-ci-robot k8s-ci-robot requested a review from macsko April 29, 2026 08:45
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Apr 29, 2026
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 29, 2026
@github-project-automation github-project-automation Bot moved this to Needs Triage in SIG Scheduling Apr 29, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

Hi @romanbaron. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 29, 2026

@singh1203 singh1203 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @romanbaron, thanks for including me. I went through the full diff carefully. Overall, the rescoring design is clean and well written. I just had one question, and thanks for answering it. 🙇
Overall, it looks good to me.

Comment thread keps/sig-scheduling/5598-opportunistic-batching/README.md
Comment thread keps/sig-scheduling/5598-opportunistic-batching/kep.yaml Outdated
Comment thread keps/sig-scheduling/5598-opportunistic-batching/kep.yaml Outdated
Comment thread keps/sig-scheduling/5598-opportunistic-batching/README.md Outdated
Comment thread keps/sig-scheduling/5598-opportunistic-batching/README.md
Comment thread keps/sig-scheduling/5598-opportunistic-batching/README.md Outdated
@romanbaron romanbaron force-pushed the opportunistic-batching-rescore branch from afbce01 to 38281e1 Compare May 4, 2026 08:23
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 4, 2026
@romanbaron romanbaron force-pushed the opportunistic-batching-rescore branch from eda4b71 to 5734340 Compare May 4, 2026 13:59
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 4, 2026
@romanbaron romanbaron force-pushed the opportunistic-batching-rescore branch 2 times, most recently from e9c449d to 703870e Compare May 4, 2026 16:17
@romanbaron romanbaron force-pushed the opportunistic-batching-rescore branch from 703870e to 8119e15 Compare May 4, 2026 16:43
@romanbaron

Copy link
Copy Markdown
Contributor Author

The PR is now rebased on top of the refactored KEP and only includes Rescore-related changes.

@macsko

macsko commented May 5, 2026

Copy link
Copy Markdown
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 5, 2026
Comment thread keps/sig-scheduling/5598-opportunistic-batching/README.md
Comment thread keps/sig-scheduling/5598-opportunistic-batching/README.md
Comment thread keps/sig-scheduling/5598-opportunistic-batching/README.md Outdated

@macsko macsko left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The KEP looks good overall, thanks for doing this!

Comment thread keps/sig-scheduling/5598-opportunistic-batching/README.md Outdated
@utam0k

utam0k commented May 8, 2026

Copy link
Copy Markdown
Member

I've read through it, and it looks good. This is a great update! Thanks!

@sanposhiho

Copy link
Copy Markdown
Member

/assign

@pacoxu pacoxu mentioned this pull request Jun 4, 2026
18 tasks
@helayoty helayoty moved this from Needs Triage to Needs Review in SIG Scheduling Jun 16, 2026

@dom4ha dom4ha left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach with rescoring LGTM. I think we could optimize it further with caching cycleState as well, but I'd treat it as optional implementation detail, not as the core element of the solution, since it assumes all plugins implement it.

Leaving the final approval on @sanposhiho


1. Node A is re-inserted into the cached sorted list. Since it is still feasible, it can host
additional pods.
2. `Score` is called for each scoring plugin against node A using the current `CycleState`. The

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could cache the cycleState as well to avoid rerunning PreFilter. All we'd need to do is to run PreFilterExtension.AddPod on a pod assumed in the previous cycle.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not all plugins that have the PreFilter implement the PreFilterExtension.AddPod. AddPod is expected to be used in the preemption, so plugins that know the preemption doesn't matter for them, can skip defining AddPod/RemovePod. At some point we could try to cache the cycleState where possible, but I wouldn't consider it for now

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I already mentioned it as optional, so it's not a blocking comment for this KEP.

Anyway, we could make PreFilterExtension implementation either required in Signature or at least detect whether it's implemented before we can use it.

It should give the Opportunistic Batching a significant boost as PreFilter scales with a cluster size, but we treat it immutable anyway within opportunistic batching cache. So I think it's worth to consider it not in this KEP update, but in the future ones.

@wojtek-t wojtek-t left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PRR part (unsurprisingly) still looks fine.

rescoring.
4. **Last chosen node feasibility check:** Filter plugins are run against the node chosen in the
previous cycle. Two outcomes are possible:
- **Infeasible:** The node is full (one-pod-per-node case). The node is discarded and batch

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it doesn't necessary mean the node is full or really be pod-per-node - it may e.g. be result of using node-port where scheduling more pods from our batch is impossible

@sanposhiho sanposhiho left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perf is the biggest concern - we already get a report from user, but this change will increase the overhead of this feature more, without mentioning any solution about the overhead problem

Comment thread keps/sig-scheduling/5598-opportunistic-batching/README.md

1. Node A is re-inserted into the cached sorted list. Since it is still feasible, it can host
additional pods.
2. `Score` is called for each scoring plugin against node A using the current `CycleState`. The

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I read it is that we recompute PreFilter state as in each new cycle. I had a proposal to reuse (cache) the cycleState of the previous cycle in the comment #6039 (comment), but only as a future extension

Comment thread keps/sig-scheduling/5598-opportunistic-batching/README.md

## Design Details

### Pod signature

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we must make sure unsignable is returned when cross-node scoring scheduling is used on pods (preferred pod affinity, preferred topology spread, etc)


1. Node A is re-inserted into the cached sorted list. Since it is still feasible, it can host
additional pods.
2. `Score` is called for each scoring plugin against node A using the current `CycleState`. The

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I read it is that we recompute PreFilter state as in each new cycle. I had a proposal to reuse (cache) the cycleState of the previous cycle in the comment #6039 (comment), but only as a future extension


1. Node A is re-inserted into the cached sorted list. Since it is still feasible, it can host
additional pods.
2. `Score` is called for each scoring plugin against node A using the current `CycleState`. The

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Score is called for each scoring plugin against node A using the current CycleState.

But, CycleState doesn't have any state calculated at PreScore at this point?

@sanposhiho sanposhiho Jun 17, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed internally, and we agreed that we will run PreScore of all plugins before this. This is not the optimized way, but we can consider a further optimization in the next release cycle

/cc @dom4ha @macsko

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PreFilter

PreScore?

@sanposhiho sanposhiho Jun 17, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of more summery of our internal discussion as a record (not only about this prescore stuff)

Running PreScore of all plugins is not very cheap, and we definitely need to discuss how we can make it efficient in the next release cycle. Very likely, we need to reuse the cyclestate from the last pod somehow.

Also, another discussion was whether we want to rerun Score() for all cached nodes. The current KEP says we just run Score() for the node that the last pod selected. But, that means we don't support the support of cross-node scoring. The current opportunistic batching doesn't support the default scheduling config because the default scheduling config has a preferred topology spread, which is a cross-node scoring. So, actually supporting such cross-node scoring is important to make this feature more widely usable.

We have two options to discuss in the next release cycle-

  1. If we drop the support of cross-node scoring for now, we can implement PreScoreExtension.AddPod to modify the cyclestate of the last pod. Then, the flow will be PreScoreExtension.AddPod(last pod) -> Score(selected node) -> NormalizeScore(all cached nodes)
  2. If we want to support cross-node scoring, there are two options further
    a. PreScoreExtension.AddPod(last pod) -> Score(all cached nodes) -> NormalizeScore(all cached nodes)
    b. implement a new ext point Rescore(), which will do (a) essentially but more efficiently.

We don't decide it now, but will discuss in the next release KEP cycle

@k8s-ci-robot k8s-ci-robot requested review from dom4ha and macsko June 17, 2026 11:02
@sanposhiho

Copy link
Copy Markdown
Member

/approve
/lgtm

@romanbaron is on vacation and cannot modify the KEP for this comment-

#6039 (comment)

Approving the PR because all the leads are on the same page.
We will ask him to edit it based on the agreement once he's back.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 17, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: romanbaron, sanposhiho, singh1203

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2026
@k8s-ci-robot k8s-ci-robot merged commit e2f7ae8 into kubernetes:master Jun 17, 2026
4 checks passed
@github-project-automation github-project-automation Bot moved this from Needs Review to Done in SIG Scheduling Jun 17, 2026
@k8s-ci-robot k8s-ci-robot added this to the v1.37 milestone Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

9 participants