Skip to content

Conversation

KobayashiD27
Copy link
Contributor

  • One-line PR description: updating KEP docs
  • Other comments:

This PR updates KEP-5007 to clarify the scope, motivation, and design of the proposed BindingConditions mechanism.

Key updates include:

  • Rewriting the Motivation section to emphasize general applicability beyond CDI, while retaining CDI as a motivating example.
  • Refining the Goals section to focus on readiness-aware binding and condition-based scheduling logic, rather than specific device types.
  • Rewriting the prioritization logic in the Goals section to describe general scheduling behavior based on BindingConditions.

These changes aim to make the KEP easier to review and better aligned with the feedback received during discussion. Feedback is welcome!

related to : kubernetes/kubernetes#130160

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 26, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @KobayashiD27. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels May 26, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling May 26, 2025
@k8s-ci-robot k8s-ci-robot requested review from dom4ha and macsko May 26, 2025 10:04
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 26, 2025
@KobayashiD27
Copy link
Contributor Author

cc @pohly @dom4ha
I've updated the KEP description based on the PR discussion and the current implementation. Could you please take a look when you have a moment?

@KobayashiD27
Copy link
Contributor Author

@pohly @johnbelamaric @dom4ha @macsko
Hi, could you take a look please?

@pohly pohly mentioned this pull request Jun 6, 2025
6 tasks
Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is getting closer to being ready for merging. Some suggestions and one API gap.

In scenarios where attachment occurs after scheduling, there is a risk that the resource cannot be attached at the time of attachment, causing the container to remain in the "Container Creating" state.
The mechanism is not tied to any specific hardware model or infrastructure.
It can support a wide range of use cases, including:
- Fabric-attached GPUs that require dynamic attachment via PCIe or CXL switches
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for mentioning CXL. This has indeed come up in recent discussions, and this KEP is relevant for it.

By having the scheduler wait for the fabric device to be attached, we can reschedule the pod if the attachment fails.
This approach is superior because it avoids unnecessary waiting and allows for immediate rescheduling.
While the original motivation came from fabric-attached devices, the mechanism is designed to be broadly applicable.
It can support other scenarios where resource readiness is asynchronous or failure-prone, such as remote accelerators or gang scheduling.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also 👍 for gang scheduling.

This KEP is a good building block for exploring advanced use cases.

It allows the scheduler to make binding decisions based on up-to-date readiness information, improving reliability and avoiding premature binding.

While this proposal supports fabric-attached devices, it is not limited to them.
The mechanism is designed to be general and can support other use cases where resource readiness is asynchronous or failure-prone.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been said a few times. I think you can remove the entire paragraph here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this paragraph.

External controllers (e.g., composable DRA controllers) are responsible for monitoring the `ResourceClaim` and updating the condition statuses as device preparation progresses.
This coordination allows the scheduler to make informed binding decisions without requiring tight coupling between the scheduler and device-specific logic.

### Handling ResourceSlices Upon Failure of Attachment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This same section is present a second time below. I think this here is a better place for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed duplicate sections.

```

#### AllocatedDeviceStatus Enhancements
#### DeviceRequestAllocationResult Enhancements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a section above for adding AllocationTimestamp to AllocationResult, it's currently missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right, I'll update it to include a description about AllocationTimestamp.

@github-project-automation github-project-automation bot moved this from Needs Triage to Needs Review in SIG Scheduling Jun 6, 2025
2. **Attribute Information for Fabric Devices**:
Add attribute information that clearly distinguishes fabric devices requiring attachment.
This will help in accurately identifying and managing these devices within the Kubernetes environment.
3. **Prioritize Devices Based on Readiness**:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain if we can make it a general rule without going into details of specific conditions. Do you mean it's always true that devices will lower number of conditions should be favorable or just assume that lack of conditions means the devices does not need attachment?

I think in general case it's not that simple and the number of conditions is rather an implementation detail. I think that scheduler should have more explicit information denoting a need for attachment (scale up) and possibly some information how much time it should take.

I'd rather think whether the timeout wasn't a better indicator which could be taken into account here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simple heuristic is that without binding conditions, there's no delay and thus such devices are "better" than devices with. I agree that we can improve this further by sorting by timeout: devices without binding conditions have a zero timeout, devices without an explicit timeout the default timeout, and others the specified one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, having none vs something seems good enough heuristic for the beginning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback.
Yes, the current logic prioritizes devices without BindingConditions. This is because devices without conditions are immediately usable and don't require waiting in the PreBind phase.
This prioritization is already implemented in the GatherPools() function on my PR.

@mortent
Copy link
Member

mortent commented Jun 9, 2025

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 9, 2025
@mortent
Copy link
Member

mortent commented Jun 9, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 9, 2025
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jun 10, 2025
Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Tentative API review, overall design.

@github-project-automation github-project-automation bot moved this from Needs Review to Needs Approval in SIG Scheduling Jun 10, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 10, 2025
@pohly
Copy link
Contributor

pohly commented Jun 10, 2025

I suppose this doesn't need a PRR re-review: that section was already reviewed when merging for 1.33 (#5012 (comment)) and the changes since then where mostly around API aspects, not fundamental changes of the design.

@dom4ha: okay to approve and merge?

@KobayashiD27
Copy link
Contributor Author

In the previous discussion on the implementation PR for KEP-5007 (kubernetes/kubernetes#130160), we received feedback that the API description was insufficient, that the "fail and reschedule" usage pattern mentioned in the KEP was considered an anti-pattern, and that the overall description of the KEP needed clarification.

Based on this feedback, we created this KEP update PR to make the KEP more generic and clearer, with a focus on the motivation, goals, and prioritization of device selection. These updates aim to better reflect the intent of the API, including BindingConditions, and the scheduling behavior based on them.

Given these changes, would it be necessary to conduct an API review for this KEP update PR itself? We would appreciate it if reviewers could take a look.

cc @thockin

@pohly
Copy link
Contributor

pohly commented Jun 11, 2025

Given these changes, would it be necessary to conduct an API review for this KEP update PR itself?

I think we can skip it for this PR. We can consider my review as a preliminary API review, to be ratified during the implementation review, and as you said, the API hasn't changed that much anyway since the previous PR.

1. **Set NodeSelector**: Before the `PreBind` phase, add the `NodeName` to the `ResourceClaim`'s `NodeSelector`.

If Conditions are present, the scheduler DRA plugin will perform the following steps during the `PreBind` phase:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Set NodeSelector in api-server: Before the PreBind phase, add the NodeName to the ResourceClaim's NodeSelector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have still some small corrections

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 12, 2025
Copy link
Member

@macsko macsko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, only nits

1. **Set NodeSelector**: Before the `PreBind` phase, add the `NodeName` to the `ResourceClaim`'s `NodeSelector`.

If Conditions are present, the scheduler DRA plugin will perform the following steps during the `PreBind` phase:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have still some small corrections

@kannon92
Copy link
Contributor

Just to confirm, @johnbelamaric approval is not needed here for PRR.

If so, I can mark this as no-need for this round since it was already approved for alpha. Has the design changed enough to warrant a new review?

@pohly
Copy link
Contributor

pohly commented Jun 12, 2025

I think we don't need another PRR review, the core design is still the same.

@dom4ha
Copy link
Member

dom4ha commented Jun 13, 2025

/approve for sig-scheduling

Thanks for updating this KEP!

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dom4ha, KobayashiD27, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2025
@pohly
Copy link
Contributor

pohly commented Jun 13, 2025

/lgtm

Please squash commits, I'll re-add LGTM if needed.

/hold

For confirmation that no new PRR review is needed (see also https://kubernetes.slack.com/archives/CPNHUMN74/p1749757599480619).

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jun 13, 2025
@KobayashiD27 KobayashiD27 force-pushed the dra-binding-conditions branch from 7e3a418 to 22d2875 Compare June 13, 2025 11:08
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 13, 2025
@KobayashiD27
Copy link
Contributor Author

Thanks all!
@pohly
I squash some commits, Please re-add LGTM?

KobayashiD27 and others added 4 commits June 13, 2025 20:11
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
@soltysh
Copy link
Contributor

soltysh commented Jun 13, 2025

I think we don't need another PRR review, the core design is still the same.

This is not progressing between stages, so no PRR is required at this point in time.

@pohly
Copy link
Contributor

pohly commented Jun 13, 2025

/lgtm

/hold cancel
Because PRR is indeed not needed (thanks @soltysh!).

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jun 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit bde597e into kubernetes:master Jun 13, 2025
3 of 4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 13, 2025
@github-project-automation github-project-automation bot moved this from Needs Approval to Done in SIG Scheduling Jun 13, 2025
@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Jun 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: Done
Archived in project
Development

Successfully merging this pull request may close these issues.

8 participants