-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Update KEP-5007 DRA Device Binding Conditions #5342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update KEP-5007 DRA Device Binding Conditions #5342
Conversation
Hi @KobayashiD27. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@pohly @johnbelamaric @dom4ha @macsko |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is getting closer to being ready for merging. Some suggestions and one API gap.
In scenarios where attachment occurs after scheduling, there is a risk that the resource cannot be attached at the time of attachment, causing the container to remain in the "Container Creating" state. | ||
The mechanism is not tied to any specific hardware model or infrastructure. | ||
It can support a wide range of use cases, including: | ||
- Fabric-attached GPUs that require dynamic attachment via PCIe or CXL switches |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for mentioning CXL. This has indeed come up in recent discussions, and this KEP is relevant for it.
By having the scheduler wait for the fabric device to be attached, we can reschedule the pod if the attachment fails. | ||
This approach is superior because it avoids unnecessary waiting and allows for immediate rescheduling. | ||
While the original motivation came from fabric-attached devices, the mechanism is designed to be broadly applicable. | ||
It can support other scenarios where resource readiness is asynchronous or failure-prone, such as remote accelerators or gang scheduling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also 👍 for gang scheduling.
This KEP is a good building block for exploring advanced use cases.
It allows the scheduler to make binding decisions based on up-to-date readiness information, improving reliability and avoiding premature binding. | ||
|
||
While this proposal supports fabric-attached devices, it is not limited to them. | ||
The mechanism is designed to be general and can support other use cases where resource readiness is asynchronous or failure-prone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been said a few times. I think you can remove the entire paragraph here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this paragraph.
keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
Outdated
Show resolved
Hide resolved
External controllers (e.g., composable DRA controllers) are responsible for monitoring the `ResourceClaim` and updating the condition statuses as device preparation progresses. | ||
This coordination allows the scheduler to make informed binding decisions without requiring tight coupling between the scheduler and device-specific logic. | ||
|
||
### Handling ResourceSlices Upon Failure of Attachment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This same section is present a second time below. I think this here is a better place for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed duplicate sections.
keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
Outdated
Show resolved
Hide resolved
``` | ||
|
||
#### AllocatedDeviceStatus Enhancements | ||
#### DeviceRequestAllocationResult Enhancements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a section above for adding AllocationTimestamp
to AllocationResult
, it's currently missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's right, I'll update it to include a description about AllocationTimestamp
.
2. **Attribute Information for Fabric Devices**: | ||
Add attribute information that clearly distinguishes fabric devices requiring attachment. | ||
This will help in accurately identifying and managing these devices within the Kubernetes environment. | ||
3. **Prioritize Devices Based on Readiness**: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not certain if we can make it a general rule without going into details of specific conditions. Do you mean it's always true that devices will lower number of conditions should be favorable or just assume that lack of conditions means the devices does not need attachment?
I think in general case it's not that simple and the number of conditions is rather an implementation detail. I think that scheduler should have more explicit information denoting a need for attachment (scale up) and possibly some information how much time it should take.
I'd rather think whether the timeout wasn't a better indicator which could be taken into account here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simple heuristic is that without binding conditions, there's no delay and thus such devices are "better" than devices with. I agree that we can improve this further by sorting by timeout: devices without binding conditions have a zero timeout, devices without an explicit timeout the default timeout, and others the specified one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, having none vs something seems good enough heuristic for the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback.
Yes, the current logic prioritizes devices without BindingConditions. This is because devices without conditions are immediately usable and don't require waiting in the PreBind phase.
This prioritization is already implemented in the GatherPools() function on my PR.
/wg device-management |
/ok-to-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Tentative API review, overall design.
I suppose this doesn't need a PRR re-review: that section was already reviewed when merging for 1.33 (#5012 (comment)) and the changes since then where mostly around API aspects, not fundamental changes of the design. @dom4ha: okay to approve and merge? |
In the previous discussion on the implementation PR for KEP-5007 (kubernetes/kubernetes#130160), we received feedback that the API description was insufficient, that the "fail and reschedule" usage pattern mentioned in the KEP was considered an anti-pattern, and that the overall description of the KEP needed clarification. Based on this feedback, we created this KEP update PR to make the KEP more generic and clearer, with a focus on the motivation, goals, and prioritization of device selection. These updates aim to better reflect the intent of the API, including BindingConditions, and the scheduling behavior based on them. Given these changes, would it be necessary to conduct an API review for this KEP update PR itself? We would appreciate it if reviewers could take a look. cc @thockin |
I think we can skip it for this PR. We can consider my review as a preliminary API review, to be ratified during the implementation review, and as you said, the API hasn't changed that much anyway since the previous PR. |
keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
Outdated
Show resolved
Hide resolved
1. **Set NodeSelector**: Before the `PreBind` phase, add the `NodeName` to the `ResourceClaim`'s `NodeSelector`. | ||
|
||
If Conditions are present, the scheduler DRA plugin will perform the following steps during the `PreBind` phase: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Set NodeSelector in api-server: Before the
PreBind
phase, add theNodeName
to theResourceClaim
'sNodeSelector
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have still some small corrections
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, only nits
keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
Outdated
Show resolved
Hide resolved
keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
Outdated
Show resolved
Hide resolved
keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
Outdated
Show resolved
Hide resolved
keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
Outdated
Show resolved
Hide resolved
1. **Set NodeSelector**: Before the `PreBind` phase, add the `NodeName` to the `ResourceClaim`'s `NodeSelector`. | ||
|
||
If Conditions are present, the scheduler DRA plugin will perform the following steps during the `PreBind` phase: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have still some small corrections
Just to confirm, @johnbelamaric approval is not needed here for PRR. If so, I can mark this as no-need for this round since it was already approved for alpha. Has the design changed enough to warrant a new review? |
I think we don't need another PRR review, the core design is still the same. |
/approve for sig-scheduling Thanks for updating this KEP! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dom4ha, KobayashiD27, pohly The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm Please squash commits, I'll re-add LGTM if needed. /hold For confirmation that no new PRR review is needed (see also https://kubernetes.slack.com/archives/CPNHUMN74/p1749757599480619). |
7e3a418
to
22d2875
Compare
Thanks all! |
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
This is not progressing between stages, so no PRR is required at this point in time. |
This PR updates KEP-5007 to clarify the scope, motivation, and design of the proposed BindingConditions mechanism.
Key updates include:
These changes aim to make the KEP easier to review and better aligned with the feedback received during discussion. Feedback is welcome!
related to : kubernetes/kubernetes#130160