-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feature: Add dynamic resource allocation(DRA) plugin #3799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/cc |
@@ -25,6 +25,7 @@ custom: | |||
leader_elect_enable: false | |||
enabled_admissions: "/jobs/mutate,/jobs/validate,/podgroups/mutate,/pods/validate,/pods/mutate,/queues/mutate,/queues/validate" | |||
colocation_enable: false | |||
feature_gates: DynamicResourceAllocation=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to enable this feature-gate by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think? Do we need to enable it by default? I think that if users need to enable dra plugin, they also need to reconfigure the scheduler plugins configmap to enable it, it also looks like featuregate. If we don't enable this feture-gate, users need to do reconfiguring the scheduler plugins configmap and enabling the feature-gate two steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK we need not to enable this feature-gate by default, I will add some informers in schedulerCache, some users may need to use k8s below v1.31, if there are no APIs like DeviceClass
, ResourceClaim
, etc registered, it will cause some errors.
@@ -0,0 +1,808 @@ | |||
/* | |||
Copyright 2022 The Kubernetes Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this part directly imported into the scheduler code? Or has it been modified and adapted (I don't know much about it)? If not, should it be marked in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just changed the New
to NewDRAPlugin
and deleted EventsToRegister
and some related codes with it, I think volcano doesn't need them right now. These codes are copied from the kubernetes repo, so the license needs to keep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understand, if so, if the upstream code changes, do I need to make the same changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, before the dynamicResources
struct changed to DynamicResources
, we can only copy the whole codes related into volcano, otherwise we can't directly call those extension points. It may cause us to miss tracking some repaired codes, but if we can access the DynamicResources
struct in v1.32, the problem will be solved.
de77070
to
b120b42
Compare
b120b42
to
72d27ea
Compare
72d27ea
to
d57bdeb
Compare
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
still need |
4462d05
to
c3d892c
Compare
Great job! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces Dynamic Resource Allocation (DRA) support into the Volcano scheduler by wiring a new DRA plugin into the predicate phase, extending cache and framework handle to supply a shared DRAManager, and updating tests and installer manifests to enable and verify the feature.
- Add builders for DRA objects (ResourceClaim, ResourceSlice, DeviceClass, DeviceRequest) and a helper to create pods with resource claims.
- Integrate SharedDRAManager into scheduler cache, framework handle, and predicates plugin lifecycle (PreFilter, Filter, Reserve, Unreserve, PreBind, PreBindRollBack).
- Update installer manifests, Helm charts, Makefile, and end-to-end (E2E) tests to enable the
DynamicResourceAllocation
feature gate and required RBAC.
Reviewed Changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
pkg/scheduler/util/test_utils.go | Added DRA builders and BuildPodWithResourceClaim |
pkg/scheduler/uthelper/helper.go | Populated cache with DeviceClasses, ResourceClaims, ResourceSlices |
pkg/scheduler/plugins/util/k8s/framework.go | Added WithSharedDRAManager option to pass DRAManager into plugins |
pkg/scheduler/plugins/predicates/predicates.go | Wired DRA plugin into PreFilter, Filter, Reserve, Unreserve, PreBind |
pkg/scheduler/cache/cache.go | Initialized and exposed sharedDRAManager in the scheduler cache |
pkg/scheduler/actions/allocate/allocate.go | Left TODO for PostFilter extension point for DRA |
Comments suppressed due to low confidence (1)
pkg/scheduler/util/test_utils.go:101
- Fix typo in comment: change 'builts' to 'builds'.
// BuildPodWithPVC builts Pod object with pvc volume
Signed-off-by: JesseStutler <chenzicong4@huawei.com>
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Monokaix The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
### What changes were proposed in this pull request? This PR aims to upgrade `Volcano` to 1.12.1 in K8s integration test document and GA job. ### Why are the changes needed? To bring the latest improvements and security fixes. - https://github.com/volcano-sh/volcano/releases/tag/v1.12.1 - volcano-sh/volcano#4336 - https://github.com/volcano-sh/volcano/releases/tag/v1.12.0 - volcano-sh/volcano#4099 - volcano-sh/volcano#3799 - volcano-sh/volcano#4207 - GHSA-hg79-fw4p-25p8 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #51343 from dongjoon-hyun/SPARK-52639. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? This PR aims to upgrade `Volcano` to 1.12.1 in K8s integration test document and GA job. ### Why are the changes needed? To bring the latest improvements and security fixes. - https://github.com/volcano-sh/volcano/releases/tag/v1.12.1 - volcano-sh/volcano#4336 - https://github.com/volcano-sh/volcano/releases/tag/v1.12.0 - volcano-sh/volcano#4099 - volcano-sh/volcano#3799 - volcano-sh/volcano#4207 - GHSA-hg79-fw4p-25p8 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#51343 from dongjoon-hyun/SPARK-52639. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? This PR aims to upgrade `Volcano` to 1.12.1 in K8s integration test document and GA job. ### Why are the changes needed? To bring the latest improvements and security fixes. - https://github.com/volcano-sh/volcano/releases/tag/v1.12.1 - volcano-sh/volcano#4336 - https://github.com/volcano-sh/volcano/releases/tag/v1.12.0 - volcano-sh/volcano#4099 - volcano-sh/volcano#3799 - volcano-sh/volcano#4207 - GHSA-hg79-fw4p-25p8 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#51343 from dongjoon-hyun/SPARK-52639. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Background
Inspired by #3798 , introduce DRA into volcano. And #4152 have added
PreBind
implementation, so DRA can base on this PR and add DRA plugin implementation into predicates plugin.Verfication
I followed up https://github.com/kubernetes-sigs/dra-example-driver to deploy an example resource driver and deploy the demo yaml gpu-test{1,2,3,4,5}.yaml, the pod can be successfully scheduled:



GPU information printed in containers:
And allocation result in ResourceClaim is successfully updated: