Skip to content

Conversation

JesseStutler
Copy link
Member

@JesseStutler JesseStutler commented Oct 31, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Background

Inspired by #3798 , introduce DRA into volcano. And #4152 have added PreBind implementation, so DRA can base on this PR and add DRA plugin implementation into predicates plugin.

Verfication

I followed up https://github.com/kubernetes-sigs/dra-example-driver to deploy an example resource driver and deploy the demo yaml gpu-test{1,2,3,4,5}.yaml, the pod can be successfully scheduled:
image
GPU information printed in containers:
image
And allocation result in ResourceClaim is successfully updated:
image

@volcano-sh-bot volcano-sh-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 31, 2024
@googs1025
Copy link
Member

/cc

@@ -25,6 +25,7 @@ custom:
leader_elect_enable: false
enabled_admissions: "/jobs/mutate,/jobs/validate,/podgroups/mutate,/pods/validate,/pods/mutate,/queues/mutate,/queues/validate"
colocation_enable: false
feature_gates: DynamicResourceAllocation=true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to enable this feature-gate by default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think? Do we need to enable it by default? I think that if users need to enable dra plugin, they also need to reconfigure the scheduler plugins configmap to enable it, it also looks like featuregate. If we don't enable this feture-gate, users need to do reconfiguring the scheduler plugins configmap and enabling the feature-gate two steps.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK we need not to enable this feature-gate by default, I will add some informers in schedulerCache, some users may need to use k8s below v1.31, if there are no APIs like DeviceClass, ResourceClaim, etc registered, it will cause some errors.

@@ -0,0 +1,808 @@
/*
Copyright 2022 The Kubernetes Authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this part directly imported into the scheduler code? Or has it been modified and adapted (I don't know much about it)? If not, should it be marked in the code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just changed the New to NewDRAPlugin and deleted EventsToRegister and some related codes with it, I think volcano doesn't need them right now. These codes are copied from the kubernetes repo, so the license needs to keep.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understand, if so, if the upstream code changes, do I need to make the same changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, before the dynamicResources struct changed to DynamicResources, we can only copy the whole codes related into volcano, otherwise we can't directly call those extension points. It may cause us to miss tracking some repaired codes, but if we can access the DynamicResources struct in v1.32, the problem will be solved.

@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 31, 2024
@JesseStutler JesseStutler force-pushed the dra_dev branch 2 times, most recently from de77070 to b120b42 Compare November 1, 2024 09:37
@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 16, 2024
@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 11, 2024
@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 26, 2024
@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 9, 2025
Copy link

stale bot commented Apr 25, 2025

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2025
@hwdef
Copy link
Member

hwdef commented May 5, 2025

still need

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 5, 2025
@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 15, 2025
@JesseStutler JesseStutler changed the title feature: Add dynamic resource allocation(DRA) plugin [WIP]feature: Add dynamic resource allocation(DRA) plugin May 15, 2025
@JesseStutler JesseStutler force-pushed the dra_dev branch 2 times, most recently from 4462d05 to c3d892c Compare May 23, 2025 08:51
@JesseStutler JesseStutler requested review from googs1025 May 23, 2025 09:26
@Monokaix Monokaix requested a review from Copilot May 23, 2025 09:31
@Monokaix
Copy link
Member

Great job!

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces Dynamic Resource Allocation (DRA) support into the Volcano scheduler by wiring a new DRA plugin into the predicate phase, extending cache and framework handle to supply a shared DRAManager, and updating tests and installer manifests to enable and verify the feature.

  • Add builders for DRA objects (ResourceClaim, ResourceSlice, DeviceClass, DeviceRequest) and a helper to create pods with resource claims.
  • Integrate SharedDRAManager into scheduler cache, framework handle, and predicates plugin lifecycle (PreFilter, Filter, Reserve, Unreserve, PreBind, PreBindRollBack).
  • Update installer manifests, Helm charts, Makefile, and end-to-end (E2E) tests to enable the DynamicResourceAllocation feature gate and required RBAC.

Reviewed Changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/scheduler/util/test_utils.go Added DRA builders and BuildPodWithResourceClaim
pkg/scheduler/uthelper/helper.go Populated cache with DeviceClasses, ResourceClaims, ResourceSlices
pkg/scheduler/plugins/util/k8s/framework.go Added WithSharedDRAManager option to pass DRAManager into plugins
pkg/scheduler/plugins/predicates/predicates.go Wired DRA plugin into PreFilter, Filter, Reserve, Unreserve, PreBind
pkg/scheduler/cache/cache.go Initialized and exposed sharedDRAManager in the scheduler cache
pkg/scheduler/actions/allocate/allocate.go Left TODO for PostFilter extension point for DRA
Comments suppressed due to low confidence (1)

pkg/scheduler/util/test_utils.go:101

  • Fix typo in comment: change 'builts' to 'builds'.
// BuildPodWithPVC builts Pod object with pvc volume

Signed-off-by: JesseStutler <chenzicong4@huawei.com>
@Monokaix
Copy link
Member

/approve

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2025
@wangyang0616
Copy link
Member

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label May 27, 2025
@volcano-sh-bot volcano-sh-bot merged commit c3cba84 into volcano-sh:master May 27, 2025
18 checks passed
dongjoon-hyun added a commit to apache/spark that referenced this pull request Jul 1, 2025
### What changes were proposed in this pull request?

This PR aims to upgrade `Volcano` to 1.12.1 in K8s integration test document and GA job.

### Why are the changes needed?

To bring the latest improvements and security fixes.
- https://github.com/volcano-sh/volcano/releases/tag/v1.12.1
  - volcano-sh/volcano#4336
- https://github.com/volcano-sh/volcano/releases/tag/v1.12.0
  - volcano-sh/volcano#4099
  - volcano-sh/volcano#3799
  - volcano-sh/volcano#4207
  - GHSA-hg79-fw4p-25p8

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #51343 from dongjoon-hyun/SPARK-52639.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
asl3 pushed a commit to asl3/spark that referenced this pull request Jul 14, 2025
### What changes were proposed in this pull request?

This PR aims to upgrade `Volcano` to 1.12.1 in K8s integration test document and GA job.

### Why are the changes needed?

To bring the latest improvements and security fixes.
- https://github.com/volcano-sh/volcano/releases/tag/v1.12.1
  - volcano-sh/volcano#4336
- https://github.com/volcano-sh/volcano/releases/tag/v1.12.0
  - volcano-sh/volcano#4099
  - volcano-sh/volcano#3799
  - volcano-sh/volcano#4207
  - GHSA-hg79-fw4p-25p8

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51343 from dongjoon-hyun/SPARK-52639.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025
### What changes were proposed in this pull request?

This PR aims to upgrade `Volcano` to 1.12.1 in K8s integration test document and GA job.

### Why are the changes needed?

To bring the latest improvements and security fixes.
- https://github.com/volcano-sh/volcano/releases/tag/v1.12.1
  - volcano-sh/volcano#4336
- https://github.com/volcano-sh/volcano/releases/tag/v1.12.0
  - volcano-sh/volcano#4099
  - volcano-sh/volcano#3799
  - volcano-sh/volcano#4207
  - GHSA-hg79-fw4p-25p8

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51343 from dongjoon-hyun/SPARK-52639.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants