Skip to content

Conversation

ZhengW22
Copy link

What type of PR is this?

What this PR does / why we need it:
This PR adds the capability to disable GPUs at the node level by applying annotations to nodes. GPUs matching the specified UUIDs will no longer be allocated to any pods.

The implementation works by setting the used count of the corresponding node GPUs to their maximum capacity when calculating nodeUsage, effectively occupying those resources. This approach maintains compatibility with scheduling logic for different types of GPU cards.

Which issue(s) this PR fixes:
No.

Special notes for your reviewer:
No.

Does this PR introduce a user-facing change?:
No.

@hami-robot hami-robot bot requested a review from chaunceyjiang July 15, 2025 08:30
@hami-robot hami-robot bot requested a review from ouyangluwei163 July 15, 2025 08:30
@github-actions github-actions bot added the kind/feature new function label Jul 15, 2025
@hami-robot hami-robot bot added the size/L label Jul 15, 2025
@ZhengW22
Copy link
Author

I recreate the pr #1154 to add the sign-off info.

@archlitchi
Copy link
Member

please fix the go-lint

@ZhengW22
Copy link
Author

~> make verify
hack/verify-all.sh

  • bash hack/../hack/verify-staticcheck.sh
    Using golangci-lint version:
    golangci-lint has version 2.2.1 built with go1.24.4 from (unknown, modified: ?, mod sum: "h1:01r5ueY3oq8gtqgA5TGtBcS+LYZ/dEzZ59/AN1NsT2E=") on (unknown)
    0 issues.
    Congratulations! All Go source files have passed staticcheck.
  • bash hack/../hack/verify-license.sh
    ++ dirname hack/../hack/verify-license.sh
  • REPO_ROOT=hack/../hack/..
  • cd hack/../hack/..
    ++ which addlicense
  • [[ /home/s123zz123/Code/Library/go/bin/addlicense == '' ]]
    ++ which addlicense
  • ADDLICENSE_BIN=/home/s123zz123/Code/Library/go/bin/addlicense
    ++ /home/s123zz123/Code/Library/go/bin/addlicense -check -ignore 'benchmarks/' -ignore 'charts/' -ignore 'docs/' -ignore 'docker/' -ignore 'examples/' -ignore 'lib/' -ignore 'libvgpu/' -ignore 'third_party/' -ignore 'vendor/' -ignore '_output/' -ignore '.github/' -ignore '/.md' -ignore '**/.yaml' -ignore '/*.yml' -ignore '/*.json' -ignore '.idea/**' .
  • missing_license_header_files=
  • [[ -n '' ]]
  • echo 'Congratulations! All files have passed license header check.'
    Congratulations! All files have passed license header check.
  • bash hack/../hack/verify-import-aliases.sh
    checking-imports:
    /home/s123zz123/Code/fork/HAMi/cmd
    /home/s123zz123/Code/fork/HAMi/cmd/device-plugin
    /home/s123zz123/Code/fork/HAMi/cmd/device-plugin/nvidia
    /home/s123zz123/Code/fork/HAMi/cmd/scheduler
    /home/s123zz123/Code/fork/HAMi/cmd/vGPUmonitor
    /home/s123zz123/Code/fork/HAMi/cmd/vGPUmonitor/noderpc
    /home/s123zz123/Code/fork/HAMi/cmd/vGPUmonitor/testcollector
    /home/s123zz123/Code/fork/HAMi/pkg
    /home/s123zz123/Code/fork/HAMi/pkg/device
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/cdi
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/info
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/mig
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/plugin
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/plugin/manager
    /home/s123zz123/Code/fork/HAMi/pkg/device-plugin/nvidiadevice/nvinternal/rm
    /home/s123zz123/Code/fork/HAMi/pkg/device/ascend
    /home/s123zz123/Code/fork/HAMi/pkg/device/cambricon
    /home/s123zz123/Code/fork/HAMi/pkg/device/common
    /home/s123zz123/Code/fork/HAMi/pkg/device/enflame
    /home/s123zz123/Code/fork/HAMi/pkg/device/hygon
    /home/s123zz123/Code/fork/HAMi/pkg/device/iluvatar
    /home/s123zz123/Code/fork/HAMi/pkg/device/kunlun
    /home/s123zz123/Code/fork/HAMi/pkg/device/metax
    /home/s123zz123/Code/fork/HAMi/pkg/device/mthreads
    /home/s123zz123/Code/fork/HAMi/pkg/device/nvidia
    /home/s123zz123/Code/fork/HAMi/pkg/k8sutil
    /home/s123zz123/Code/fork/HAMi/pkg/monitor
    /home/s123zz123/Code/fork/HAMi/pkg/monitor/nvidia
    /home/s123zz123/Code/fork/HAMi/pkg/monitor/nvidia/v0
    /home/s123zz123/Code/fork/HAMi/pkg/monitor/nvidia/v1
    /home/s123zz123/Code/fork/HAMi/pkg/oci
    /home/s123zz123/Code/fork/HAMi/pkg/scheduler
    /home/s123zz123/Code/fork/HAMi/pkg/scheduler/config
    /home/s123zz123/Code/fork/HAMi/pkg/scheduler/policy
    /home/s123zz123/Code/fork/HAMi/pkg/scheduler/routes
    /home/s123zz123/Code/fork/HAMi/pkg/util
    /home/s123zz123/Code/fork/HAMi/pkg/util/client
    /home/s123zz123/Code/fork/HAMi/pkg/util/client/testdata
    /home/s123zz123/Code/fork/HAMi/pkg/util/flag
    /home/s123zz123/Code/fork/HAMi/pkg/util/nodelock
    /home/s123zz123/Code/fork/HAMi/pkg/version

Passed import-aliases verification.

I have already fixed all issue.

@archlitchi
Copy link
Member

CC @Shouren

Copy link

codecov bot commented Jul 16, 2025

Codecov Report

❌ Patch coverage is 89.28571% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/scheduler/nodes.go 89.28% 2 Missing and 1 partial ⚠️
Flag Coverage Δ
unittests 66.41% <89.28%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
pkg/scheduler/nodes.go 86.36% <89.28%> (+1.36%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ZhengW22
Copy link
Author

Do we need 100% code coverage to new code?

@wawa0210
Copy link
Member

Do we need 100% code coverage to new code?

We recommend focusing on unit test coverage, at least new code needs to be covered

@lengrongfu
Copy link
Member

@ZhengW22
Copy link
Author

This function uses on the pod annotation and only works with nvidia gpu. My commit can use on the node annotation and support all current vendors. From excluding scope, this function guarantees that pod avoids use gpu by uuid and the purpose of my commit is that node avoids use gpu by uuid.

@ZhengW22
Copy link
Author

Hello, someone can help me to review this pr?

@wawa0210
Copy link
Member

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a feature to disable specific GPUs on a node via annotations, which is a valuable addition. The implementation is generic and supports multiple vendors, which is great.

However, I've found a few issues:

  • There's a discrepancy between the PR description and the implementation. The description states that disabled GPUs are marked as fully utilized, while the code removes them from the scheduler's view. Please clarify the intended behavior and update the description if necessary.
  • There appears to be a critical compilation error due to the use of undefined constants from the metax package.
  • The parsing of UUIDs from annotations could be more robust.
  • The unit tests for the new functionality are minimal and should be expanded to cover more scenarios.

I've left detailed comments on these points. Please address them to ensure the quality and correctness of the feature.

Comment on lines 40 to 47
var vendorUUIDMap = map[string][]string{
nvidia.GPUNoUseUUID: {nvidia.NvidiaGPUDevice},
cambricon.MLUNoUseUUID: {cambricon.CambriconMLUDevice},
hygon.DCUNoUseUUID: {hygon.HygonDCUDevice},
iluvatar.IluvatarNoUseUUID: {iluvatar.IluvatarGPUDevice},
enflame.EnflameNoUseUUID: {enflame.EnflameGPUDevice},
mthreads.MthreadsNoUseUUID: {mthreads.MthreadsGPUDevice},
metax.MetaxNoUseUUID: {metax.MetaxGPUDevice, metax.MetaxSGPUDevice},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The vendorUUIDMap uses constants from various device packages. However, metax.MetaxNoUseUUID and metax.MetaxSGPUDevice do not seem to be defined in pkg/device/metax/device.go. This will cause a compilation error. Please ensure all constants used here are defined in their respective packages.

Similarly, enflame.EnflameNoUseUUID is used, but its definition is not visible in the PR context. Please verify it exists.

docs/config.md Outdated
* `nvidia.com/nouse-gpuuuid`:
String type, supports multiple device IDs. Multiple devices should be separated by comma, ie: "GPU-XXX"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The example ie: "GPU-XXX" only shows a single device ID, but the description says it supports multiple. It would be more helpful to provide an example with multiple IDs to match the description and the Chinese documentation.

Suggested change
String type, supports multiple device IDs. Multiple devices should be separated by comma, ie: "GPU-XXX"
String type, supports multiple device IDs. Multiple devices should be separated by comma, ie: "GPU-AAA,GPU-BBB"

Comment on lines 101 to 103
for annokKey, vendor := range vendorUUIDMap {
klog.V(5).Infof("Current annokey is %s, and vendor is %v", annokKey, vendor)
if value, ok := nodeInfo.Node.Annotations[annokKey]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a typo in the variable name annokKey. It should be annoKey. The log message also contains a typo annokey.

Suggested change
for annokKey, vendor := range vendorUUIDMap {
klog.V(5).Infof("Current annokey is %s, and vendor is %v", annokKey, vendor)
if value, ok := nodeInfo.Node.Annotations[annokKey]; ok {
for annoKey, vendor := range vendorUUIDMap {
klog.V(5).Infof("Current annoKey is %s, and vendor is %v", annoKey, vendor)
if value, ok := nodeInfo.Node.Annotations[annoKey]; ok {

Comment on lines 106 to 117
for _, disableGPUUUID := range disableGPUUUIDList {
disableGPUUUIDVendorMap[disableGPUUUID] = vendor
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The parsing of comma-separated UUIDs can be made more robust. If the annotation value contains extra spaces or empty parts (e.g., "uuid1, ,uuid2"), strings.Split will produce empty strings in the list, which might lead to unexpected behavior. It's good practice to trim whitespace from each UUID and ignore empty ones.

for _, disableGPUUUID := range disableGPUUUIDList {
					if id := strings.TrimSpace(disableGPUUUID); id != "" {
						disableGPUUUIDVendorMap[id] = vendor
					}
				}

Comment on lines 329 to 424
func Test_rmDeviceByNodeAnnotation(t *testing.T) {
id1 := "60151478-4709-4242-a8c1-a944252d194b"
type args struct {
nodeInfo *util.NodeInfo
}
tests := []struct {
name string
args args
want []util.DeviceInfo
}{
{
name: "Test remove device",
args: args{
nodeInfo: &util.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: id1}}},
Devices: []util.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
},
},
want: []util.DeviceInfo{},
},
{
name: "Test no removing device",
args: args{
nodeInfo: &util.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{"test-key": ""}}},
Devices: []util.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
},
},
want: []util.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if got := rmDeviceByNodeAnnotation(tt.args.nodeInfo); !reflect.DeepEqual(got, tt.want) {
t.Errorf("rmDeviceByNodeAnnotation() = %v, want %v", got, tt.want)
}
})
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The test cases for rmDeviceByNodeAnnotation are not comprehensive enough. Please consider adding more tests to cover the following scenarios:

  • Multiple UUIDs in the annotation value.
  • Multiple devices on the node, with some being removed and some not.
  • Multiple annotations for different vendors on the same node.
  • Malformed annotation values (e.g., with extra spaces or empty parts like "uuid1,,uuid2").
  • A case where a UUID matches but the device vendor does not.

@ZhengW22
Copy link
Author

ZhengW22 commented Aug 5, 2025

@Shouren Hello, I add node nouse-id docs and unit test case. If you have spare time, please help me to check those.

"github.com/Project-HAMi/HAMi/pkg/device/nvidia"
)

var vendorUUIDMap = map[string][]string{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZhengW22 It is a map mapping from a key in annotation to vendor Name, the variable name vendorUUIDMap make me confused.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I change the map name from vendorUUIDMap to vendorNoUseAnnoKeyMap?

}

func rmDeviceByNodeAnnotation(nodeInfo *util.NodeInfo) []util.DeviceInfo {
disableGPUUUIDVendorMap := make(map[string][]string)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZhengW22 I prefer a map[vendorName]map[uuid]bool map which avoiding potential UUID conflict

@Shouren
Copy link
Collaborator

Shouren commented Aug 6, 2025

@Shouren Hello, I add node nouse-id docs and unit test case. If you have spare time, please help me to check those.

@ZhengW22 The docs in docs directory are moving to website and changes to this directory is not allowed now, so please submit a PR to website to update the docs.

@ZhengW22
Copy link
Author

ZhengW22 commented Aug 8, 2025

@Shouren Hello, I add node nouse-id docs and unit test case. If you have spare time, please help me to check those.

@ZhengW22 The docs in docs directory are moving to website and changes to this directory is not allowed now, so please submit a PR to website to update the docs.

Do you mean i should delete md change in this project? And I only find English version of the documents in this project https://github.com/Project-HAMi/website, does this mean that I only need to submit the English version of the documentation?

@ZhengW22
Copy link
Author

ZhengW22 commented Aug 8, 2025

@Shouren Hello, I modify the code logic according to your suggestions.

@Shouren
Copy link
Collaborator

Shouren commented Aug 11, 2025

@Shouren Hello, I add node nouse-id docs and unit test case. If you have spare time, please help me to check those.

@ZhengW22 The docs in docs directory are moving to website and changes to this directory is not allowed now, so please submit a PR to website to update the docs.

Do you mean i should delete md change in this project? And I only find English version of the documents in this project https://github.com/Project-HAMi/website, does this mean that I only need to submit the English version of the documentation?

@Nimbus318 Can you tell @ZhengW22 how to add Chinese version of the docs in website repo ?

@Nimbus318
Copy link
Contributor

Nimbus318 commented Aug 11, 2025

@ZhengW22 You need to update both the corresponding English docs in the docs/userguide directory and the corresponding Chinese docs in the i18n/zh/docusaurus-plugin-content-docs/current/userguide directory.
After these changes, they will appear on https://project-hami.io/docs/next/ and will be included in the versioned docs for future releases.
No changes are needed for other already versioned directories.

@ZhengW22
Copy link
Author

@Shouren , Hello, I already created the new documents pr in project website.
Project-HAMi/website#100

@Shouren
Copy link
Collaborator

Shouren commented Aug 20, 2025

@Shouren , Hello, I already created the new documents pr in project website. Project-HAMi/website#100

@ZhengW22 Please remove the docs in this PR and i will check it later.

@ZhengW22
Copy link
Author

@Shouren Hello, I have already the doc file.

@hami-robot hami-robot bot added the lgtm label Aug 22, 2025
Copy link
Collaborator

@Shouren Shouren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@hami-robot
Copy link
Contributor

hami-robot bot commented Aug 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Shouren, ZhengW22
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@archlitchi
Copy link
Member

please rebase the code to pass the CI

ZhengW22 added 8 commits September 10, 2025 15:13
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
…." This reverts commit 8ba1840.

Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
@hami-robot
Copy link
Contributor

hami-robot bot commented Sep 10, 2025

New changes are detected. LGTM label has been removed.

Signed-off-by: ZhengW22 <s123zz123@gmail.com>
@ZhengW22
Copy link
Author

@Shouren @archlitchi Hello, I have rebased the code and fix some problems.

@ZhengW22
Copy link
Author

@Shouren Hello, can you help me to review the new code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants