Add MPS control daemon support to k8s device plugin #789
Merged
yeazelm merged 1 commit into bottlerocket-os:develop on Jan 16, 2026
This was referenced Dec 31, 2025
vigh-m
reviewed
Jan 6, 2026
packages/nvidia-k8s-device-plugin/nvidia-mps-control-daemon-exec-start-conf
bcressey
reviewed
Jan 9, 2026
packages/nvidia-k8s-device-plugin/nvidia-mps-control-daemon.service
packages/nvidia-k8s-device-plugin/nvidia-mps-control-daemon-exec-start-conf
KCSesh
reviewed
Jan 13, 2026
packages/nvidia-k8s-device-plugin/nvidia-mps-control-daemon.service
Contributor
Author
^ Updated the code to use the Type changes (thanks @KCSesh!) and responded to a few other comments. There is also a new change that does the MIG and MPS incompatibility check in the template rendering. It echoes a warning that these don't work together. This can easily be removed if NVIDIA removes this incompatibility in a future release of their device plugin.
bcressey
reviewed
Jan 15, 2026
Add support for NVIDIA Multi-Process Service (MPS) control daemon, including service configuration and device plugin updates. Signed-off-by: Matthew Yeazel <yeazelm@amazon.com>
Contributor
Author
^ Updated to address comments around
bcressey
approved these changes
Jan 15, 2026
KCSesh
approved these changes
Jan 15, 2026
Issue number:
Related to: bottlerocket-os/bottlerocket#4673
Description of changes:
This builds the mps-control-daemon binary from the device plugin, which enables MPS support. The hardcoded paths have to be patched for Bottlerocket because the device plugin assumes it can write to /, which doesn't work on Bottlerocket.
This change also adds a new service that starts this binary when the settings request it. Otherwise the service runs sleep infinity, so that systemd can try-restart it when the MPS settings change.
The change should be safe to take without the bottlerocket-os/bottlerocket-kernel-kit#347 change or the upcoming settings change, but the daemon will not work without the kmod update and the settings being properly set.
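The start-or-idle behaviour described above can be sketched as a small shell function. This is an illustration under assumptions, not the actual drop-in from the PR: the function name is hypothetical, the strategy value mps is assumed, and the daemon is referred to only by the binary name mentioned above, without its real path or arguments.

```shell
#!/bin/sh
# Hypothetical helper: prints the command the service would run for a
# given sharing strategy. The real logic lives in a rendered systemd
# drop-in that is not reproduced here.
mps_exec_command() {
    strategy="$1"
    if [ "$strategy" = "mps" ]; then
        # Binary name from the PR description; path and flags omitted.
        echo "mps-control-daemon"
    else
        # Idle placeholder so the unit stays active and can be
        # try-restarted when the settings change.
        echo "sleep infinity"
    fi
}

mps_exec_command "mps"
mps_exec_command "none"
```

Keeping the unit active with a placeholder process matters because systemctl try-restart only restarts units that are already running.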
Testing done:
Built images with the kernel change and the settings changes, and validated that a node comes up with MPS working if it is set in user data. Also validated that the services are restarted and that MPS can be enabled at runtime.
Setting in userdata for a g6.2xlarge which only has one GPU
eksctl config snippet for setting it at the beginning:
Results in a node reporting nvidia.com/gpu.shared:
Setting the MPS after boot
Start with a node with no configuration for MPS:
The node shows one GPU:
Then set MPS:
Now check the rest of the system:
And the node shows the empty nvidia.com/gpu offering but now a shared one:
This is a known edge case and is similar to how timeslicing works. In order to avoid old resources, you'd need to start with the user-data approach.
Shifting to rename-by-default=false (apiclient set settings.kubelet-device-plugins.nvidia.mps.rename-by-default=false) will have the original nvidia.com/gpu resource instead:
And finally, setting sharing to none disables MPS:
And the resource goes back down to 1.
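The runtime toggles above can be collected into a small dry-run helper. This is a sketch only: the rename-by-default key is taken from the command quoted above, while the device-sharing-strategy key and its mps and none values are assumptions about the settings model, and the helper name is hypothetical. It prints the apiclient invocations instead of running them.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints an apiclient command for an NVIDIA
# device plugin setting instead of executing it.
nvidia_setting_cmd() {
    key="$1"
    value="$2"
    echo "apiclient set settings.kubelet-device-plugins.nvidia.${key}=${value}"
}

# Assumed key and value for enabling MPS (not confirmed by this text).
nvidia_setting_cmd "device-sharing-strategy" "mps"
# Key taken from the description above.
nvidia_setting_cmd "mps.rename-by-default" "false"
# Assumed key and value for disabling sharing again.
nvidia_setting_cmd "device-sharing-strategy" "none"
```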
With the incompatibility checks in the template, you can see the messages preventing both MIG and MPS from running at the same time:
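The messages themselves are not reproduced in this text. As a rough sketch of the kind of guard involved, the function below refuses the MIG plus MPS combination and prints a warning. Everything in it is an assumption for illustration: the function name, the argument values mps and mig, and the warning wording are hypothetical and are not taken from the rendered template.

```shell
#!/bin/sh
# Hypothetical guard; the real check lives in the rendered template.
# $1: sharing strategy (assumed value "mps")
# $2: partitioning strategy (assumed value "mig")
mig_mps_compatible() {
    sharing="$1"
    partitioning="$2"
    if [ "$sharing" = "mps" ] && [ "$partitioning" = "mig" ]; then
        # Illustrative wording only; not the message from the PR.
        echo "warning: MIG and MPS cannot be used together" >&2
        return 1
    fi
    return 0
}

if mig_mps_compatible "mps" "mig"; then
    echo "MPS may start"
else
    echo "MPS not started"
fi
```

Keeping the check in one small function makes it easy to delete if the device plugin later supports the combination.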
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.