feat: add nvidia time-slicing #169

Merged
merged 1 commit into from
Oct 8, 2024

@KCSesh KCSesh commented Oct 1, 2024

Issue number:

Related:

Description of changes:
Adds NVIDIA time-slicing settings to the core-kit and points it at the latest settings-sdk.

Testing done:

  1. Instance joined the cluster
NAME                                           STATUS   ROLES    AGE   VERSION
ip-XXXX.us-west-2.compute.internal   Ready    <none>   15h   v1.29.5-eks-1109419
  1. Model Default:
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-sharing-strategy": "none",
        "pass-device-specs": true
      }
    }
  }
}
  1. Model Updates:
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.device-sharing-strategy=time-slicing settings.kubelet-device-plugins.nvidia.time-slicing.replicas=2 settings.kubelet-device-plugins.nvidia.time-slicing.rename-by-default=true settings.kubelet-device-plugins.nvidia.time-slicing.fail-requests-greater-than-one=true
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-sharing-strategy": "none",
        "pass-device-specs": true,
        "time-slicing": {
          "fail-requests-greater-than-one": true,
          "rename-by-default": true,
          "replicas": 2
        }
      }
    }
  }
}
  1. Bounded check:
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.time-slicing.replicas=1
Failed to change settings: Failed PATCH request to '/settings/keypair?tx=apiclient-set-Ne2SavnsTF7Fweq0': Status 400 when PATCHing /settings/keypair?tx=apiclient-set-Ne2SavnsTF7Fweq0: Unable to match your input to the data model.  We may not have enough type information.  Please try the --json input form.  Cause: Error during deserialization: integer out of range, expected it to be between 2 and 2147483647 at line 1 column 107
bash-5.1#
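The bounds reported in the error above (2 through 2147483647) can be checked before calling `apiclient`. The following is a minimal sketch of such a pre-flight check; `valid_replicas` is a hypothetical helper name, not part of Bottlerocket, and the real validation happens during deserialization in the API server.

```shell
# Hypothetical pre-flight check; the bounds are copied from the error message above.
valid_replicas() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;          # reject empty or non-numeric input
  esac
  [ "${#1}" -le 10 ] || return 1      # reject digit strings too long to be in range
  [ "$1" -ge 2 ] && [ "$1" -le 2147483647 ]
}

replicas=2
if valid_replicas "$replicas"; then
  echo "replicas=$replicas is within bounds"
else
  echo "replicas=$replicas is out of range (expected 2..2147483647)" >&2
fi
```

A value of 1 is rejected because a single replica would not share the GPU at all.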
  1. Files generated:
bash-5.1# cat /etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: volume-mounts
    deviceIDStrategy: index
sharing:
  timeSlicing:
    renameByDefault: true
    failRequestsGreaterThanOne: true
    resources:
    - name: "nvidia.com/gpu"
      replicas: 2
nvidia-device-plugin[360007]:   "resources": {
nvidia-device-plugin[360007]:     "gpus": [
nvidia-device-plugin[360007]:       {
nvidia-device-plugin[360007]:         "pattern": "*",
nvidia-device-plugin[360007]:         "name": "nvidia.com/gpu"
nvidia-device-plugin[360007]:       }
nvidia-device-plugin[360007]:     ]
nvidia-device-plugin[360007]:   },
nvidia-device-plugin[360007]:   "sharing": {
nvidia-device-plugin[360007]:     "timeSlicing": {
nvidia-device-plugin[360007]:       "renameByDefault": true,
nvidia-device-plugin[360007]:       "failRequestsGreaterThanOne": true,
nvidia-device-plugin[360007]:       "resources": [
nvidia-device-plugin[360007]:         {
nvidia-device-plugin[360007]:           "name": "nvidia.com/gpu",
nvidia-device-plugin[360007]:           "rename": "nvidia.com/gpu.shared",
nvidia-device-plugin[360007]:           "devices": "all",
nvidia-device-plugin[360007]:           "replicas": 2
nvidia-device-plugin[360007]:         }
nvidia-device-plugin[360007]:       ]
nvidia-device-plugin[360007]:     }
nvidia-device-plugin[360007]:   }
nvidia-device-plugin[360007]: }
  1. Time slicing on 1 instance with 1 GPU with 2 containers:
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.device-sharing-strategy=time-slicing settings.kubelet-device-plugins.nvidia.time-slicing.replicas=2 
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-sharing-strategy": "time-slicing",
        "pass-device-specs": true,
        "time-slicing": {
          "replicas": 2
        }
      }
    }
  }
}

bash-5.1# cat /etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: volume-mounts
    deviceIDStrategy: index
sharing:
  timeSlicing:
    renameByDefault: true
    failRequestsGreaterThanOne: true
    resources:
    - name: "nvidia.com/gpu"
      replicas: 2
bash-5.1#

➜  kubectl apply -f node-1-gpu-2.yaml
pod/gpu-test-2-pod-node-one created
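The contents of `node-1-gpu-2.yaml` are not included in this description. As a rough sketch only, a pod that requests one time-sliced replica could be generated as below. It assumes `renameByDefault` is in effect (as in the generated `settings.yaml` above), so the resource is advertised as `nvidia.com/gpu.shared`; without renaming it would remain `nvidia.com/gpu`. The helper name, container name, image, and command are all hypothetical; only the pod name comes from the transcript.

```shell
# Hypothetical helper: writes a pod manifest requesting one time-sliced GPU replica.
# $1 = output path, $2 = container image (assumed; the PR does not name one).
write_gpu_pod_manifest() {
  cat > "$1" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-2-pod-node-one
spec:
  containers:
  - name: gpu-test
    image: $2
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu.shared: 1
EOF
}

# Placeholder image name; substitute a real CUDA-capable image.
write_gpu_pod_manifest "${TMPDIR:-/tmp}/gpu-pod-example.yaml" "example.com/cuda-test:latest"
# Then: kubectl apply -f "${TMPDIR:-/tmp}/gpu-pod-example.yaml"
```

The limit is kept at 1 because `failRequestsGreaterThanOne: true` causes requests for more than one shared replica to fail.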

➜  kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
gpu-test-1-pod-node-one   1/1     Running   0          13m
gpu-test-2-pod-node-one   1/1     Running   0          3s
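Both pods running on a single-GPU node is consistent with the GPU being advertised as two replicas. One way to confirm this directly is to read the node's allocatable resources. This is a sketch, assuming the resource was renamed to `nvidia.com/gpu.shared`; `shared_gpu_allocatable` is a hypothetical helper name, and the node name must be supplied by the caller.

```shell
# Prints the allocatable count of the renamed shared-GPU resource for node $1.
# Dots inside the resource name are escaped so jsonpath treats it as a single key.
shared_gpu_allocatable() {
  kubectl get node "$1" \
    -o jsonpath='{.status.allocatable.nvidia\.com/gpu\.shared}'
}

# Usage (node name is a placeholder):
#   shared_gpu_allocatable <node-name>
```

With one physical GPU and `replicas: 2`, the reported count should be 2 if time-slicing is active.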

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Signed-off-by: Kyle Sessions <kssessio@amazon.com>
@KCSesh KCSesh marked this pull request as ready for review October 8, 2024 17:16
@KCSesh KCSesh merged commit 169984b into bottlerocket-os:develop Oct 8, 2024
2 checks passed