This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

[K8s] Add multi-gpu support (#385)
* Add multi-gpu support for k8

* Remove kubernetes gpu example

* Update kubernetes gpu README and templates

* Add Accelerator feature gate only for k8s > 1.6

* Parametrize kubernetes version checking function

* remove --feature-gates flag from kuberneteskubelet.service

* add test for VersionOrdinal
ritazh authored and JackQuincy committed Jun 8, 2017
1 parent 45cfbc1 commit f859f72
Showing 9 changed files with 129 additions and 6 deletions.
1 change: 1 addition & 0 deletions docs/README.md
@@ -11,6 +11,7 @@ This cluster definition examples demonstrate how to create a customized Docker E
* [DC/OS Walkthrough](dcos.md) - shows how to create a DC/OS enabled Docker cluster on Azure
* [Kubernetes Walkthrough](kubernetes.md) - shows how to create a Kubernetes enabled Docker cluster on Azure
* [Kubernetes Windows Walkthrough](kubernetes.windows.md) - shows how to create a hybrid Kubernetes Windows enabled Docker cluster on Azure.
* [Kubernetes with GPU support Walkthrough](kubernetes.gpu.md) - shows how to create a Kubernetes cluster with GPU support.
* [Swarm Walkthrough](swarm.md) - shows how to create a Swarm enabled Docker cluster on Azure
* [Swarm Mode Walkthrough](swarmmode.md) - shows how to create a Swarm Mode cluster on Azure
* [Custom VNET](../examples/vnet) - shows how to use a custom VNET
65 changes: 65 additions & 0 deletions docs/kubernetes.gpu.md
@@ -0,0 +1,65 @@
# Microsoft Azure Container Service Engine - Kubernetes Multi-GPU support Walkthrough

## Deployment

Here are the steps to deploy a simple Kubernetes cluster with multi-GPU support:

1. Install a Kubernetes cluster: the [Kubernetes Walkthrough](kubernetes.md) shows how to create a Kubernetes cluster.
> NOTE: Make sure to configure the agent nodes with VM size `Standard_NC12` or above to utilize the GPUs.
2. Install drivers:
* SSH into each node and run the following script, `install-nvidia-driver.sh`:
```
curl -L -sf https://raw.githubusercontent.com/ritazh/acs-k8s-gpu/master/install-nvidia-driver.sh | sudo sh
```

To verify, when you run `kubectl describe node <node-name>`, you should get something like the following:

```
Capacity:
 alpha.kubernetes.io/nvidia-gpu: 2
 cpu: 12
 memory: 115505744Ki
 pods: 110
```

3. Schedule a multi-GPU container

* You need to specify `alpha.kubernetes.io/nvidia-gpu: 2` as a limit.
* You need to expose the drivers to the container as a volume. The original TensorFlow Docker image is based on Ubuntu 16.04, just like your cluster's VMs, so you can mount `/usr/bin` and `/usr/lib/x86_64-linux-gnu` directly. This is a bit dirty, but it works. Ideally, improve the previous script to install the driver in a specific directory and expose only that directory.

``` yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  labels:
    app: gpu-test
spec:
  volumes:
  - name: binaries
    hostPath:
      path: /usr/bin/
  - name: libraries
    hostPath:
      path: /usr/lib/x86_64-linux-gnu
  containers:
  - name: tensorflow
    image: gcr.io/tensorflow/tensorflow:latest-gpu
    ports:
    - containerPort: 8888
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2
    volumeMounts:
    - mountPath: /usr/bin/
      name: binaries
    - mountPath: /usr/lib/x86_64-linux-gnu
      name: libraries
```
To verify, when you run `kubectl describe pod <pod-name>`, you should see the following:

```
Successfully assigned gpu-test to k8s-agentpool1-10960440-1
```
1 change: 1 addition & 0 deletions docs/kubernetes.md
@@ -1,6 +1,7 @@
# Microsoft Azure Container Service Engine - Kubernetes Walkthrough

* [Kubernetes Windows Walkthrough](kubernetes.windows.md) - shows how to create a Kubernetes cluster on Windows.
* [Kubernetes with GPU support Walkthrough](kubernetes.gpu.md) - shows how to create a Kubernetes cluster with GPU support.

## Deployment

4 changes: 3 additions & 1 deletion parts/kubernetesagentcustomdata.yml
@@ -115,6 +115,9 @@ write_files:
KUBELET_REGISTER_SCHEDULABLE=true
KUBELET_NODE_LABELS=role=agent
KUBELET_POD_INFRA_CONTAINER_IMAGE={{WrapAsVariable "kubernetesPodInfraContainerSpec"}}
{{if IsKubernetesVersionGe "1.6.0"}}
KUBELET_FEATURE_GATES=--feature-gates=Accelerators=true
{{end}}

- path: "/etc/systemd/system/kubelet.service"
permissions: "0644"
@@ -153,4 +156,3 @@ runcmd:
- systemctl restart docker
- mkdir -p /etc/kubernetes/manifests
- usermod -aG docker {{WrapAsVariable "username"}}
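
The `{{if IsKubernetesVersionGe "1.6.0"}}` guard in this file is ordinary Go `text/template` syntax backed by a function registered in a `template.FuncMap`. A minimal sketch of that mechanism follows. It assumes a cut-down template (`kubeletEnv`) and a hypothetical `renderKubeletEnv` helper that takes the version predicate as a parameter; the real generator wires the predicate to the cluster definition instead.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// kubeletEnv is a cut-down stand-in for the environment block in the diff.
const kubeletEnv = `KUBELET_REGISTER_SCHEDULABLE=true
{{if IsKubernetesVersionGe "1.6.0"}}KUBELET_FEATURE_GATES=--feature-gates=Accelerators=true
{{end}}`

// renderKubeletEnv (hypothetical) renders kubeletEnv with isGe registered
// under the template function name used in the diff.
func renderKubeletEnv(isGe func(version string) bool) (string, error) {
	// Funcs must be called before Parse so the parser knows the name.
	tmpl, err := template.New("kubelet-env").Funcs(template.FuncMap{
		"IsKubernetesVersionGe": isGe,
	}).Parse(kubeletEnv)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, nil); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	for _, enabled := range []bool{true, false} {
		out, err := renderKubeletEnv(func(string) bool { return enabled })
		if err != nil {
			panic(err)
		}
		fmt.Printf("predicate=%v:\n%s---\n", enabled, out)
	}
}
```

When the predicate returns false, the `KUBELET_FEATURE_GATES` line is simply absent from the rendered file.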

2 changes: 1 addition & 1 deletion parts/kuberneteskubelet.service
@@ -45,7 +45,7 @@ ExecStart=/usr/bin/docker run \
--azure-container-registry-config=/etc/kubernetes/azure.json \
--hairpin-mode=promiscuous-bridge \
--network-plugin=${KUBELET_NETWORK_PLUGIN} \
--v=2
--v=2 ${KUBELET_FEATURE_GATES}

[Install]
WantedBy=multi-user.target
1 change: 0 additions & 1 deletion parts/kubernetesmastercustomdata.yml
@@ -317,4 +317,3 @@ runcmd:
- systemctl restart docker
- mkdir -p /etc/kubernetes/manifests
- usermod -aG docker {{WrapAsVariable "username"}}

37 changes: 37 additions & 0 deletions pkg/acsengine/engine.go
@@ -477,6 +477,36 @@ func addSecret(m map[string]interface{}, k string, v interface{}, encode bool) {
}
}

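// VersionOrdinal converts a version string into a string that sorts in
// version order under plain byte-wise comparison: every run of digits is
// prefixed with a byte holding the length of that run.
// Adapted from https://stackoverflow.com/a/18411978
func VersionOrdinal(version api.OrchestratorVersion) string {
	// ISO/IEC 14651:2011
	const maxByte = 1<<8 - 1
	vo := make([]byte, 0, len(version)+8)
	j := -1 // index of the length byte of the current digit run, or -1
	for i := 0; i < len(version); i++ {
		b := version[i]
		if '0' > b || b > '9' {
			// Non-digit: copy it through and end the current digit run.
			vo = append(vo, b)
			j = -1
			continue
		}
		if j == -1 {
			// First digit of a run: reserve the length byte.
			vo = append(vo, 0x00)
			j = len(vo) - 1
		}
		if vo[j] == 1 && vo[j+1] == '0' {
			// Drop a leading zero.
			vo[j+1] = b
			continue
		}
		// Byte arithmetic wraps around, so check before incrementing.
		if vo[j] == maxByte {
			panic("VersionOrdinal: invalid version")
		}
		vo = append(vo, b)
		vo[j]++
	}
	return string(vo)
}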

// getTemplateFuncMap returns all functions used in template generation
func (t *TemplateGenerator) getTemplateFuncMap(cs *api.ContainerService) map[string]interface{} {
return template.FuncMap{
@@ -500,6 +530,13 @@ func (t *TemplateGenerator) getTemplateFuncMap(cs *api.ContainerService) map[str
return cs.Properties.OrchestratorProfile.OrchestratorType == api.DCOS &&
cs.Properties.OrchestratorProfile.OrchestratorVersion == api.DCOS190
},
"IsKubernetesVersionGe": func(version string) bool {
targetVersion := api.OrchestratorVersion(version)
targetVersionOrdinal := VersionOrdinal(targetVersion)
orchestratorVersionOrdinal := VersionOrdinal(cs.Properties.OrchestratorProfile.OrchestratorVersion)
return cs.Properties.OrchestratorProfile.OrchestratorType == api.Kubernetes &&
orchestratorVersionOrdinal >= targetVersionOrdinal
},
"RequiresFakeAgentOutput": func() bool {
return cs.Properties.OrchestratorProfile.OrchestratorType == api.Kubernetes
},
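
`IsKubernetesVersionGe` compares ordinals rather than the raw version strings because plain string comparison misorders multi-digit components such as `1.10.0` versus `1.9.0`. The sketch below illustrates the difference. It assumes a local copy of the function, `versionOrdinal`, that takes a plain `string` instead of `api.OrchestratorVersion` so that it compiles without the acs-engine packages, and `versionGe` is a hypothetical helper.

```go
package main

import "fmt"

// versionOrdinal mirrors VersionOrdinal from the diff, but takes a plain
// string (an assumption made to keep this example self-contained).
func versionOrdinal(version string) string {
	const maxByte = 1<<8 - 1
	vo := make([]byte, 0, len(version)+8)
	j := -1 // index of the length byte of the current digit run, or -1
	for i := 0; i < len(version); i++ {
		b := version[i]
		if b < '0' || b > '9' {
			vo = append(vo, b)
			j = -1
			continue
		}
		if j == -1 {
			vo = append(vo, 0x00)
			j = len(vo) - 1
		}
		if vo[j] == 1 && vo[j+1] == '0' {
			vo[j+1] = b // drop a leading zero
			continue
		}
		if vo[j] == maxByte {
			panic("versionOrdinal: invalid version")
		}
		vo = append(vo, b)
		vo[j]++
	}
	return string(vo)
}

// versionGe (hypothetical) reports whether version a is at least version b.
func versionGe(a, b string) bool {
	return versionOrdinal(a) >= versionOrdinal(b)
}

func main() {
	fmt.Println("1.10.0" >= "1.9.0")          // false: '1' sorts before '9'
	fmt.Println(versionGe("1.10.0", "1.9.0")) // true: the run "10" is longer than "9"
	fmt.Println(versionGe("1.5.3", "1.6.0"))  // false
}
```

The length prefix is what makes longer digit runs sort after shorter ones, so no numeric parsing is needed.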
18 changes: 18 additions & 0 deletions pkg/acsengine/engine_test.go
@@ -11,6 +11,7 @@ import (
"github.com/Azure/acs-engine/pkg/api"
"github.com/Azure/acs-engine/pkg/api/v20160330"
"github.com/Azure/acs-engine/pkg/api/vlabs"
. "github.com/onsi/gomega"
)

const (
@@ -195,3 +196,20 @@ func addTestCertificateProfile(api *api.CertificateProfile) {
api.KubeConfigPrivateKey = "kubeConfigPrivateKey"
api.SetCAPrivateKey("")
}

func TestVersionOrdinal(t *testing.T) {
	RegisterTestingT(t)
	// Compare the ordinals, not the raw version strings, so that the test
	// actually exercises VersionOrdinal.
	v162 := VersionOrdinal(api.OrchestratorVersion("1.6.2"))
	v160 := VersionOrdinal(api.OrchestratorVersion("1.6.0"))
	v153 := VersionOrdinal(api.OrchestratorVersion("1.5.3"))
	v16 := VersionOrdinal(api.OrchestratorVersion("1.6"))

	Expect(v162 > v160).To(BeTrue())
	Expect(v160 < v162).To(BeTrue())
	Expect(v153 < v160).To(BeTrue())

	// testing with different version length
	Expect(v16 < v162).To(BeTrue())
	Expect(v16 > v153).To(BeTrue())
}