Horovod is a distributed training framework for TensorFlow, and it's provided by UBER. The goal of Horovod is to make distributed Deep Learning fast and easy to use. And it provides Horovod in Docker to streamline the installation process.
This chart bootstraps Horovod which is a Distributed TensorFlow Framework on a Kubernetes cluster using the Helm Package Manager. It deploys Horovod workers as statefulsets, and the Horovod master as a job, then discover the host list automatically.
- Kubernetes cluster v1.8+
You can download official Horovod Dockerfile, then modify it according to your requirement, e.g. select a different CUDA, TensorFlow or Python version.
# mkdir horovod-docker
# wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/uber/horovod/master/Dockerfile
# docker build -t horovod:latest horovod-docker
# Setup ssh key
export SSH_KEY_DIR=`mktemp -d`
cd $SSH_KEY_DIR
yes | ssh-keygen -N "" -f id_rsa
To run Horovod with GPU, you can create values.yaml
like below
# cat << EOF > ~/values.yaml
---
ssh:
useSecrets: true
hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/ /g')
hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/ /g')
resources:
limits:
nvidia.com/gpu: 1
requests:
nvidia.com/gpu: 1
worker:
number: 2
image:
repository: uber/horovod
tag: 0.12.1-tf1.8.0-py3.5
master:
image:
repository: uber/horovod
tag: 0.12.1-tf1.8.0-py3.5
args:
- "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
For most cases, the overlay network impacts the Horovod performance greatly, so we should apply Host Network
solution. To run Horovod with Host Network and GPU, you can create values.yaml
like below
# cat << EOF > ~/values.yaml
---
useHostNetwork: true
ssh:
useSecrets: true
port: 32222
hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/ /g')
hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/ /g')
resources:
limits:
nvidia.com/gpu: 1
requests:
nvidia.com/gpu: 1
worker:
number: 2
image:
repository: uber/horovod
tag: 0.12.1-tf1.8.0-py3.5
master:
image:
repository: uber/horovod
tag: 0.12.1-tf1.8.0-py3.5
args:
- "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
notice: the difference is that you should set
useHostNetwork
as true, then set another ssh port rather than22
To install the chart with the release name mnist
:
$ helm install --values ~/values.yaml --name mnist stable/horovod
To uninstall/delete the mnist
deployment:
$ helm delete mnist
The command removes all the Kubernetes components associated with the chart and deletes the release.
A major chart version change (like v1.2.3 -> v2.0.0) indicates that there is an incompatible breaking change needing manual actions.
This version removes the chart
label from the spec.selector.matchLabels
which is immutable since StatefulSet apps/v1beta2
. It has been inadvertently
added, causing any subsequent upgrade to fail. See helm#7726.
In order to upgrade, delete the Horovod StatefulSet before upgrading, supposing your Release is named my-release
:
$ kubectl delete statefulsets.apps --cascade=false my-release
The following table lists the configurable parameters of the Horovod chart and their default values.
Parameter | Description | Default |
---|---|---|
useHostNetwork |
Host network | false |
ssh.port |
The ssh port | 22 |
ssh.useSecrets |
Determine if using the secrets for ssh | false |
worker.number |
The worker's number | 5 |
worker.image.repository |
horovod worker image | uber/horovod |
worker.image.pullPolicy |
pullPolicy for the worker |
IfNotPresent |
worker.image.tag |
tag for the worker |
0.12.1-tf1.8.0-py3.5 |
resources |
pod resource requests & limits | {} |
worker.env |
worker's environment variables | {} |
master.image.repository |
horovod master image | uber/horovod |
master.image.tag |
tag for the master |
0.12.1-tf1.8.0-py3.5 |
master.image.pullPolicy |
image pullPolicy for the master image | IfNotPresent |
master.args |
master's args | {} |
master.env |
master's environment variables | {} |