To use Horovod on GPUs, read the options below and choose the one that best fits your setup.
In most situations, using NCCL 2 will significantly improve performance over the CPU version. NCCL 2 provides the allreduce operation optimized for NVIDIA GPUs and a variety of networking devices, such as RoCE or InfiniBand.
- Install NCCL 2.
Steps to install NCCL 2 are listed here.
If you have installed NCCL 2 using the nccl-<version>.txz package, you should add the library path to the LD_LIBRARY_PATH environment variable or register it in /etc/ld.so.conf.
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl-<version>/lib
- (Optional) If you're using an NVIDIA Tesla GPU and a NIC with GPUDirect RDMA support, you can further speed up NCCL 2 by installing the nv_peer_memory driver.
GPUDirect allows GPUs to transfer memory among each other without CPU involvement, which significantly reduces latency and load on the CPU. NCCL 2 uses GPUDirect automatically for the allreduce operation if it detects it.
- Install Open MPI or another MPI implementation.
Steps to install Open MPI are listed here.
- Install the horovod pip package.
If you have installed NCCL 2 using the nccl-<version>.txz package, you should specify the path to NCCL 2 using the HOROVOD_NCCL_HOME environment variable.
$ HOROVOD_NCCL_HOME=/usr/local/nccl-<version> HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
If you have installed NCCL 2 using the Ubuntu package, you can simply run:
$ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
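The install step above can also be scripted. The following is a minimal sketch, not part of Horovod: the helper names `can_load` and `horovod_build_env` are hypothetical, and `libnccl.so.2` is assumed to be the file name of the NCCL 2 shared library on your system. It checks whether the dynamic loader can find NCCL, then builds the environment used for the pip command.

```python
import ctypes
import os


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


def horovod_build_env(nccl_home=None, base_env=None):
    """Return a copy of the environment with the Horovod build variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOROVOD_GPU_ALLREDUCE"] = "NCCL"
    if nccl_home is not None:
        # Only needed for the .txz install, where NCCL lives outside the default paths.
        env["HOROVOD_NCCL_HOME"] = nccl_home
    return env


# Same command as in the steps above, split into an argument list.
PIP_COMMAND = ["pip", "install", "--no-cache-dir", "horovod"]

if __name__ == "__main__":
    # Assumption: the NCCL 2 shared library is named libnccl.so.2.
    print("NCCL loadable:", can_load("libnccl.so.2"))
    env = horovod_build_env(nccl_home=None)
    print("HOROVOD_GPU_ALLREDUCE =", env["HOROVOD_GPU_ALLREDUCE"])
    # To actually install, run: subprocess.run(PIP_COMMAND, env=env, check=True)
```

If the loadability check prints False after a .txz install, the LD_LIBRARY_PATH or /etc/ld.so.conf step is the first thing to revisit.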
Note: Some models with a high computation-to-communication ratio benefit from doing allreduce on CPU, even if a GPU version is available. To force allreduce to happen on CPU, pass device_dense='/cpu:0' to hvd.DistributedOptimizer:
opt = hvd.DistributedOptimizer(opt, device_dense='/cpu:0')
This section is only relevant if you have a proprietary MPI implementation with GPU support, i.e. not Open MPI or MPICH. Most users should follow one of the sections above.
If your MPI vendor's implementation of the allreduce operation on GPU is faster than NCCL 2, you can configure Horovod to use it instead:
$ HOROVOD_GPU_ALLREDUCE=MPI pip install --no-cache-dir horovod
Additionally, if your MPI vendor's implementation supports allgather and broadcast operations on GPU, you can configure Horovod to use them as well:
$ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip install --no-cache-dir horovod
Note: Allgather allocates an output tensor whose size is proportional to the number of processes participating in the training. If you find yourself running out of GPU memory, you can force allgather to happen on CPU by passing device_sparse='/cpu:0' to hvd.DistributedOptimizer:
opt = hvd.DistributedOptimizer(opt, device_sparse='/cpu:0')
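To get a feel for how the allgather output grows with the number of processes, here is a rough back-of-the-envelope sketch. The function `allgather_output_bytes` is hypothetical (not a Horovod API), and it assumes that the output holds every process's rows concatenated, stored as 4-byte float32 values. The row counts and row width are made-up illustration values.

```python
def allgather_output_bytes(rows_per_process, row_elements, bytes_per_element=4):
    """Estimate the size of the gathered tensor that every process allocates.

    rows_per_process: number of rows contributed by each process.
    row_elements: number of elements in one row.
    bytes_per_element: 4 for float32 (assumption).
    """
    total_rows = sum(rows_per_process)
    return total_rows * row_elements * bytes_per_element


if __name__ == "__main__":
    # Hypothetical workload: each process contributes 10,000 rows of 128 floats.
    for num_processes in (1, 8, 64):
        size = allgather_output_bytes([10_000] * num_processes, row_elements=128)
        print(f"{num_processes:>3} processes: {size / 1e6:.1f} MB")
```

Because every process allocates the full gathered tensor, the estimate applies per GPU, not to the job as a whole.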