Training across multiple nodes does not work

If I need more than 8 GPUs (each node has 8 GPUs), I have to train across several nodes.

In this case, running `mpiexec -n 16 python script/image_train.py` doesn't work.

It fails with an NCCL error.
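
For reference, here is a minimal sketch of how the 16 processes requested above would have to be spread over two 8-GPU nodes. It assumes ranks are placed on nodes in contiguous blocks, which is only an assumption: the real placement depends on the MPI launcher and its host configuration. The helper names `rank_layout` and `nodes_needed` are hypothetical and are not part of the training script.

```python
GPUS_PER_NODE = 8  # from the report: each node has 8 GPUs


def nodes_needed(world_size, gpus_per_node=GPUS_PER_NODE):
    """Number of nodes required to give every rank its own GPU (ceiling division)."""
    return -(-world_size // gpus_per_node)


def rank_layout(world_size, gpus_per_node=GPUS_PER_NODE):
    """Return (rank, node_index, local_gpu) for each rank.

    Assumes block placement: ranks 0..7 on node 0, ranks 8..15 on node 1, and so on.
    """
    return [(r, r // gpus_per_node, r % gpus_per_node) for r in range(world_size)]


if __name__ == "__main__":
    world_size = 16  # matches `mpiexec -n 16`
    print(f"nodes needed: {nodes_needed(world_size)}")
    for rank, node, gpu in rank_layout(world_size):
        print(f"rank {rank:2d} -> node {node}, local GPU {gpu}")
```

If all 16 ranks end up on a single node instead, several ranks would share a local GPU index, so the actual placement may be worth checking when debugging.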