Our experiments heavily rely on `Docker` and `Kubernetes`. For the detailed experimental environment setup, please refer to the `dockerfile` under the `environments` folder.
Below is a brief explanation of the arguments used in the code.
- Arguments related to distributed training:
  - The `n_mpi_process` and `n_sub_process` indicate the number of nodes and the number of GPUs per node, respectively. The data-parallel wrapper is adapted and applied locally on each node.
    - Note that the exact mini-batch size for each MPI process is specified by `batch_size`, while the mini-batch size used for each GPU is `batch_size / n_sub_process`.
  - The `world` describes the GPU topology of the distributed training, i.e., the GPU ids used by all MPI processes.
  - The `hostfile` from `mpi` specifies the physical location of the MPI processes.
  - We provide two use cases here:
    - `n_mpi_process=2`, `n_sub_process=1` and `world=0,0` indicate that two MPI processes are running on 2 GPUs with the same GPU id. They can either share a single GPU on the same node or use two GPUs on different nodes, where the exact placement is determined by `hostfile`.
    - `n_mpi_process=2`, `n_sub_process=2` and `world=0,1,0,1` indicate that two MPI processes are running on 4 GPUs, and each MPI process uses GPU id 0 and GPU id 1 (on 2 nodes).
- Arguments related to communication compression:
  - The `graph_topology` specifies the communication topology used for decentralized training, e.g., `ring`.
  - The `optimizer` decides the type of distributed training algorithm, e.g., centralized SGD or decentralized SGD.
  - The `comm_op` specifies the communication compressor to use, e.g., `sign+norm`, `random-k`, `top-k` (a minimal sketch of these compressors follows this list).
  - The `consensus_stepsize` determines the consensus stepsize for different decentralized algorithms (e.g., `parallel_choco`, `deep_squeeze`).
- Arguments related to learning:
  - The `lr_scaleup`, `lr_warmup` and `lr_warmup_epochs` decide whether to scale up the learning rate and whether (and for how long) to warm it up (a sketch of this recipe follows this list). For more details, please check `pcode/create_scheduler.py`.
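The compressors themselves are implemented inside the repository's optimizer code; the snippet below is only a minimal, self-contained sketch of what `sign+norm`, `top-k`, and `random-k` compression typically compute. The function names here are illustrative and are not the repository's API.

```python
import torch

def sign_norm_compress(x):
    # Send only the sign of each entry plus a single norm, so that the
    # reconstructed tensor keeps the original scale on average.
    return torch.sign(x) * (x.norm(p=1) / x.numel())

def top_k_compress(x, ratio=0.1):
    # Keep only the `ratio` fraction of entries with the largest magnitude.
    k = max(1, int(ratio * x.numel()))
    _, indices = torch.topk(x.abs().flatten(), k)
    out = torch.zeros_like(x).flatten()
    out[indices] = x.flatten()[indices]
    return out.view_as(x)

def random_k_compress(x, ratio=0.1):
    # Keep a uniformly random `ratio` fraction of entries.
    k = max(1, int(ratio * x.numel()))
    indices = torch.randperm(x.numel())[:k]
    out = torch.zeros_like(x).flatten()
    out[indices] = x.flatten()[indices]
    return out.view_as(x)
```

Depending on flags such as `is_biased` and `compress_ratio`, the actual implementation may additionally rescale the sparsified tensor (e.g., by `numel / k`) so that the compressor becomes unbiased.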
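The authoritative learning-rate logic lives in `pcode/create_scheduler.py`; the sketch below only illustrates the common "linear scaling + gradual warm-up" recipe that these flags refer to, with a hypothetical helper that is not part of the repository.

```python
def warmup_lr(base_lr, epoch, n_workers, scaleup=True, warmup=True, warmup_epochs=5):
    # Linear-scaling rule: grow the target learning rate with the number of
    # workers, since the global mini-batch grows with the number of processes.
    target_lr = base_lr * n_workers if scaleup else base_lr

    # Gradual warm-up: ramp linearly from base_lr to target_lr over the first
    # `warmup_epochs` epochs, then stay at target_lr (until the scheduler decays it).
    if warmup and epoch < warmup_epochs:
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr

# With 16 MPI processes and lr=0.1 (as in the script below), the learning rate
# ramps from 0.1 towards 1.6 over the first 5 epochs.
print(warmup_lr(0.1, epoch=2, n_workers=16))  # 0.7
```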
The script below trains `ResNet-20` on `CIFAR-10`, as an example of the decentralized training algorithm `CHOCO`. More examples can be found in the `exps` folder.
```bash
OMP_NUM_THREADS=2 MKL_NUM_THREADS=2 $HOME/conda/envs/pytorch-py3.6/bin/python run.py \
    --arch resnet20 --optimizer parallel_choco \
    --avg_model True --experiment demo --manual_seed 6 \
    --data cifar10 --pin_memory True \
    --batch_size 128 --base_batch_size 64 --num_workers 2 \
    --num_epochs 300 --partition_data random --reshuffle_per_epoch True --stop_criteria epoch \
    --n_mpi_process 16 --n_sub_process 1 --world 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 \
    --on_cuda True --use_ipc False \
    --lr 0.1 --lr_scaleup True --lr_warmup True --lr_warmup_epochs 5 \
    --lr_scheduler MultiStepLR --lr_decay 0.1 --lr_milestones 150,225 \
    --comm_op sign --consensus_stepsize 0.5 --compress_ratio 0.9 --quantize_level 16 --is_biased True \
    --weight_decay 1e-4 --use_nesterov True --momentum_factor 0.9 \
    --hostfile hostfile --graph_topology ring --track_time True --display_tracked_time True \
    --python_path $HOME/conda/envs/pytorch-py3.6/bin/python --mpi_path $HOME/.openmpi/
```
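To make the mapping between `n_mpi_process`, `n_sub_process`, `world`, and `batch_size` concrete, the standalone sketch below (not code from this repository) recovers, for each MPI process, the GPU ids and per-GPU mini-batch size implied by the flags in the script above; the grouping of `world` into per-process GPU ids follows the use cases described earlier.

```python
def describe_world(n_mpi_process, n_sub_process, world, batch_size):
    # `world` lists the GPU ids of all processes: n_mpi_process groups of
    # n_sub_process consecutive ids each.
    gpu_ids = [int(g) for g in world.split(",")]
    assert len(gpu_ids) == n_mpi_process * n_sub_process

    for rank in range(n_mpi_process):
        gpus = gpu_ids[rank * n_sub_process:(rank + 1) * n_sub_process]
        # Each MPI process receives `batch_size` samples, split across its GPUs.
        print(f"rank {rank}: GPUs {gpus}, {batch_size // n_sub_process} samples per GPU")

# Flags from the script above: 16 single-GPU MPI processes, each using GPU id 0
# (the node placement is determined by `hostfile`) and 128 samples per step.
describe_world(16, 1, ",".join(["0"] * 16), 128)
```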