
ParallelDo performance on VGG #8719

Closed

Description

@tonyyang-svail

Major takeaways:

  1. Parameter copying is still a major bottleneck (with a large net such as VGG16, Memcpy takes up to 80% of the time).
  2. We do need multiple streams (the AllReduce kernel takes up 70% of the total kernel time); see the sketch after this list.
  3. NCCLInit should not be called at every iteration: it takes about 70 ms with one GPU and 90 ms with four GPUs.
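
A minimal C++ sketch of what takeaways 2 and 3 suggest, assuming NCCL 2 and four visible GPUs (the buffer size and iteration count are hypothetical, not from the issue): communicators are created once up front rather than per iteration, and AllReduce runs on dedicated per-device streams so it can overlap with compute kernels.

```cpp
// Sketch only: NCCL communicators created ONCE (takeaway 3), AllReduce
// issued on a dedicated per-device stream (takeaway 2).
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
  const int nDev = 4;
  int devs[4] = {0, 1, 2, 3};
  const size_t count = 1 << 20;  // hypothetical gradient buffer size

  ncclComm_t comms[4];
  cudaStream_t commStreams[4];
  float* grads[4];

  // Done once at startup -- NOT once per iteration (the issue measured
  // about 70-90 ms per NCCLInit call).
  ncclCommInitAll(comms, nDev, devs);
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    cudaMalloc(&grads[i], count * sizeof(float));
    cudaStreamCreate(&commStreams[i]);  // dedicated communication stream
  }

  for (int iter = 0; iter < 100; ++iter) {
    // ... forward/backward kernels fill grads[i] on each device ...

    // Grouped AllReduce on the communication streams; compute kernels on
    // other streams can proceed concurrently.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
      ncclAllReduce(grads[i], grads[i], count, ncclFloat, ncclSum,
                    comms[i], commStreams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
      cudaSetDevice(devs[i]);
      cudaStreamSynchronize(commStreams[i]);
    }
    // ... apply gradients ...
  }

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    cudaFree(grads[i]);
    cudaStreamDestroy(commStreams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```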

Background

test script, command line

Net: VGG16
Model size: 409 MB (the original definition of the VGG16 net is incorrect; see #8718)
Batch size: 16 per GPU
BatchNorm: OFF, since parallel_do does not support it

Inputs are randomly generated on each GPU, so there is no overhead from copying training data to the different devices.
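
For illustration, a minimal sketch of this setup in raw CUDA/cuRAND rather than the actual Paddle test script (the seed and the replica loop are hypothetical): each device fills its own input batch in place, so no host-to-device Memcpy of data is involved.

```cpp
// Sketch only: each GPU fills its own input batch in place with cuRAND,
// so no training data is copied host-to-device. Shapes follow the issue's
// setup (batch 16 per GPU, 3x224x224 VGG16 inputs); the seed is arbitrary.
#include <cuda_runtime.h>
#include <curand.h>

int main() {
  const int nDev = 4;
  const size_t batchElems = 16ull * 3 * 224 * 224;  // 16 images per GPU

  for (int dev = 0; dev < nDev; ++dev) {
    cudaSetDevice(dev);

    float* input;
    cudaMalloc(&input, batchElems * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL + dev);
    curandGenerateUniform(gen, input, batchElems);  // generated on-device

    // ... run this device's replica of VGG16 on `input` ...

    curandDestroyGenerator(gen);
    cudaFree(input);
  }
  return 0;
}
```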

Result

Time unit: milliseconds.

| # GPUs | copy weights (ms) | forward and backward (ms) | merge gradient (ms) | apply gradient (ms) | total |
|---|---|---|---|---|---|
| 1 | N/A | 130 | N/A | 5 | |
| 1 (NCCL in backward) | N/A | 220 | N/A | 5 | |
| 4 | 350 | 130 | 350 | 5 | |
| 4 (NCCL in backward) | 350 | 650 (AllReduce takes about 70%) | N/A | 5 | |
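
As a methodology note (not from the issue itself), per-phase numbers like the ones above can be collected with CUDA events. A minimal sketch; `dummyPhase` is a hypothetical stand-in for the real work of each column (weight copy, forward/backward, merge, apply):

```cpp
// Sketch only: timing one phase of a training step with CUDA events.
// dummyPhase() is a placeholder for the real per-phase GPU work.
#include <cstdio>
#include <cuda_runtime.h>

static void dummyPhase() {
  // Stand-in for e.g. the weight-copy cudaMemcpyAsync calls or the
  // forward/backward kernel launches of one step.
  cudaDeviceSynchronize();
}

static float timePhaseMs(void (*phase)()) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  phase();                      // enqueue the phase's GPU work
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);   // block until the phase has finished
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  printf("phase took %.1f ms\n", timePhaseMs(dummyPhase));
  return 0;
}
```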
