Fluid distributed training performance is terrible using GPU #8119

Closed
@typhoonzero

Description

Running vgg16 with the cifar10 dataset. The Fluid cluster job is submitted with kubectl, using 5 pservers and 5 trainers; each trainer requests 1 GPU via alpha.kubernetes.io/nvidia-gpu: 1.

Environment:
- CUDA: 8
- cuDNN: 5
- Driver version: 375.26
- GPU: P40
- HostNetwork
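
For context on where the communication ops in the profile below come from: in Fluid, the pserver/trainer split is produced by the distribute transpiler, which rewrites the program and inserts the split/send/concat ops. The sketch below shows roughly how that setup looks; the exact transpile() arguments changed across early Fluid releases, and the endpoint/role environment variables are assumptions, not the script actually used here.

```python
import os
import paddle.fluid as fluid

# Hypothetical endpoints/role; in the k8s job these would come from the
# pod environment rather than hard-coded values.
pserver_endpoints = os.getenv("PSERVERS", "ps0:6174,ps1:6174")
current_endpoint = os.getenv("POD_IP", "127.0.0.1") + ":6174"
trainer_id = int(os.getenv("TRAINER_ID", "0"))
training_role = os.getenv("TRAINING_ROLE", "TRAINER")

t = fluid.DistributeTranspiler()
# Rewrites the program into pserver/trainer variants and inserts the
# split/send/concat communication ops that show up in the profile.
t.transpile(trainer_id=trainer_id, pservers=pserver_endpoints, trainers=5)

if training_role == "PSERVER":
    pserver_prog = t.get_pserver_program(current_endpoint)
    startup_prog = t.get_startup_program(current_endpoint, pserver_prog)
else:
    trainer_prog = t.get_trainer_program()
```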

Additional information: I see that CPU usage stays at 100% in the container for long stretches; maybe the CPU is the bottleneck?

Per mini-batch time: around 60s.
With CPU only, it's around 10s.
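
The report below was presumably captured with the Fluid profiler, roughly as in this sketch; the 'GPU' state and the placeholder mini-batch helper are assumptions, while sorted_key='total' matches the "sorted by total time" header of the report.

```python
import paddle.fluid.profiler as profiler

def run_one_minibatch(data):
    # Placeholder for the real exe.run(trainer_prog, feed=..., fetch_list=...) call.
    pass

# 'GPU' state is an assumption; 'total' sorts events by total time.
with profiler.profiler('GPU', 'total'):
    for data in [None] * 10:   # stand-in for the cifar10 train reader
        run_one_minibatch(data)
```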

```
------------------------->     Profiling Report     <-------------------------
Time unit: ms
Sorted by total time in descending order in the same thread
Event                            Calls       Total       Min.        Max.        Ave.
thread0::split                   5865        1.11765e+07 2.63047     6652.4      1905.63
thread0::concat                  5865        1.09052e+07 2.61659     6175.19     1859.36
thread0::send                    391         2.29786e+06 327.89      13663.4     5876.87
thread0::conv2d_grad             5083        893.141     0.065567    104.159     0.175711
thread0::conv2d                  5083        807.148     0.051993    11.0981     0.158794
thread0::fill_zeros_like         25806       562.583     0.012788    11.0516     0.0218005
thread0::batch_norm              5474        525.538     0.055927    6.09994     0.0960062
thread0::batch_norm_grad         5474        346.792     0.044622    9.09123     0.0633526
thread0::elementwise_add_grad    6256        341.264     0.037377    8.06849     0.0545499
thread0::elementwise_add         6256        295.606     0.024       7.93195     0.0472516
thread0::dropout                 3910        191.713     0.033447    6.07088     0.0490315
thread0::pool2d                  1955        183.506     0.036702    9.3676      0.0938649
thread0::mul                     1173        158.41      0.035665    8.38415     0.135047
thread0::pool2d_grad             1955        151.755     0.041505    8.08374     0.0776243
thread0::relu                    5474        143.952     0.015749    5.04012     0.0262974
thread0::dropout_grad            3910        131.019     0.022926    5.03294     0.0335086
thread0::relu_grad               5474        130.262     0.016308    0.196264    0.0237965
thread0::mul_grad                1173        125.795     0.055516    3.11973     0.107242
thread0::cast                    782         34.6569     0.019949    0.660779    0.0443183
thread0::softmax                 391         33.7846     0.043393    0.707031    0.0864057
thread0::fetch                   782         27.9014     0.02143     0.06991     0.0356795
thread0::elementwise_mul         391         22.6289     0.029586    0.658594    0.0578745
thread0::sum                     782         21.7157     0.015956    0.057544    0.0277695
thread0::mean                    391         20.8824     0.017462    0.680731    0.0534077
thread0::cross_entropy          391         18.9823     0.023271    7.04633     0.0485482
```
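
A quick back-of-the-envelope check on the report (assuming the 391 send calls correspond to 391 profiled mini-batches): split, concat and send alone account for essentially the whole per-mini-batch time, while the GPU compute ops (conv2d and friends) are negligible by comparison, which is consistent with the CPU-side communication path being the bottleneck.

```python
# Totals are in ms, taken from the report above.
split, concat, send = 1.11765e7, 1.09052e7, 2.29786e6
minibatches = 391                       # one send per mini-batch (assumption)
comm_s_per_batch = (split + concat + send) / 1000.0 / minibatches
print(comm_s_per_batch)                 # ~62 s, matching the ~60 s per mini-batch observed
```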
