Fluid distributed training performance is terrible using GPU

I am running VGG16 on the CIFAR-10 dataset, using `kubectl` to submit a Fluid cluster job with 5 pservers and 5 trainers. Each trainer requests 1 GPU using `alpha.kubernetes.io/nvidia-gpu: 1`.

CUDA: 8
cuDNN: 5
driver version: 375.26
GPU: P40
HostNetwork

Additional information: CPU usage in the container stays at 100% for a long time. Could the CPU be the bottleneck?


Per mini-batch time: around 60s.
With CPU only, it's around 10s.
```
------------------------->     Profiling Report     <-------------------------
Time unit: ms
Sorted by total time in descending order in the same thread
Event                            Calls       Total       Min.        Max.        Ave.
thread0::split                   5865        1.11765e+07 2.63047     6652.4      1905.63
thread0::concat                  5865        1.09052e+07 2.61659     6175.19     1859.36
thread0::send                    391         2.29786e+06 327.89      13663.4     5876.87
thread0::conv2d_grad             5083        893.141     0.065567    104.159     0.175711
thread0::conv2d                  5083        807.148     0.051993    11.0981     0.158794
thread0::fill_zeros_like         25806       562.583     0.012788    11.0516     0.0218005
thread0::batch_norm              5474        525.538     0.055927    6.09994     0.0960062
thread0::batch_norm_grad         5474        346.792     0.044622    9.09123     0.0633526
thread0::elementwise_add_grad    6256        341.264     0.037377    8.06849     0.0545499
thread0::elementwise_add         6256        295.606     0.024       7.93195     0.0472516
thread0::dropout                 3910        191.713     0.033447    6.07088     0.0490315
thread0::pool2d                  1955        183.506     0.036702    9.3676      0.0938649
thread0::mul                     1173        158.41      0.035665    8.38415     0.135047
thread0::pool2d_grad             1955        151.755     0.041505    8.08374     0.0776243
thread0::relu                    5474        143.952     0.015749    5.04012     0.0262974
thread0::dropout_grad            3910        131.019     0.022926    5.03294     0.0335086
thread0::relu_grad               5474        130.262     0.016308    0.196264    0.0237965
thread0::mul_grad                1173        125.795     0.055516    3.11973     0.107242
thread0::cast                    782         34.6569     0.019949    0.660779    0.0443183
thread0::softmax                 391         33.7846     0.043393    0.707031    0.0864057
thread0::fetch                   782         27.9014     0.02143     0.06991     0.0356795
thread0::elementwise_mul         391         22.6289     0.029586    0.658594    0.0578745
thread0::sum                     782         21.7157     0.015956    0.057544    0.0277695
thread0::mean                    391         20.8824     0.017462    0.680731    0.0534077
thread0::cross_entropy           391         18.9823     0.023271    7.04633     0.0485482
```
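
To make the report easier to read, here is a minimal Python sketch that parses a few rows copied from it and computes how much of the listed time goes to `split`, `concat`, and `send`. The helper name `parse_report` and the choice of those three ops as one group are my own, and the per-mini-batch estimate assumes one `send` call per mini-batch (391 calls), which the report does not state.

```python
# Rows copied verbatim from the profiling report above (times in ms).
REPORT = """\
thread0::split                   5865        1.11765e+07 2.63047     6652.4      1905.63
thread0::concat                  5865        1.09052e+07 2.61659     6175.19     1859.36
thread0::send                    391         2.29786e+06 327.89      13663.4     5876.87
thread0::conv2d_grad             5083        893.141     0.065567    104.159     0.175711
thread0::conv2d                  5083        807.148     0.051993    11.0981     0.158794
"""


def parse_report(text):
    """Return {op_name: (calls, total_ms)} for each profiler row (hypothetical helper)."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has 6 columns: event, calls, total, min, max, average.
        if len(fields) != 6 or "::" not in fields[0]:
            continue
        name = fields[0].split("::", 1)[1]
        rows[name] = (int(fields[1]), float(fields[2]))
    return rows


rows = parse_report(REPORT)
grand_total_ms = sum(total for _, total in rows.values())

# Share of the listed time spent in split, concat, and send.
grouped_ops = ["split", "concat", "send"]
comm_share = sum(rows[op][1] for op in grouped_ops) / grand_total_ms

# Assumption: one `send` call per mini-batch, so calls(send) = number of batches.
batches = rows["send"][0]
per_batch_s = grand_total_ms / batches / 1000.0

print(f"split+concat+send share of listed time: {comm_share:.2%}")
print(f"estimated time per mini-batch: {per_batch_s:.1f} s")
```

On these rows the three ops account for well over 99% of the listed time, and the per-batch estimate lands close to the roughly 60s observed, so the convolution kernels themselves do not look like the slow part.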
