Closed
Description
Running VGG16 on the CIFAR-10 dataset. I use kubectl to submit a Fluid cluster job with 5 pservers and 5 trainers.
Each trainer requests 1 GPU using alpha.kubernetes.io/nvidia-gpu: 1
CUDA: 8
cuDNN: 5
driver version: 375.26
GPU: P40
HostNetwork
Additional information: CPU usage in the container stays at 100% for a long time. Could the CPU be the bottleneck?
Per mini-batch time with GPU: around 60s
With CPU only, it's around 10s.
-------------------------> Profiling Report <-------------------------
Time unit: ms
Sorted by total time in descending order in the same thread
Event                          Calls   Total         Min.       Max.       Ave.
thread0::split                 5865    1.11765e+07   2.63047    6652.4     1905.63
thread0::concat                5865    1.09052e+07   2.61659    6175.19    1859.36
thread0::send                  391     2.29786e+06   327.89     13663.4    5876.87
thread0::conv2d_grad           5083    893.141       0.065567   104.159    0.175711
thread0::conv2d                5083    807.148       0.051993   11.0981    0.158794
thread0::fill_zeros_like       25806   562.583       0.012788   11.0516    0.0218005
thread0::batch_norm            5474    525.538       0.055927   6.09994    0.0960062
thread0::batch_norm_grad       5474    346.792       0.044622   9.09123    0.0633526
thread0::elementwise_add_grad  6256    341.264       0.037377   8.06849    0.0545499
thread0::elementwise_add       6256    295.606       0.024      7.93195    0.0472516
thread0::dropout               3910    191.713       0.033447   6.07088    0.0490315
thread0::pool2d                1955    183.506       0.036702   9.3676     0.0938649
thread0::mul                   1173    158.41        0.035665   8.38415    0.135047
thread0::pool2d_grad           1955    151.755       0.041505   8.08374    0.0776243
thread0::relu                  5474    143.952       0.015749   5.04012    0.0262974
thread0::dropout_grad          3910    131.019       0.022926   5.03294    0.0335086
thread0::relu_grad             5474    130.262       0.016308   0.196264   0.0237965
thread0::mul_grad              1173    125.795       0.055516   3.11973    0.107242
thread0::cast                  782     34.6569       0.019949   0.660779   0.0443183
thread0::softmax               391     33.7846       0.043393   0.707031   0.0864057
thread0::fetch                 782     27.9014       0.02143    0.06991    0.0356795
thread0::elementwise_mul       391     22.6289       0.029586   0.658594   0.0578745
thread0::sum                   782     21.7157       0.015956   0.057544   0.0277695
thread0::mean                  391     20.8824       0.017462   0.680731   0.0534077
thread0::cross_entropy         391     18.9823       0.023271   7.04633    0.0485482
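To make the report easier to read at a glance, here is a rough sketch that sums the Total column and checks how much of it goes to split, concat and send compared with the compute ops. The event names, call counts and totals are copied from the report. Grouping split/concat/send as "communication" ops and treating one send call as one mini-batch are my assumptions, not something the profiler states.

```python
# Event name, call count, and total time in ms, copied from the profiling report.
REPORT = """
split 5865 1.11765e+07
concat 5865 1.09052e+07
send 391 2.29786e+06
conv2d_grad 5083 893.141
conv2d 5083 807.148
fill_zeros_like 25806 562.583
batch_norm 5474 525.538
batch_norm_grad 5474 346.792
elementwise_add_grad 6256 341.264
elementwise_add 6256 295.606
dropout 3910 191.713
pool2d 1955 183.506
mul 1173 158.41
pool2d_grad 1955 151.755
relu 5474 143.952
dropout_grad 3910 131.019
relu_grad 5474 130.262
mul_grad 1173 125.795
cast 782 34.6569
softmax 391 33.7846
fetch 782 27.9014
elementwise_mul 391 22.6289
sum 782 21.7157
mean 391 20.8824
cross_entropy 391 18.9823
"""

# Assumption: these three ops handle parameter transfer to the pservers.
COMM_OPS = {"split", "concat", "send"}


def parse_report(text):
    """Map each event name to a (calls, total_ms) tuple."""
    rows = {}
    for line in text.strip().splitlines():
        name, calls, total_ms = line.split()
        rows[name] = (int(calls), float(total_ms))
    return rows


rows = parse_report(REPORT)
total_ms = sum(t for _, t in rows.values())
comm_ms = sum(t for name, (_, t) in rows.items() if name in COMM_OPS)
comm_share = comm_ms / total_ms

# Assumption: one send call per mini-batch, so the send call count
# is used as the number of profiled mini-batches.
num_batches = rows["send"][0]
comm_s_per_batch = comm_ms / num_batches / 1000.0

print(f"split/concat/send share of profiled time: {comm_share:.4%}")
print(f"split/concat/send time per mini-batch:    {comm_s_per_batch:.1f} s")
```

Under those assumptions, split, concat and send account for more than 99.9% of the profiled time and come to roughly 62 s per mini-batch, which is in line with the ~60s per mini-batch reported above. The conv2d, batch_norm and other compute ops together add up to only about 5 s over the whole run.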