Profiled model: SE-ResNeXt
https://github.com/PaddlePaddle/models/blob/develop/fluid/image_classification/se_resnext.py
Profiled with the timeline tool.
Single card:


Two cards with NCCL:

Two cards without NCCL:

Some simple conclusions:
1. When NCCL is used, there is a large idle gap between the forward pass and the backward pass.
2. There is a large idle gap between consecutive training steps.
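The gaps above were read off the timeline by eye. To put numbers on them, the idle intervals on each track can be measured directly from the trace file. This is a minimal sketch, assuming the timeline file is in the Chrome Trace Event JSON format (a `traceEvents` list of complete events with `"ph": "X"`, and `ts`/`dur` in microseconds). The helper names `load_trace`, `idle_gaps`, `gaps_by_track` and the `min_gap_us` threshold are hypothetical and not part of Paddle.

```python
import json


def load_trace(path):
    """Load a Chrome-trace-format JSON file (assumed output format of the timeline tool)."""
    with open(path) as f:
        return json.load(f)


def idle_gaps(events, min_gap_us=0.0):
    """Return (gap_start_us, gap_len_us) for idle intervals between complete ("X") events.

    Overlapping or nested events are merged, so only time where nothing on the
    track is running counts as a gap.
    """
    spans = sorted((e["ts"], e["ts"] + e["dur"]) for e in events if e.get("ph") == "X")
    gaps = []
    if not spans:
        return gaps
    busy_until = spans[0][1]
    for start, end in spans[1:]:
        if start - busy_until > min_gap_us:
            gaps.append((busy_until, start - busy_until))
        busy_until = max(busy_until, end)
    return gaps


def gaps_by_track(trace, min_gap_us=0.0):
    """Group events by (pid, tid), i.e. one timeline row per device/thread, and find gaps per row."""
    tracks = {}
    for e in trace.get("traceEvents", []):
        tracks.setdefault((e.get("pid"), e.get("tid")), []).append(e)
    return {track: idle_gaps(evts, min_gap_us) for track, evts in tracks.items()}
```

For example, on a small synthetic trace (not profiling data), a threshold filters out short scheduling gaps and keeps only the large ones, such as a stall between steps:

```python
trace = {"traceEvents": [
    {"ph": "X", "pid": 0, "tid": 0, "ts": 0, "dur": 10},
    {"ph": "X", "pid": 0, "tid": 0, "ts": 15, "dur": 5},
    {"ph": "X", "pid": 0, "tid": 0, "ts": 100, "dur": 10},
]}
print(gaps_by_track(trace, min_gap_us=50))  # {(0, 0): [(20, 80)]}
```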