Closed
Background
project: https://github.com/PaddlePaddle/Paddle/projects/55
Profiling scripts:
- add se resnet 152 profile script dzhwinter/benchmark#84
- add image resnet profile dzhwinter/benchmark#83
Optimization methods and results
- Delete unused GPU memory during training.
- This slows training down slightly because the GPU is an asynchronous device, but it reduces GPU memory usage more than variable reuse does (54.3% reduction vs. 45.5%). [Memory] More memory optimization policy #8690
- Remove program.clone in the Executor. (25% speedup) [Speed] speed up python executor in fluid #8729
- Initialize NCCL once. (5%~6% speedup) [Speed] Avoid init_nccl for every steps. #8758
- Use constant folding at compile time to reduce the number of calls to elementwise_mul ops at optimization time. (5%~10% speedup) optimize optimizer learning rate #8873
- Optimize elementwise-related ops: use our own implementations instead of depending on Eigen. (10x speedup for a single op) [Speed] Optimize elementwise_mul_op gradient functor #8811
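The first optimization above (freeing GPU memory once a variable's last use has passed) can be sketched as a liveness pass over the op list: for each variable, find the last op that reads it and schedule a free right after that op. The op representation and function name below are hypothetical, not Paddle's actual IR:

```python
def insert_delete_points(ops):
    """Given ops as (outputs, inputs) pairs in execution order, return a
    map from op index to the variables that can be freed right after it.
    This is a minimal liveness sketch, not Paddle's real pass."""
    last_use = {}
    for idx, (_outputs, inputs) in enumerate(ops):
        for var in inputs:
            last_use[var] = idx  # later reads overwrite earlier ones
    schedule = {}
    for var, idx in last_use.items():
        schedule.setdefault(idx, []).append(var)
    return schedule
```

Because frees are issued on an async device stream, they add a small synchronization cost, which matches the slowdown noted above.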
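The Executor speedup comes from not re-cloning the program on every run. A minimal sketch of the caching pattern, with hypothetical names (the real fix in #8729 is inside fluid's Python executor):

```python
class Executor:
    """Cache the prepared form of a program keyed by identity, instead of
    cloning it on every run() call. Names are illustrative only."""

    def __init__(self):
        self._cache = {}
        self.clone_count = 0  # tracks how often the expensive path runs

    def _prepare(self, program):
        self.clone_count += 1  # stands in for the costly program.clone
        return dict(program)

    def run(self, program):
        key = id(program)
        if key not in self._cache:
            self._cache[key] = self._prepare(program)
        return self._cache[key]
```

Running the same program repeatedly now pays the preparation cost once, which is where the reported 25% speedup comes from.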
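Initializing NCCL once per process rather than every step is a memoization pattern. A sketch using `functools.lru_cache`; `init_communicator` and its return value are stand-ins, not the actual NCCL API:

```python
import functools


@functools.lru_cache(maxsize=None)
def init_communicator(device_ids):
    """Hypothetical stand-in for NCCL communicator setup. lru_cache
    guarantees the expensive initialization runs only once per unique
    tuple of device ids; later calls return the cached handle."""
    return {"devices": device_ids, "handle": object()}
```

Calling it at every training step is now harmless, since all calls after the first return the same cached object.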
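Constant folding for the learning-rate expression means collapsing multiplications whose operands are all compile-time constants, so fewer elementwise_mul ops run per step. A toy folder over a nested-tuple expression tree (the representation is hypothetical, not fluid's program desc):

```python
def fold_constants(expr):
    """Fold ("mul", a, b) nodes whose children are both numeric
    constants; leave symbolic leaves (e.g. "grad") untouched."""
    if not isinstance(expr, tuple):
        return expr  # a constant or a symbolic leaf
    op, a, b = expr
    a, b = fold_constants(a), fold_constants(b)
    if op == "mul" and isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a * b  # evaluated once at compile time
    return (op, a, b)
```

For example, a decayed learning rate expressed as two constant multiplies collapses into a single scalar, so only one elementwise_mul remains at run time.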
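The elementwise_mul gradient itself is simple, which is why a hand-written implementation can beat a generic Eigen expression: for z = x * y, the gradients are dx = dout * y and dy = dout * x. A plain-Python reference of that functor's math (the real PR implements this as a C++/CUDA functor):

```python
def elementwise_mul_grad(x, y, dout):
    """Reference gradient of z = x * y (elementwise):
    dx = dout * y, dy = dout * x. Lists stand in for tensors."""
    dx = [d * b for d, b in zip(dout, y)]
    dy = [d * a for d, a in zip(dout, x)]
    return dx, dy
```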
Status
- Multi-card training has not been fully tested.
- The acceleration ratio for multiple cards still needs to be profiled.
Plan
Give a total profile after all the optimizations are merged (@chengduoZH)