Closed
Background
project: https://github.com/PaddlePaddle/Paddle/projects/55
Profiling scripts:
- add se resnet 152 profile script dzhwinter/benchmark#84
- add image resnet profile dzhwinter/benchmark#83
Optimization methods and results
- Delete unused GPU memory during training.
- This slows training down slightly because the GPU is an asynchronous device, but it reduces GPU memory usage more than variable reuse does (54.3% reduction vs. 45.5%). [Memory] More memory optimization policy #8690
- Remove program.clone in the Executor. (25% speedup) [Speed] speed up python executor in fluid #8729
- Initialize NCCL once. (5%~6% speedup) [Speed] Avoid init_nccl for every steps. #8758
- Use constant folding at compile time to reduce the number of calls to elementwise_mul ops at optimization time. (5%~10% speedup) optimize optimizer learning rate #8873
- Optimize elementwise-related ops: use our own implementations instead of depending on Eigen. (10x speedup for a single op) [Speed] Optimize elementwise_mul_op gradient functor #8811
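The first optimization above (freeing GPU memory once a variable's last use has passed) can be sketched as a liveness pass over the op list: for each variable, find the last op that reads it and schedule a free right after that op. The op representation and function name below are hypothetical, not Paddle's actual IR:

```python
def insert_delete_points(ops):
    """Given ops as (outputs, inputs) pairs in execution order, return a
    map from op index to the variables that can be freed right after it.
    This is a minimal liveness sketch, not Paddle's real pass."""
    last_use = {}
    for idx, (_outputs, inputs) in enumerate(ops):
        for var in inputs:
            last_use[var] = idx  # later reads overwrite earlier ones
    schedule = {}
    for var, idx in last_use.items():
        schedule.setdefault(idx, []).append(var)
    return schedule
```

Because frees are issued on an async device stream, they add a small synchronization cost, which matches the slowdown noted above.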
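The Executor speedup comes from not re-cloning the program on every run. A minimal sketch of the caching pattern, with hypothetical names (the real fix in #8729 is inside fluid's Python executor):

```python
class Executor:
    """Cache the prepared form of a program keyed by identity, instead of
    cloning it on every run() call. Names are illustrative only."""

    def __init__(self):
        self._cache = {}
        self.clone_count = 0  # tracks how often the expensive path runs

    def _prepare(self, program):
        self.clone_count += 1  # stands in for the costly program.clone
        return dict(program)

    def run(self, program):
        key = id(program)
        if key not in self._cache:
            self._cache[key] = self._prepare(program)
        return self._cache[key]
```

Running the same program repeatedly now pays the preparation cost once, which is where the reported 25% speedup comes from.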
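Initializing NCCL once per process rather than every step is a memoization pattern. A sketch using `functools.lru_cache`; `init_communicator` and its return value are stand-ins, not the actual NCCL API:

```python
import functools


@functools.lru_cache(maxsize=None)
def init_communicator(device_ids):
    """Hypothetical stand-in for NCCL communicator setup. lru_cache
    guarantees the expensive initialization runs only once per unique
    tuple of device ids; later calls return the cached handle."""
    return {"devices": device_ids, "handle": object()}
```

Calling it at every training step is now harmless, since all calls after the first return the same cached object.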
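Constant folding for the learning-rate expression means collapsing multiplications whose operands are all compile-time constants, so fewer elementwise_mul ops run per step. A toy folder over a nested-tuple expression tree (the representation is hypothetical, not fluid's program desc):

```python
def fold_constants(expr):
    """Fold ("mul", a, b) nodes whose children are both numeric
    constants; leave symbolic leaves (e.g. "grad") untouched."""
    if not isinstance(expr, tuple):
        return expr  # a constant or a symbolic leaf
    op, a, b = expr
    a, b = fold_constants(a), fold_constants(b)
    if op == "mul" and isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a * b  # evaluated once at compile time
    return (op, a, b)
```

For example, a decayed learning rate expressed as two constant multiplies collapses into a single scalar, so only one elementwise_mul remains at run time.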
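The elementwise_mul gradient itself is simple, which is why a hand-written implementation can beat a generic Eigen expression: for z = x * y, the gradients are dx = dout * y and dy = dout * x. A plain-Python reference of that functor's math (the real PR implements this as a C++/CUDA functor):

```python
def elementwise_mul_grad(x, y, dout):
    """Reference gradient of z = x * y (elementwise):
    dx = dout * y, dy = dout * x. Lists stand in for tensors."""
    dx = [d * b for d, b in zip(dout, y)]
    dy = [d * a for d, a in zip(dout, x)]
    return dx, dy
```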
Status
- Multi-card training has not been fully tested.
- The acceleration ratio for multiple cards still needs to be profiled.
Plan
Give a total profile after all the optimizations are merged (@chengduoZH)