Dropout optimize & clean broadcast inT and ElementwiseType #52969
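Your PR was submitted successfully. Thank you for contributing to this open-source project!
It would be best to also add performance data for the A100.
Done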
b730437 to 15d2e3c
df56d7e to 6bdbf1c
ed40af6 to 751ce2a
LGTM for CI-OP-Benchmark
…dle#52969)
* change judgement for DropoutGradGPUKernelDriver
* add UnrollerWithoutVecSize and after this Loaddata to be refined
* pass unittest
* use same unroller with XPU
* BroadcastWithInt64Index
* BroadcastDataLoader template partial specialization
* fix compile errs in ROCms
* clean ElementwiseT and InT for BroadcastKernel
* default axis and clean inT
* remove redundant fast divmod computation
* optimize drop_nd & drop_nd_grad
* optimize BroadcastDataLoader bf16 fp16
* rm InT etc. after merge develop
* delete constexpr for windows ci
* fix conflict
* fix conflic with develop
* fix conflic
* new clean
* clean
…to Release/2.5 (#53623)
* Support different dtypes of inputs for broadcast for dropout optimization (#52093)
* change judgement for DropoutGradGPUKernelDriver
* add UnrollerWithoutVecSize and after this Loaddata to be refined
* pass unittest
* use same unroller with XPU
* BroadcastWithInt64Index
* BroadcastDataLoader template partial specialization
* fix compile errs in ROCms
* PR comment
* dropout_nd_optimization (#51479)
* with printf
* add DropOutNdForwardKernel
* PR comment
* Dropout optimize & clean broadcast inT and ElementwiseType (#52969)
* change judgement for DropoutGradGPUKernelDriver
* add UnrollerWithoutVecSize and after this Loaddata to be refined
* pass unittest
* use same unroller with XPU
* BroadcastWithInt64Index
* BroadcastDataLoader template partial specialization
* fix compile errs in ROCms
* clean ElementwiseT and InT for BroadcastKernel
* default axis and clean inT
* remove redundant fast divmod computation
* optimize drop_nd & drop_nd_grad
* optimize BroadcastDataLoader bf16 fp16
* rm InT etc. after merge develop
* delete constexpr for windows ci
* fix conflict
* fix conflic with develop
* fix conflic
* new clean
* clean
* Fix xpu2 kp compile error (#53548)
* fix conflict
* conflict
PR types
Performance optimization
PR changes
OPs
Description
This PR is the follow-up part of #52093
- Clean ElementwiseType and InT, and make the default axis = -1 in function BroadcastKernel.
- Optimize BroadcastDataLoader; the results follow.
- Optimize dropout, drop_nd, and drop_nd_grad; the results for drop_nd_grad follow.

Test broadcast performance with test_ternary_broadcast.cu on V100 16G, CUDA 11.2, unit: ms
A100 40G results
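To illustrate what a default of axis = -1 means for a broadcast, here is a minimal C++ sketch. It is not the code from this PR: AlignShape and BroadcastOffset are hypothetical helpers written for illustration, and the sketch assumes the usual Paddle elementwise convention that axis = -1 resolves to rank(out) - rank(in), i.e. the smaller shape is aligned to the trailing dimensions. The second helper shows the per-dimension div/mod that maps an output index back to an input offset, which is the kind of index computation a broadcast data loader has to perform.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not Paddle's API): pad `shape` with 1s up to `out_rank`.
// Assumption: axis == -1 resolves to out_rank - shape.size(), i.e. the input
// is aligned to the trailing dimensions of the output.
std::vector<int64_t> AlignShape(const std::vector<int64_t>& shape,
                                int out_rank, int axis = -1) {
  const int rank = static_cast<int>(shape.size());
  if (axis == -1) axis = out_rank - rank;
  if (axis < 0 || axis + rank > out_rank) {
    throw std::invalid_argument("shape does not fit at the requested axis");
  }
  std::vector<int64_t> aligned(out_rank, 1);
  for (int i = 0; i < rank; ++i) aligned[axis + i] = shape[i];
  return aligned;
}

// Hypothetical helper: map a linear index in the output tensor to a linear
// offset in an aligned input. Dimensions of size 1 in the input are broadcast,
// so their coordinate does not contribute to the offset.
int64_t BroadcastOffset(int64_t out_index,
                        const std::vector<int64_t>& out_dims,
                        const std::vector<int64_t>& in_dims) {
  int64_t offset = 0;
  int64_t in_stride = 1;
  for (int d = static_cast<int>(out_dims.size()) - 1; d >= 0; --d) {
    const int64_t coord = out_index % out_dims[d];  // coordinate along dim d
    out_index /= out_dims[d];
    if (in_dims[d] != 1) offset += coord * in_stride;
    in_stride *= in_dims[d];
  }
  return offset;
}
```

With an output of shape [2, 3, 4], an input of shape [4] is aligned to [1, 1, 4] under axis = -1, whereas an input of shape [3] with an explicit axis = 1 is aligned to [1, 3, 1].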
Test dropout_nd_grad performance on V100 16G cuda11.2 unit(ms)
Configs from https://github.com/PaddlePaddle/benchmark/pull/1673/files
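For readers unfamiliar with what a drop_nd_grad kernel computes, the following is a minimal CPU sketch in C++, not the GPU kernel from this PR. It assumes upscale_in_train semantics (the gradient is scaled by 1 / (1 - p)) and a 2-D input whose mask has shape [1, cols] and is broadcast over rows. The function name DropoutNdGrad2D and its parameters are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference (not Paddle's kernel): gradient of an N-d dropout on
// a [rows, cols] tensor whose mask has shape [1, cols] and is broadcast over
// the row dimension.
// Assumption: upscale_in_train mode, so dx = dout * mask / (1 - p).
std::vector<float> DropoutNdGrad2D(const std::vector<float>& dout,
                                   const std::vector<uint8_t>& mask,
                                   std::size_t rows, std::size_t cols,
                                   float dropout_prob) {
  std::vector<float> dx(rows * cols, 0.0f);
  // With p == 1 every element was dropped, so the gradient is all zeros.
  if (dropout_prob >= 1.0f) return dx;
  const float scale = 1.0f / (1.0f - dropout_prob);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      // The mask index ignores the row: that is the broadcast.
      dx[r * cols + c] = dout[r * cols + c] * mask[c] * scale;
    }
  }
  return dx;
}
```

Because the mask is smaller than the gradient tensor, the backward pass is a broadcast multiply rather than a plain elementwise one, which is why the dropout and broadcast changes appear together in this PR.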