Conversation

zhangboSJTU (Contributor) commented Apr 16, 2023

PR types

Performance optimization

PR changes

OPs

Description

This PR is the follow-up to #52093:

  • Thoroughly clean up ElementwiseType.
  • Clean up InT and make the default axis = -1 in BroadcastKernel.
  • Remove redundant fast divmod computation to optimize BroadcastDataLoader; results below.
  • Optimize dropout, drop_nd, and drop_nd_grad; results for drop_nd_grad below.

Broadcast performance tested with test_ternary_broadcast.cu on V100 16G, CUDA 11.2, unit: ms
| shape | dtype | Before PR | PR 52093 | PR 52093 perf | This PR | This PR perf |
|---|---|---|---|---|---|---|
| [1, 2048, 3584] | fp32 | 701.9777 | 695.8409 | 0.87% | 701.38 | 0.08% |
| [1, 2048, 3584] | bf16 | 579.3792 | 733.4986 | -26.60% | 580.93 | -0.27% |
| [1, 2048, 3584] | fp16 | 351.1904 | 521.2488 | -48.42% | 349.59 | 0.46% |
| [1, 256, 4, 256, 256] | fp32 | 77.7347 | 80.6498 | -3.75% | 77.54 | 0.26% |
| [1, 256, 4, 256, 256] | bf16 | 62.0256 | 77.7761 | -25.39% | 62.27 | -0.39% |
| [1, 256, 4, 256, 256] | fp16 | 43.1551 | 55.0177 | -27.49% | 43.54 | -0.90% |
| [1, 256, 256] | fp32 | 5.7952 | 6.0928 | -5.14% | 5.78 | 0.28% |
| [1, 256, 256] | bf16 | 5.4688 | 5.84 | -6.79% | 5.41 | 1.00% |
| [1, 256, 256] | fp16 | 5.1488 | 5.4718 | -6.27% | 5.11 | 0.68% |

A100 40G results (unit: ms)

| shape | dtype | Before PR | This PR | This PR perf |
|---|---|---|---|---|
| [1, 2048, 3584] | fp32 | 401.52 | 401.94 | 0.11% |
| [1, 2048, 3584] | bf16 | 337.47 | 337.43 | -0.01% |
| [1, 2048, 3584] | fp16 | 311.33 | 311.34 | 0.00% |
| [1, 256, 4, 256, 256] | fp32 | 50.54 | 50.40 | -0.29% |
| [1, 256, 4, 256, 256] | bf16 | 37.90 | 37.78 | -0.34% |
| [1, 256, 4, 256, 256] | fp16 | 35.02 | 34.88 | -0.41% |
| [1, 256, 256] | fp32 | 6.85 | 6.93 | 1.17% |
| [1, 256, 256] | bf16 | 6.69 | 6.75 | 0.91% |
| [1, 256, 256] | fp16 | 6.62 | 6.72 | 1.40% |
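For context, the "redundant fast divmod computation" the description mentions is the magic-number division trick commonly used in GPU index decomposition: a divisor's multiplier and shift are precomputed once on the host, so each per-element index split costs only a multiply and a shift instead of a hardware divide; the optimization is avoiding recomputing this per load. The following is a simplified, illustrative host-side sketch of the general technique, not Paddle's actual `FastDivMod` implementation:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative fast divide-and-modulo by a runtime constant.
// multiplier/shift are computed once per divisor; Div then needs only a
// high multiply and a shift. Valid here for the small index ranges the
// test exercises (no claim about the full 32-bit range).
struct FastDivMod {
  uint32_t divisor;
  uint32_t multiplier;
  uint32_t shift;

  explicit FastDivMod(uint32_t d) : divisor(d) {
    // shift = ceil(log2(d))
    for (shift = 0; shift < 32; ++shift) {
      if ((1u << shift) >= d) break;
    }
    // Round-up magic multiplier so (high_mul(n, multiplier) + n) >> shift
    // equals n / d.
    uint64_t one = 1;
    uint64_t m = ((one << 32) * ((one << shift) - d)) / d + 1;
    multiplier = static_cast<uint32_t>(m);
  }

  uint32_t Div(uint32_t n) const {
    // On the GPU this high 32 bits of a 32x32 multiply is one instruction
    // (e.g. __umulhi on CUDA).
    uint32_t t = static_cast<uint32_t>(
        (static_cast<uint64_t>(n) * multiplier) >> 32);
    return (t + n) >> shift;
  }

  uint32_t Mod(uint32_t n) const { return n - Div(n) * divisor; }
};
```

Broadcast loaders use pairs of such Div/Mod calls to turn a flat thread index into per-input offsets, which is why hoisting or removing repeated divmod setup shows up directly in load throughput.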

Dropout_nd_grad performance tested on V100 16G, CUDA 11.2, unit: ms.
Configs from https://github.com/PaddlePaddle/benchmark/pull/1673/files

| shape | dtype | axis | p | develop | PR | PR perf |
|---|---|---|---|---|---|---|
| [16, 22] | fp32 | [1] | 0.5 | 7.25 | 3.86 | 46.74% |
| [16, 16, 16, 16] | fp32 | [0, 1] | 0.5 | 7.75 | 4.08 | 47.36% |
| [32, 128, 768] | fp32 | [0] | 0.1 | 49.01 | 34.97 | 28.66% |
| [16, 22] | fp16 | [1] | 0.5 | 7.30 | 4.14 | 43.26% |
| [16, 16, 16, 16] | fp16 | [0, 1] | 0.5 | 7.65 | 4.31 | 43.66% |
| [32, 128, 768] | fp16 | [0] | 0.1 | 35.62 | 20.98 | 41.10% |
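As background for the benchmark above: in "upscale_in_train" dropout, each element is kept with probability 1 - p and rescaled by 1 / (1 - p) so the expected value is unchanged, and the backward pass reuses the saved mask (dropout_nd additionally generates the mask on a reduced shape and broadcasts it, which is why it exercises the broadcast loader). The sketch below is a plain host-side reference for the per-element semantics only, not the fused GPU kernels this PR optimizes:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Reference dropout forward: keep each element with probability 1 - p,
// rescale kept elements by 1 / (1 - p), and record the keep mask.
std::vector<float> DropoutForward(const std::vector<float>& x, float p,
                                  std::vector<uint8_t>* mask, uint32_t seed) {
  std::mt19937 rng(seed);
  std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
  std::vector<float> out(x.size());
  mask->resize(x.size());
  const float scale = 1.0f / (1.0f - p);
  for (size_t i = 0; i < x.size(); ++i) {
    (*mask)[i] = uniform(rng) >= p ? 1 : 0;
    out[i] = x[i] * static_cast<float>((*mask)[i]) * scale;
  }
  return out;
}

// Reference dropout backward: the gradient flows only through kept
// elements, with the same 1 / (1 - p) rescaling as the forward pass.
std::vector<float> DropoutGrad(const std::vector<float>& dout, float p,
                               const std::vector<uint8_t>& mask) {
  std::vector<float> dx(dout.size());
  const float scale = 1.0f / (1.0f - p);
  for (size_t i = 0; i < dout.size(); ++i) {
    dx[i] = dout[i] * static_cast<float>(mask[i]) * scale;
  }
  return dx;
}
```

Because the backward pass is just `dout * mask * scale`, its cost is dominated by how efficiently the mask and gradient are loaded, which is where the BroadcastDataLoader changes pay off.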

paddle-bot bot commented Apr 16, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@zhangboSJTU zhangboSJTU requested review from JamesLim-sy, YuanRisheng and phlrain and removed request for JamesLim-sy and YuanRisheng April 16, 2023 15:17
@zhangboSJTU zhangboSJTU changed the title Dropout opt clean bc in t Dropout optimize & clean broadcast inT and ElementwiseType Apr 16, 2023
@lanxianghit (Contributor)

It would be best to also add A100 performance data.

@zhangboSJTU (Contributor, Author) commented Apr 18, 2023

> It would be best to also add A100 performance data.

Done

shaojiewang previously approved these changes Apr 27, 2023
@zhangboSJTU zhangboSJTU force-pushed the dropout_opt_clean_BcInT branch from df56d7e to 6bdbf1c Compare April 27, 2023 19:02
@zhangboSJTU zhangboSJTU force-pushed the dropout_opt_clean_BcInT branch from ed40af6 to 751ce2a Compare April 28, 2023 02:36
@ZzSean (Contributor) left a comment

LGTM for CI-OP-Benchmark

@raindrops2sea raindrops2sea merged commit d611e48 into PaddlePaddle:develop Apr 28, 2023
zhangboSJTU added a commit to zhangboSJTU/Paddle that referenced this pull request May 9, 2023
…dle#52969)

* change judgement for DropoutGradGPUKernelDriver

* add UnrollerWithoutVecSize and after this Loaddata to be refined

* pass unittest

* use same unroller with XPU

* BroadcastWithInt64Index

* BroadcastDataLoader template partial specialization

* fix compile errs in ROCms

* clean ElementwiseT and InT for BroadcastKernel

* default axis and clean inT

* remove redundant fast divmod computation

* optimize drop_nd & drop_nd_grad

* optimize BroadcastDataLoader bf16 fp16

* rm InT etc. after merge develop

* delete constexpr for windows ci

* fix conflict

* fix conflic with develop

* fix conflic

* new clean

* clean

XiaoguangHu01 pushed a commit that referenced this pull request May 10, 2023
…to Release/2.5 (#53623)

* Support different dtypes of inputs for broadcast for dropout optimization  (#52093)

* change judgement for DropoutGradGPUKernelDriver

* add UnrollerWithoutVecSize and after this Loaddata to be refined

* pass unittest

* use same unroller with XPU

* BroadcastWithInt64Index

* BroadcastDataLoader template partial specialization

* fix compile errs in ROCms

* PR comment

* dropout_nd_optimization (#51479)

* with printf

* add DropOutNdForwardKernel

* PR comment

* Dropout optimize & clean broadcast inT and ElementwiseType (#52969)

* change judgement for DropoutGradGPUKernelDriver

* add UnrollerWithoutVecSize and after this Loaddata to be refined

* pass unittest

* use same unroller with XPU

* BroadcastWithInt64Index

* BroadcastDataLoader template partial specialization

* fix compile errs in ROCms

* clean ElementwiseT and InT for BroadcastKernel

* default axis and clean inT

* remove redundant fast divmod computation

* optimize drop_nd & drop_nd_grad

* optimize BroadcastDataLoader bf16 fp16

* rm InT etc. after merge develop

* delete constexpr for windows ci

* fix conflict

* fix conflic with develop

* fix conflic

* new clean

* clean

* Fix xpu2 kp compile error (#53548)

* fix conflict

* conflict
@zhangboSJTU zhangboSJTU removed the request for review from JamesLim-sy September 19, 2023 04:11