[PaddlePaddle Hackathon 5 No.48] StridedCopyKernel operator GPU performance optimization - part1 #58033

Merged
merged 2 commits into from
Oct 16, 2023


WintersMontagne10335
Contributor

@WintersMontagne10335 WintersMontagne10335 commented Oct 12, 2023

PR types

Performance optimization

PR changes

OPs

Description

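Currently, both the ContiguousKernel and StridedCopyKernel kernels compute data offset addresses from the numel index, which requires a for loop. Computing the offsets this way is inefficient and leads to poor kernel performance.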
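
To make the cost concrete, here is a minimal host-side C++ sketch of the kind of loop the description refers to. It is not Paddle's actual kernel code: `OffsetFromFlatIndex` and its parameters are hypothetical names chosen for illustration, and the loop only models the index arithmetic a single GPU thread would perform.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Paddle's code): map a flat element index to a
// strided memory offset by peeling off one coordinate per dimension.
int64_t OffsetFromFlatIndex(int64_t index,
                            const std::vector<int64_t>& dims,
                            const std::vector<int64_t>& strides) {
  assert(dims.size() == strides.size());
  int64_t offset = 0;
  int64_t index_tmp = index;
  // Walk from the innermost dimension outward. Every step costs one modulo
  // and one division, and it needs index_tmp from the previous step.
  for (int d = static_cast<int>(dims.size()) - 1; d >= 0; --d) {
    offset += (index_tmp % dims[d]) * strides[d];
    index_tmp /= dims[d];
  }
  return offset;
}
```

For a rank-9 tensor this is nine divisions and nine modulo operations per element, executed one after another.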

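  • Development environment:
  1. Device: Tesla V100
  2. Environment: CUDA 10.2
  • Optimization methods:
  1. Use the thread configuration to reduce division and modulo operations. Paddle currently computes data offset addresses from the numel index, which involves many division and modulo operations. The six Grid and Block parameters of the thread configuration (supported in hardware) can be used to reduce these operations.
  2. Remove dependencies. Because kMaxRank is 9, part of the offset address still has to be computed when the rank is greater than 6 or when the block configuration requirement (block.x * block.y * block.z <= 1024 && block.z <= 64) is not met. The current Paddle implementation has a loop-carried dependency: the next iteration can only start after the current iteration has finished computing index_tmp. Logically, however, the computations in the individual iterations are independent and can run in parallel. Combined with the previous method, this can substantially shorten the run time.
  3. Precompute memory access offsets (optional optimization). This borrows the idea of precomputing memory access offsets from MegEngine's convolution operator.
  4. Change the memory access pattern (optional optimization). 《CUDA_C优化详解》 states that "non-unit-stride global memory accesses should be avoided whenever possible"; for tensors with special strides, the memory access can be optimized.
  • Related PR
  1. [PaddlePaddle Hackathon 5 No.48] ContiguousKernel and StridedCopyKernel operator CPU and GPU performance optimization - part #57835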
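
The first two optimization methods can be sketched on the host in plain C++. This is an illustration under stated assumptions, not the code merged in this PR: `Coord3` is a hypothetical stand-in for CUDA's `blockIdx` / `threadIdx`, the mapping of tensor dimensions to launch coordinates is assumed, and both function names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for CUDA's built-in blockIdx / threadIdx triples.
struct Coord3 {
  int64_t x, y, z;
};

// Method 1 (sketch): with rank <= 6, every coordinate is read straight from
// the launch configuration, so no division or modulo is needed.
// Assumed mapping: dims 0..2 <- block.z/y/x, dims 3..5 <- thread.z/y/x.
int64_t OffsetFromLaunchCoords(const Coord3& block, const Coord3& thread,
                               const std::vector<int64_t>& strides) {
  assert(strides.size() == 6);
  const int64_t coords[6] = {block.z, block.y, block.x,
                             thread.z, thread.y, thread.x};
  int64_t offset = 0;
  for (int d = 0; d < 6; ++d) {
    offset += coords[d] * strides[d];
  }
  return offset;
}

// Method 2 (sketch): for dimensions that still need decoding, derive each
// coordinate from the original index and a precomputed divisor, so no
// iteration waits on an index_tmp produced by the previous one.
int64_t OffsetWithoutCarriedDependency(int64_t index,
                                       const std::vector<int64_t>& dims,
                                       const std::vector<int64_t>& strides) {
  assert(dims.size() == strides.size());
  const int rank = static_cast<int>(dims.size());
  // divisor[d] is the product of all dims to the right of d; in a real
  // kernel this could be computed once on the host.
  std::vector<int64_t> divisor(rank, 1);
  for (int d = rank - 2; d >= 0; --d) {
    divisor[d] = divisor[d + 1] * dims[d + 1];
  }
  int64_t offset = 0;
  for (int d = 0; d < rank; ++d) {
    offset += ((index / divisor[d]) % dims[d]) * strides[d];
  }
  return offset;
}
```

The second function performs the same number of divisions as a chained loop, but each term depends only on `index`, which leaves the compiler and hardware free to overlap them.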

@paddle-bot

paddle-bot bot commented Oct 12, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

Contributor

@wanghuancoder wanghuancoder left a comment

LGTM

@WintersMontagne10335
Contributor Author

@wanghuancoder How do I call this operator?

@wanghuancoder
Contributor

StridedCopy

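Oh! I took a look. This kernel is currently only used internally, so there is no way to call it on its own to test its performance. Since the algorithm is the same as in Contiguous, I will merge it directly.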

@wanghuancoder wanghuancoder merged commit a9e4b68 into PaddlePaddle:develop Oct 16, 2023
@luotao1 luotao1 changed the title [PaddlePaddle Hackathon 5 No.48] StridedCopyKernel operator GPU performance optimization [PaddlePaddle Hackathon 5 No.48] StridedCopyKernel operator GPU performance optimization - part Oct 16, 2023
@luotao1 luotao1 changed the title [PaddlePaddle Hackathon 5 No.48] StridedCopyKernel operator GPU performance optimization - part [PaddlePaddle Hackathon 5 No.48] StridedCopyKernel operator GPU performance optimization - part1 Oct 16, 2023
wanghuancoder added a commit that referenced this pull request Oct 19, 2023
@wanghuancoder
Contributor

This PR has a problem that causes the PaddleDetection develop branch to crash:

python tools/train.py -c configs/rtdetr/rtdetr_hgnetv2_l_6x_coco.yml -o worker_num=16 LearningRate.base_lr=0.0001 log_iter=1 use_gpu=True save_dir=./test_tipc/output/rtdetr_hgnetv2_l_6x_coco/benchmark_train/norm_train_gpus_5_autocast_fp32 epoch=1 pretrain_weights=https://bj.bcebos.com/v1/paddledet/models/rtdetr_hgnetv2_l_6x_coco.pdparams TrainReader.batch_size=16 filename=rtdetr_hgnetv2_l_6x_coco TrainReader.shuffle=False --enable_ce=True
Could you please take a look?

@WintersMontagne10335
Contributor Author

@wanghuancoder Got it.

@WintersMontagne10335
Contributor Author

@wanghuancoder I will have it fixed by this weekend.

wanghuancoder added a commit that referenced this pull request Oct 20, 2023
@WintersMontagne10335
Contributor Author

@wanghuancoder After building and installing PaddleDetection, I ran the command you gave:

python tools/train.py -c configs/rtdetr/rtdetr_hgnetv2_l_6x_coco.yml -o worker_num=16 LearningRate.base_lr=0.0001 log_iter=1 use_gpu=True save_dir=./test_tipc/output/rtdetr_hgnetv2_l_6x_coco/benchmark_train/norm_train_gpus_5_autocast_fp32 epoch=1 pretrain_weights=https://bj.bcebos.com/v1/paddledet/models/rtdetr_hgnetv2_l_6x_coco.pdparams TrainReader.batch_size=16 filename=rtdetr_hgnetv2_l_6x_coco TrainReader.shuffle=False --enable_ce=True

It reports:
image

hitywt pushed a commit to hitywt/Paddle that referenced this pull request Oct 24, 2023
jiahy0825 pushed a commit to jiahy0825/Paddle that referenced this pull request Oct 26, 2023
jiahy0825 pushed a commit to jiahy0825/Paddle that referenced this pull request Oct 26, 2023
@WintersMontagne10335
Contributor Author

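@wanghuancoder Hello. The StridedCopyKernel problem you mentioned, which crashes the PaddleDetection develop branch, is most likely caused by the thread configuration parameters going out of bounds. I have a preliminary fix, but it still needs to be tested against that failure. I cannot run the test command you gave directly; it may depend on a specific environment. Feel free to contact me if you need anything.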
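
As an illustration of what an out-of-bounds thread configuration means, the sketch below checks a launch shape against the limits CUDA documents for devices of compute capability 3.0 and later (which includes the Tesla V100). It is not the fix applied to Paddle: `Dim3` is a hypothetical stand-in for CUDA's `dim3`, and `LaunchConfigFits` is an invented name. A kernel that maps tensor dimensions directly onto grid and block dimensions would need a check of this kind, with a fallback path when it fails.

```cpp
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3 launch-shape type.
struct Dim3 {
  int64_t x, y, z;
};

// Returns true when the grid and block shapes are within the documented
// per-dimension and per-block limits; otherwise the launch would fail.
bool LaunchConfigFits(const Dim3& grid, const Dim3& block) {
  const bool positive = grid.x > 0 && grid.y > 0 && grid.z > 0 &&
                        block.x > 0 && block.y > 0 && block.z > 0;
  // Block limits: x and y up to 1024, z up to 64, at most 1024 threads total.
  const bool block_ok = block.x <= 1024 && block.y <= 1024 && block.z <= 64 &&
                        block.x * block.y * block.z <= 1024;
  // Grid limits: x up to 2^31 - 1, y and z up to 65535.
  const bool grid_ok =
      grid.x <= 2147483647LL && grid.y <= 65535 && grid.z <= 65535;
  return positive && block_ok && grid_ok;
}
```

For example, a tensor dimension of 70000 cannot be mapped onto grid.y or grid.z, and a dimension of 128 cannot be mapped onto block.z, even though both are perfectly valid tensor shapes.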

danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023
3 participants