Conversation

@zhangboSJTU (Contributor) commented Mar 10, 2023

PR types

Performance optimization

PR changes

OPs

Describe

Dropout_nd performance optimization

  • CUDA 11.7, V100

| dims | p | axis | dtype | PR | dev | acc ratio |
|------|---|------|-------|----|-----|-----------|
| [20, 30, 50] | 0.5 | [1,2] | fp32 | 11.8 | 17.8 | 51.36% |
| [200, 30, 500] | 0.5 | [1,2] | fp32 | 47.7 | 59.9 | 25.63% |
| [2, 300, 50] | 0.5 | [1,2] | fp32 | 7.3 | 17.9 | 146.56% |
| [32, 1024, 1024] | 0.5 | [1] | fp32 | 422.3 | 496.8 | 17.64% |
| [20, 30, 50] | 0.5 | [1,2] | fp16 | 8.2 | 18.2 | 219.30% |
| [200, 30, 500] | 0.5 | [1,2] | fp16 | 37.2 | 45.4 | 22.04% |
| [2, 300, 50] | 0.5 | [1,2] | fp16 | 5.7 | 18.8 | 129.27% |
| [32, 1024, 1024] | 0.5 | [1] | fp16 | 353.6 | 388.0 | 9.73% |
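One way to read the table (an inference; the PR does not define the columns): PR and dev are the measured kernel times, and for the fp32 rows the acc ratio column matches the relative speedup (dev − PR) / PR up to rounding of the reported times. A tiny helper reproducing that reading:

```cpp
#include <cassert>
#include <cmath>

// Relative speedup in percent, assuming `acc ratio` = (dev - PR) / PR.
// This interpretation is inferred from the fp32 rows of the table above,
// not stated anywhere in the PR.
double SpeedupPct(double pr_time, double dev_time) {
  return (dev_time - pr_time) / pr_time * 100.0;
}
```

For example, the [32, 1024, 1024] fp32 row gives SpeedupPct(422.3, 496.8) ≈ 17.64, matching the reported 17.64%.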

@paddle-bot

paddle-bot bot commented Mar 10, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@JamesLim-sy (Contributor) left a comment

For code in non-critical spots, be cautious: do not change it unless the change is necessary.

```cpp
T dst_mask[kCount];  // 0 ~ kCount - 1: dst, kCount ~ 2 * kCount - 1: mask
float rands[kCount];
MaskType mask_result[kCount];
uint8_t mask_result[kCount];
```
Contributor:

MaskType is already uint8_t and is passed in via the template parameter; there is no need to replace it with uint8_t here.
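A minimal sketch of the reviewer's point (hypothetical names, not Paddle's actual kernel code): declaring the mask buffer against the MaskType template parameter keeps the kernel generic, so instantiating it with uint8_t today does not hard-code that width:

```cpp
#include <cassert>
#include <cstdint>

// The mask buffer is declared against the template parameter, so the
// current uint8_t instantiation does not preclude a wider mask type
// later -- the point of keeping `typename MaskType` in the signature.
template <typename T, typename MaskType, int kCount>
struct MaskBuffer {
  T dst_mask[2 * kCount];  // 0 ~ kCount-1: dst, kCount ~ 2*kCount-1: mask
  float rands[kCount];
  MaskType mask_result[kCount];
};
```

Both `MaskBuffer<float, uint8_t, 4>` and, say, `MaskBuffer<float, uint32_t, 4>` then compile without touching the struct.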

Contributor (author):

Same as above.

```cpp
uint32_t offset = 0u;
uint32_t idx = i;
// Use the (j < phi::DDim::kMaxRank) condition rather than
// (j < broadcast_config.rank) so that #pragma unroll can take effect.
```
Contributor:

#pragma unroll is in the wrong place; it should sit immediately above the for-loop.
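The two review points above connect: #pragma unroll must immediately precede the loop, and the loop bound must be a compile-time constant (hence j < phi::DDim::kMaxRank rather than the runtime rank). A simplified host-side sketch (hypothetical helper; kMaxRank, strides, and divisors stand in for Paddle's broadcast config):

```cpp
#include <cassert>
#include <cstdint>

constexpr int kMaxRank = 9;  // stand-in for phi::DDim::kMaxRank

// The trip count is the compile-time constant kMaxRank, not the runtime
// rank, so the unroll pragma placed directly above the loop can apply;
// a runtime guard inside the body handles the actual rank.
uint32_t ComputeOffset(uint32_t i, const uint32_t* strides,
                       const uint32_t* divisors, int rank) {
  uint32_t offset = 0u;
  uint32_t idx = i;
#pragma unroll
  for (int j = 0; j < kMaxRank; ++j) {
    if (j >= rank) break;  // runtime bound check inside the unrolled body
    offset += (idx % divisors[j]) * strides[j];
    idx /= divisors[j];
  }
  return offset;
}
```

nvcc honors `#pragma unroll` only when it directly precedes the loop; a comment or statement in between (as happened in this PR) silently defeats it.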

Contributor (author):

The pragma was probably shifted by accident when comments were deleted later; fixed as suggested.

```cpp
template <typename T>
struct DstFunctor {
  using MT = typename phi::kps::details::MPTypeTrait<T>::Type;
  MT factor;
```
Contributor:

Make factor a private member. Also, if HOSTDEVICE inline DstFunctor(const float retain_prob, ... executes on the host, the HOSTDEVICE inline qualifiers are unnecessary.
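A sketch of the suggested shape (assumptions: HOSTDEVICE is stubbed out so this compiles as plain C++, and the functor body is illustrative, not Paddle's actual DstFunctor): the factor is computed once in a host-side constructor and kept private:

```cpp
#include <cassert>
#include <cstdint>

// Stub for Paddle's host/device macro so the sketch builds as plain C++.
#define HOSTDEVICE

template <typename T>
struct DstFunctor {
  // The constructor runs on the host, so it needs no HOSTDEVICE qualifier.
  explicit DstFunctor(const float retain_prob)
      : factor_(static_cast<T>(1.0f / retain_prob)) {}

  // The call operator is what runs per element on the device.
  HOSTDEVICE inline T operator()(const T src, const uint8_t mask) const {
    return src * static_cast<T>(mask) * factor_;
  }

 private:
  T factor_;  // private, per the review suggestion
};
```

With retain_prob = 0.5 the factor is 2, so kept elements are scaled by 2 and dropped elements become 0 (the usual inverted-dropout convention).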

@JamesLim-sy (Contributor) commented Mar 15, 2023

DstFunctor is existing code; when nothing requires adjusting it, do not move it around.

Contributor (author):

The previous arrangement of functors and functions was messy; after deleting the unused functions I reordered them.

Contributor (author):

> Make factor a private member. Also, if HOSTDEVICE inline DstFunctor(const float retain_prob, ... executes on the host, the HOSTDEVICE inline qualifiers are unnecessary.

done

```cpp
  }
}
};
```

@JamesLim-sy (Contributor) commented Mar 15, 2023

Same as above: this is existing code; unless a large-scale change is involved, do not move it.

Contributor (author):

Same as above; I will be careful about this in the future.

@JamesLim-sy (Contributor) commented Mar 15, 2023

Change it back right here, not "in the future".

```cpp
if (rand[i] < retain_prob_) {
  dst[i] = static_cast<T>(1);
} else {
  dst[i] = static_cast<T>(0);
```
Contributor:

```cpp
dst[i] = (rand[i] < retain_prob_) ? static_cast<T>(1) : static_cast<T>(0);
```
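The suggested ternary is semantically identical to the if/else, just more compact; a minimal stand-alone check (hypothetical helper name, extracted from the loop for illustration):

```cpp
#include <cassert>

// Select the dropout keep value with a ternary, matching the reviewer's
// suggested one-liner: 1 if the random draw keeps the element, else 0.
template <typename T>
T KeepValue(float rand, float retain_prob) {
  return (rand < retain_prob) ? static_cast<T>(1) : static_cast<T>(0);
}
```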

Contributor (author):

done

```cpp
}
};

template <typename T>
```
Contributor:

Why was template <typename T1, typename T2 = T1, typename OutT = T1> removed?

Contributor (author):

This function is only used here; there is no need to add unused template parameters.

Contributor:

The original code contains kps::OperatorTernary<T, float, T, DstMaskFunctor<T, float>>; this feels like another unnecessary change.

```cpp
};

template <typename T, typename MaskType>
template <typename T>
```
Contributor:

Keep typename MaskType; with an explicit uint8_t the code loses easy extensibility later.

Contributor (author):

Given this function's specific purpose, and the kernel fusion done after the optimization, it does not need extra template parameters.

Contributor:

Then it has no extensibility; using uint8_t is not recommended.

@JamesLim-sy (Contributor):

PR-CI-ROCM-Compile still fails after two reruns; please find a device and check it.

@zhangboSJTU (Contributor, author):

> PR-CI-ROCM-Compile still fails after two reruns; please find a device and check it.

The error is the same issue as the misplaced #pragma unroll.

@JamesLim-sy (Contributor) left a comment

LGTM. Be cautious with changes that are not necessary; it improves everyone's efficiency.

@JamesLim-sy JamesLim-sy merged commit 65e3fa3 into PaddlePaddle:develop Mar 20, 2023
@zhangboSJTU zhangboSJTU deleted the dropout_optimize branch March 23, 2023 08:38
zhangboSJTU added a commit to zhangboSJTU/Paddle that referenced this pull request May 9, 2023
* with printf

* add DropOutNdForwardKernel

* PR comment
XiaoguangHu01 pushed a commit that referenced this pull request May 10, 2023
…to Release/2.5 (#53623)

* Support different dtypes of inputs for broadcast for dropout optimization  (#52093)

* change judgement for DropoutGradGPUKernelDriver

* add UnrollerWithoutVecSize and after this Loaddata to be refined

* pass unittest

* use same unroller with XPU

* BroadcastWithInt64Index

* BroadcastDataLoader template partial specialization

* fix compile errs in ROCms

* PR comment

* dropout_nd_optimization (#51479)

* with printf

* add DropOutNdForwardKernel

* PR comment

* Dropout optimize & clean broadcast inT and ElementwiseType (#52969)

* change judgement for DropoutGradGPUKernelDriver

* add UnrollerWithoutVecSize and after this Loaddata to be refined

* pass unittest

* use same unroller with XPU

* BroadcastWithInt64Index

* BroadcastDataLoader template partial specialization

* fix compile errs in ROCms

* clean ElementwiseT and InT for BroadcastKernel

* default axis and clean inT

* remove redundant fast divmod computation

* optimize drop_nd & drop_nd_grad

* optimize BroadcastDataLoader bf16 fp16

* rm InT etc. after merge develop

* delete constexpr for windows ci

* fix conflict

* fix conflic with develop

* fix conflic

* new clean

* clean

* Fix xpu2 kp compile error (#53548)

* fix conflict

* conflict