[Executor] Fix builtin_combine_instruction always in device thread #68151


SigureMo
Member

PR Category

Execute Infrastructure

PR Types

Bug fixes

Description

Fix an issue where, when the executor place is GPU, builtin_combine_instruction is always assigned to the device thread.

For example, consider the following program:

```
LoweredProgram( AfterPass ) is :
{
    (%0) = "data(phi_kernel)" () {dtype:(pd_op.DataType)int64,kernel_key:<backend:GPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"data",name:"_jst.0.x.0",op_name:"pd_op.data",origin_id:(Int64)35,place:(pd_op.Place)Place(gpu:0),shape:(pd_op.IntArray)[3],stop_gradient:[true]} : () -> gpu_tensor<3xi64>
    (%1) = "data(phi_kernel)" () {dtype:(pd_op.DataType)int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"data",name:"_jst.1.y.0",op_name:"pd_op.data",origin_id:(Int64)36,place:(pd_op.Place)Place(cpu),shape:(pd_op.IntArray)[],stop_gradient:[true]} : () -> cpu_tensor<i64>
    (%2) = "full(phi_kernel)" () {dtype:(pd_op.DataType)float32,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:float32>,kernel_name:"full",op_name:"pd_op.full",origin_id:(Int64)37,place:(pd_op.Place)Place(cpu),shape:(pd_op.IntArray)[1],stop_gradient:[true],value:(Double)1} : () -> cpu_tensor<1xf32>
    (%3) = "scale(phi_kernel)" (%1, %2) {bias:(Float)1,bias_after_scale:true,kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"scale",op_name:"pd_op.scale",origin_id:(Int64)38,stop_gradient:[true]} : (cpu_tensor<i64>, cpu_tensor<1xf32>) -> cpu_tensor<i64>
    (%4) = "builtin.combine" (%1) {origin_id:(Int64)39,stop_gradient:[true]} : (cpu_tensor<i64>) -> vec[cpu_tensor<i64>]
    (%5) = "stack(phi_kernel)" (%4) {axis:(Int32)0,kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"stack",op_name:"pd_op.stack",origin_id:(Int64)40,stop_gradient:[true]} : (vec[cpu_tensor<i64>]) -> cpu_tensor<1xi64>
    (%6) = "builtin.combine" (%3) {origin_id:(Int64)41,stop_gradient:[true]} : (cpu_tensor<i64>) -> vec[cpu_tensor<i64>]
    (%7) = "stack(phi_kernel)" (%6) {axis:(Int32)0,kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"stack",op_name:"pd_op.stack",origin_id:(Int64)42,stop_gradient:[true]} : (vec[cpu_tensor<i64>]) -> cpu_tensor<1xi64>
    (%8) = "slice(phi_kernel)" (%0, %5, %7) {axes:[(Int64)0],decrease_axis:[(Int64)0],infer_flags:[(Int64)-1],kernel_key:<backend:GPU|layout:NCHW|dtype:int64>,kernel_name:"slice",op_name:"pd_op.slice",origin_id:(Int64)43,stop_gradient:[true]} : (gpu_tensor<3xi64>, cpu_tensor<1xi64>, cpu_tensor<1xi64>) -> gpu_tensor<i64>
    () = "builtin.shadow_output" (%8) {origin_id:(Int64)44,output_name:"output_0"} : (gpu_tensor<i64>) -> 
}
```

The combine op's inputs and outputs are all CPU tensors, yet it was still assigned to the device thread, incurring extra overhead from switching threads in between.

[screenshot: builtin.combine instructions scheduled on the device thread]

After the fix, combine is correctly assigned to the host thread:

[screenshot: builtin.combine instructions scheduled on the host thread]
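The scheduling rule behind the fix can be sketched as follows. This is a hypothetical Python model of the heuristic, not Paddle's actual C++ implementation; the function name `assign_thread` and the `fixed` flag are illustrative only. The idea: an instruction whose input and output tensors all live on the CPU should run on the host thread even when the executor's place is GPU, whereas before the fix builtin.combine simply followed the executor place.

```python
def assign_thread(op_name, tensor_places, executor_place="gpu", fixed=True):
    """Return 'host' or 'device' for one instruction.

    tensor_places: places of all input and output tensors, e.g. ["cpu", "cpu"].
    fixed=False models the pre-fix behavior, where builtin.combine always
    followed the executor place.
    """
    if executor_place == "cpu":
        # A CPU executor has no device thread to begin with.
        return "host"
    if not fixed and op_name == "builtin.combine":
        # Old bug: the tensor places were never consulted.
        return "device"
    if all(p == "cpu" for p in tensor_places):
        # Pure-CPU instruction: run on the host thread and avoid a switch.
        return "host"
    return "device"
```

Under this model, the `builtin.combine` ops in the program above (all-CPU inputs and outputs) land on the host thread after the fix, while `pd_op.slice`, which touches a GPU tensor, stays on the device thread.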

Pcard-67164


paddle-bot bot commented Sep 11, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@SigureMo SigureMo merged commit 6382460 into PaddlePaddle:develop Sep 13, 2024
29 of 30 checks passed
@SigureMo SigureMo deleted the executor/fix-builtin-combine-instr-always-in-device-thread branch September 13, 2024 01:07