[Executor] Fix builtin_combine_instruction always in device thread #68151


SigureMo
Member

PR Category

Execute Infrastructure

PR Types

Bug fixes

Description

Fix an issue where, when the executor place is GPU, builtin_combine_instruction is always assigned to the device thread.

For example, consider the following program:

```
LoweredProgram( AfterPass ) is :
{
    (%0) = "data(phi_kernel)" () {dtype:(pd_op.DataType)int64,kernel_key:<backend:GPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"data",name:"_jst.0.x.0",op_name:"pd_op.data",origin_id:(Int64)35,place:(pd_op.Place)Place(gpu:0),shape:(pd_op.IntArray)[3],stop_gradient:[true]} : () -> gpu_tensor<3xi64>
    (%1) = "data(phi_kernel)" () {dtype:(pd_op.DataType)int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"data",name:"_jst.1.y.0",op_name:"pd_op.data",origin_id:(Int64)36,place:(pd_op.Place)Place(cpu),shape:(pd_op.IntArray)[],stop_gradient:[true]} : () -> cpu_tensor<i64>
    (%2) = "full(phi_kernel)" () {dtype:(pd_op.DataType)float32,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:float32>,kernel_name:"full",op_name:"pd_op.full",origin_id:(Int64)37,place:(pd_op.Place)Place(cpu),shape:(pd_op.IntArray)[1],stop_gradient:[true],value:(Double)1} : () -> cpu_tensor<1xf32>
    (%3) = "scale(phi_kernel)" (%1, %2) {bias:(Float)1,bias_after_scale:true,kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"scale",op_name:"pd_op.scale",origin_id:(Int64)38,stop_gradient:[true]} : (cpu_tensor<i64>, cpu_tensor<1xf32>) -> cpu_tensor<i64>
    (%4) = "builtin.combine" (%1) {origin_id:(Int64)39,stop_gradient:[true]} : (cpu_tensor<i64>) -> vec[cpu_tensor<i64>]
    (%5) = "stack(phi_kernel)" (%4) {axis:(Int32)0,kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"stack",op_name:"pd_op.stack",origin_id:(Int64)40,stop_gradient:[true]} : (vec[cpu_tensor<i64>]) -> cpu_tensor<1xi64>
    (%6) = "builtin.combine" (%3) {origin_id:(Int64)41,stop_gradient:[true]} : (cpu_tensor<i64>) -> vec[cpu_tensor<i64>]
    (%7) = "stack(phi_kernel)" (%6) {axis:(Int32)0,kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"stack",op_name:"pd_op.stack",origin_id:(Int64)42,stop_gradient:[true]} : (vec[cpu_tensor<i64>]) -> cpu_tensor<1xi64>
    (%8) = "slice(phi_kernel)" (%0, %5, %7) {axes:[(Int64)0],decrease_axis:[(Int64)0],infer_flags:[(Int64)-1],kernel_key:<backend:GPU|layout:NCHW|dtype:int64>,kernel_name:"slice",op_name:"pd_op.slice",origin_id:(Int64)43,stop_gradient:[true]} : (gpu_tensor<3xi64>, cpu_tensor<1xi64>, cpu_tensor<1xi64>) -> gpu_tensor<i64>
    () = "builtin.shadow_output" (%8) {origin_id:(Int64)44,output_name:"output_0"} : (gpu_tensor<i64>) -> 
}
```

The combine op's inputs and outputs are all CPU tensors, yet it was still assigned to the device thread, incurring extra overhead from switching threads in between.

[screenshot: builtin.combine instructions scheduled on the device thread]

After the fix, combine is correctly assigned to the host thread:

[screenshot: builtin.combine instructions scheduled on the host thread]
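The scheduling rule behind the fix can be sketched as follows. This is a hypothetical Python model of the heuristic, not Paddle's actual C++ implementation; the function name `assign_thread` and the `fixed` flag are illustrative only. The idea: an instruction whose input and output tensors all live on the CPU should run on the host thread even when the executor's place is GPU, whereas before the fix builtin.combine simply followed the executor place.

```python
def assign_thread(op_name, tensor_places, executor_place="gpu", fixed=True):
    """Return 'host' or 'device' for one instruction.

    tensor_places: places of all input and output tensors, e.g. ["cpu", "cpu"].
    fixed=False models the pre-fix behavior, where builtin.combine always
    followed the executor place.
    """
    if executor_place == "cpu":
        # A CPU executor has no device thread to begin with.
        return "host"
    if not fixed and op_name == "builtin.combine":
        # Old bug: the tensor places were never consulted.
        return "device"
    if all(p == "cpu" for p in tensor_places):
        # Pure-CPU instruction: run on the host thread and avoid a switch.
        return "host"
    return "device"
```

Under this model, the `builtin.combine` ops in the program above (all-CPU inputs and outputs) land on the host thread after the fix, while `pd_op.slice`, which touches a GPU tensor, stays on the device thread.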

Pcard-67164


paddle-bot bot commented Sep 11, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@SigureMo SigureMo merged commit 6382460 into PaddlePaddle:develop Sep 13, 2024
29 of 30 checks passed
@SigureMo SigureMo deleted the executor/fix-builtin-combine-instr-always-in-device-thread branch September 13, 2024 01:07