
[CINN] TileBroadcastTactic NHWC layout broadcast support #71434


Closed
wants to merge 188 commits

Conversation

Contributor

@Enigmatisms Enigmatisms commented Mar 5, 2025

NOTE: the same PR is posted again as #71464, since the current PR has some problems with its git history. The current PR is therefore discarded.

PR Category

CINN

PR Types

Improvements

Description

Implement the NHWC layout support for TileBroadcastTactic introduced in #70092.

Introduction

The original TileBroadcastTactic only handles broadcasts of the following form:

[1, C, 1, 1] => [N, C, H, W]

For NHWC layout broadcasts, CINN falls back to TileFirstGeneralTactic since the last axis is not a broadcast axis. This causes performance issues, such as using a fairly static block size for tensors with different channel counts, which can produce excessive GMEM loads.

The extended TileBroadcastTactic also covers the following case:

[1, 1, 1, C] => [N, H, W, C]

as long as the last axis is a preserved axis (see #70092 for the term definition).

Performance Impact

This tactic extension avoids excessive loads in a simple way: find an appropriate block size K that satisfies all of the following (a selection sketch follows the list):

  • C % K == 0 (K evenly divides C)
  • K is a multiple of 32
  • K should be as close to C as possible.
  • K is within a certain range, for example: [128, 1024].
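
Below is a minimal host-side sketch of this selection, assuming a simple downward scan (an illustration only; the function name, bounds, and search order are assumptions, not the exact CINN implementation):

  #include <cstdint>

  // Sketch: return the largest multiple of 32 in [lo, hi] that evenly
  // divides C, or -1 if no such block size exists (the caller would then
  // fall back to another tactic).
  int64_t SelectBlockSize(int64_t C, int64_t lo = 128, int64_t hi = 1024) {
    for (int64_t K = hi; K >= lo; K -= 32) {  // multiples of 32, largest first
      if (C % K == 0) return K;  // C % K == 0 keeps the load index invariant
    }
    return -1;
  }

For the (64, 56, 56, 192) example below, this scan returns K = 192 == C, which is exactly the invariant-index case.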

The first requirement is the most important one: it eliminates excessive loads by making the load index invariant to the thread-coarsening loop index. Here is an example:

Without this extension, for an input tensor of shape (64, 56, 56, 192), the block size is 256. With thread coarsening, each thread then loads one value from GMEM per loop iteration, totaling 4 loads per tensor that needs broadcasting.

// example: (192,) -> (64, 56, 56, 192) tensor broadcast
for (int32_t thread_loop_i = 0; thread_loop_i < 4; thread_loop_i += 1) {
  // block size 256: the load index depends on thread_loop_i, so every
  // iteration issues a fresh GMEM load (4 loads in total)
  float var_0_local = var_0[((((thread_loop_i * 256) + (int)threadIdx.x) + ((int)blockIdx.x * 1024)) % 192)];
  /* ... */
  to_broad_cast[/* ... */] = some_func(var_0_local, /* ... */);
}

With this extension, we only need to load once, since the load index is invariant to the loop index, effectively reducing the number of loads required:

// example: (192,) -> (64, 56, 56, 192) tensor broadcast
for (int32_t thread_loop_i = 0; thread_loop_i < 4; thread_loop_i += 1) {
  // block size 192 (== C): the load index is invariant to thread_loop_i,
  // so the value is effectively loaded once and reused across iterations
  float var_0_local = var_0[threadIdx.x];
  /* ... */
  to_broad_cast[/* ... */] = some_func(var_0_local, /* ... */);
}

Limitations

  • For tensors with large C, the block size cannot be too large (otherwise occupancy suffers), so the load index will not be invariant and the loaded data will not be reused. This extension offers a simple (but imperfect) mitigation: disabling thread coarsening to reduce register pressure, which in turn increases occupancy (see the sketch after this list). A full solution must address the register requirement itself.
  • This extension supports broadcasts of any shape as long as the broadcast form is [B, P], for example (1, H, 1, C) -> (N, H, W, C). However, such cases have not been tested locally and may be erroneous.
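
As a rough illustration of the no-coarsening mitigation mentioned in the first bullet (a hypothetical sketch based on the description above, not the actual generated code), the per-thread body collapses to a single load and store, so no registers are needed to carry values across coarsening iterations:

  // hypothetical kernel body with thread coarsening disabled: one element
  // per thread; the GMEM value is consumed immediately, keeping register
  // pressure low and occupancy high for large C
  int32_t idx = ((int)blockIdx.x * (int)blockDim.x) + (int)threadIdx.x;
  float var_0_local = var_0[idx % C];  // C is the channel count
  to_broad_cast[/* ... */] = some_func(var_0_local, /* ... */);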

Experiment Results

Tested on the BatchNorm op (data_layout="NHWC"), forward broadcast kernel, on a V100:

shape            max bandwidth (%)   runtime (us)   bandwidth improvement (%)   runtime change (%)
256,64,64,192    92.26               1940           35.56                       -26.24
128,72,72,224    92.33               1430           64.43                       -39.14
256,48,48,224    92.31               1270           67.29                       -40.37
400,72,72,384    92.30               7680           62.10                       -38.13

Pcard-89620


paddle-bot bot commented Mar 5, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.


CLAassistant commented Mar 6, 2025

CLA assistant check
All committers have signed the CLA.

@Enigmatisms Enigmatisms closed this Mar 6, 2025