[Operator Mechanism]Fix cuda launch dimension checks for elementwise and fusion kernels by feixi139 · Pull Request #79375 · PaddlePaddle/Paddle

feixi139 · 2026-06-25T09:46:56Z

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

This PR updates CUDA kernel launch configuration in the elementwise kernel path, fusion/gpudnn kernels, and miscellaneous kernel modules including legacy, stride, impl, sparse, and math helper kernels.
The changes add explicit UINT32_MAX and device-limit checks for CUDA grid/block dimensions before kernel launches, and cast the validated launch parameters to uint32_t. This avoids implicit narrowing when int64_t or size_t values are passed as CUDA launch dimensions.
This PR is split from #79333 by directory/module scope to reduce review size. The changes are limited to CUDA launch configuration handling and do not intend to change operator algorithms or numerical behavior.
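The check-then-cast pattern described above can be sketched in host-only C++. This is a minimal illustration, not the code in this PR: `CheckedLaunchDim` is a hypothetical helper name, and it throws standard exceptions where the PR uses Paddle's own enforcement macros. The device limit is passed in as a plain argument rather than queried from a device context.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Paddle): validates a 64-bit launch
// dimension against the uint32_t range and a caller-supplied device limit,
// then narrows it explicitly.
inline uint32_t CheckedLaunchDim(int64_t value, int64_t device_limit,
                                 const char* name) {
  if (value <= 0) {
    throw std::invalid_argument(std::string(name) + " must be positive");
  }
  if (value > static_cast<int64_t>(std::numeric_limits<uint32_t>::max())) {
    throw std::out_of_range(std::string(name) + " exceeds UINT32_MAX");
  }
  if (value > device_limit) {
    throw std::out_of_range(std::string(name) + " exceeds the device limit");
  }
  // Safe: the value is known to fit in 32 bits at this point.
  return static_cast<uint32_t>(value);
}
```

Without such a check, a value like `(int64_t{1} << 32) + 1` narrows to `1` when converted to `uint32_t`, so the kernel would launch with far fewer blocks than intended instead of failing visibly.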

Does this change numerical precision?

No

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-25 18:14:58

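📋 Review Summary

PR overview: Adds device-limit and uint32_t range checks to the CUDA kernel launch parameters of elementwise, fusion, gpudnn, and other kernels.
Scope of changes: Kernel launch configuration for GPU, GPUDNN, fusion, legacy, and other kernels under paddle/phi/kernels/.
Impact tag: [Operator Mechanism]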

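Issues

Severity	File	Summary
🔴 Bug	`paddle/phi/kernels/funcs/elementwise_grad_base.h:1875`	The broadcast elementwise grad path unconditionally validates a gradient direction that was not requested, so a call that asks for only one side's gradient can fail early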

📝 PR Convention Check

The title is missing a space after the tag.

Suggested title (can be copied directly):

[Operator Mechanism] Fix cuda launch dimension checks for elementwise and fusion kernels

The PR description follows the template, and the precision-change field (是否引起精度变化) is filled in as No (否).

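Overall assessment

The overall direction matches the goal of this PR, but the range checks added to CommonGradBroadcastCUDA need to be pushed down into the branches that actually launch a kernel. Otherwise the gradient direction that was not requested is also subject to a hard check, which introduces a behavior regression for large-shape backward passes that request only one side's gradient.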

PaddlePaddle-bot · 2026-06-25T10:15:54Z

  int x_block_size =
      std::min(static_cast<int64_t>(ELEMWISE_MAX_BLOCK_DIM), x_threads);
+  uint32_t max_grid_dim = dev_ctx.GetCUDAMaxGridDimSize()[0];
+  PADDLE_ENFORCE_LE(x_blocks,


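🔴 Bug: x_blocks and y_blocks are both validated unconditionally here, but the corresponding kernels are launched only inside if (dx) and if (dy) respectively.

ElemwiseGradCompute supports requesting the gradient for only one side, for example dx != nullptr && dy == nullptr. In that case the dy kernel is never launched. With this change, if y_blocks exceeds the device limit, an error is raised here even though only dx is being computed, and the same applies in the opposite direction.

Suggested fix: move the device/uint32_t check for x_blocks inside if (dx) and the check for y_blocks inside if (dy), and keep each check before its own static_cast<uint32_t>(...) and kernel launch.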
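The suggested restructuring can be sketched in host-only C++. This is an illustration under stated assumptions, not the actual CommonGradBroadcastCUDA code: `LaunchPlan`, `CheckedGridDim`, and `PlanGradLaunch` are hypothetical names, the kernel launches are replaced by flags, and a standard exception stands in for PADDLE_ENFORCE_LE.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical record of which kernels would be launched and with what grid.
struct LaunchPlan {
  bool launch_dx = false;
  bool launch_dy = false;
  uint32_t x_grid = 0;
  uint32_t y_grid = 0;
};

// Hypothetical check: since max_grid_dim is itself a uint32_t, passing it
// also guarantees the value fits in uint32_t.
inline uint32_t CheckedGridDim(int64_t blocks, uint32_t max_grid_dim) {
  if (blocks <= 0 || blocks > static_cast<int64_t>(max_grid_dim)) {
    throw std::out_of_range("grid dimension outside the device limit");
  }
  return static_cast<uint32_t>(blocks);
}

// Each block count is validated only in the branch that would launch with it.
inline LaunchPlan PlanGradLaunch(const float* dx, const float* dy,
                                 int64_t x_blocks, int64_t y_blocks,
                                 uint32_t max_grid_dim) {
  LaunchPlan plan;
  if (dx != nullptr) {
    plan.x_grid = CheckedGridDim(x_blocks, max_grid_dim);
    plan.launch_dx = true;  // the dx kernel launch would go here
  }
  if (dy != nullptr) {
    plan.y_grid = CheckedGridDim(y_blocks, max_grid_dim);
    plan.launch_dy = true;  // the dy kernel launch would go here
  }
  return plan;
}
```

With this layout, a call that requests only dx succeeds even when y_blocks is out of range, because y_blocks is never inspected unless dy is requested.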

fix cuda launch dimension checks for elementwise and fusion kernels

a2f1fc5

PaddlePaddle-bot suggested changes Jun 25, 2026


paddle-bot Bot added the contributor External developers label Jun 25, 2026

feixi139 wants to merge 1 commit into PaddlePaddle:develop from feixi139:fix_launch_uint32_elementwise_fusion_misc


2 participants
