[Operator Mechanism]Fix cuda launch dimension checks for elementwise and fusion kernels #79375

Open
feixi139 wants to merge 1 commit into PaddlePaddle:develop from feixi139:fix_launch_uint32_elementwise_fusion_misc

feixi139 commented Jun 25, 2026

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

This PR updates CUDA kernel launch configuration in the elementwise kernel path, fusion/gpudnn kernels, and miscellaneous kernel modules including legacy, stride, impl, sparse, and math helper kernels.
The changes add explicit UINT32_MAX and device-limit checks for CUDA grid/block dimensions before kernel launches, and cast the validated launch parameters to uint32_t. This avoids implicit narrowing when int64_t or size_t values are passed as CUDA launch dimensions.
This PR is split from #79333 by directory/module scope to reduce review size. The changes are limited to CUDA launch configuration handling and are not intended to change operator algorithms or numerical behavior.
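
A minimal host-side sketch of the "validate, then cast" pattern described above. The helper name `CheckedLaunchDim` and its use of exceptions are assumptions made for illustration; the PR itself works with Paddle's enforce macros and limits queried from the device context, which are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of Paddle: checks a 64-bit launch dimension
// against the uint32_t range and a device limit before narrowing it.
inline uint32_t CheckedLaunchDim(int64_t value, uint32_t device_limit,
                                 const std::string& name) {
  assert(device_limit > 0);  // precondition: the limit comes from the device
  if (value <= 0) {
    throw std::invalid_argument(name + " must be positive");
  }
  const uint64_t v = static_cast<uint64_t>(value);
  // Explicit range check: without it, a plain cast would silently keep only
  // the low 32 bits of the value.
  if (v > std::numeric_limits<uint32_t>::max()) {
    throw std::out_of_range(name + " exceeds UINT32_MAX");
  }
  // Device-limit check: the hardware limit is usually tighter than UINT32_MAX.
  if (v > device_limit) {
    throw std::out_of_range(name + " exceeds the device limit");
  }
  return static_cast<uint32_t>(value);  // safe: the range was verified above
}
```

Without such a check, a value like `(int64_t{1} << 32) + 5` narrows to `5` when converted to `uint32_t`, so the kernel would be launched with far fewer blocks than intended instead of failing loudly.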

Does this cause precision changes?

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-25 18:14:58

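📋 Review Summary

PR overview: Adds device-limit and uint32_t range checks to the CUDA kernel launch parameters of elementwise, fusion, gpudnn, and other kernels.
Change scope: kernel launch configuration for GPU, GPUDNN, fusion, legacy, and other kernels under paddle/phi/kernels/.
Impact tag: [Operator Mechanism]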

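Issues

| Level | File | Summary |
| --- | --- | --- |
| 🔴 Bug | paddle/phi/kernels/funcs/elementwise_grad_base.h:1875 | The broadcast elementwise grad path unconditionally validates gradient directions that were not requested, which can make a single-sided gradient computation fail early |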

📝 PR Convention Check

The title is missing a space after the tag.

Suggested title (can be copied directly):

  • [Operator Mechanism] Fix cuda launch dimension checks for elementwise and fusion kernels

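The PR description follows the template, and the "Does this cause precision changes?" field has been filled in.

Overall Assessment

The overall direction matches the goal of this PR, but the range checks added in CommonGradBroadcastCUDA need to be pushed down into the branches that actually launch kernels. Otherwise, gradient directions that were not requested are also subject to hard validation, which introduces a behavioral regression for single-sided backward passes on large shapes.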

int x_block_size =
    std::min(static_cast<int64_t>(ELEMWISE_MAX_BLOCK_DIM), x_threads);
uint32_t max_grid_dim = dev_ctx.GetCUDAMaxGridDimSize()[0];
PADDLE_ENFORCE_LE(x_blocks,

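🔴 Bug: x_blocks and y_blocks are both validated unconditionally here, but the corresponding kernels below are launched only inside if (dx) and if (dy), respectively.

ElemwiseGradCompute supports requesting a gradient for only one side, for example dx != nullptr && dy == nullptr. In that case the dy kernel is never launched. With the current change, if y_blocks exceeds the device limit, an error is raised here even though only dx is being computed. The same applies in the opposite direction.

Suggested fix: move the device/uint32_t validation of x_blocks inside if (dx) and the validation of y_blocks inside if (dy), keeping each check before its own static_cast<uint32_t>(...) and kernel launch.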
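
The suggested fix can be sketched on the host side as follows. This is an illustration under stated assumptions, not the code in elementwise_grad_base.h: `LaunchPlan`, `CheckedGridDim`, and `PlanBroadcastGrad` are hypothetical names, exceptions stand in for Paddle's enforce macros, and the actual kernel launches are replaced by flags.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical record of what would be launched; not a Paddle type.
struct LaunchPlan {
  bool launched_dx = false;
  bool launched_dy = false;
  uint32_t dx_grid = 0;
  uint32_t dy_grid = 0;
};

// Hypothetical check: max_grid_dim is itself a uint32_t, so staying at or
// below it also keeps the value inside the uint32_t range.
inline uint32_t CheckedGridDim(int64_t blocks, uint32_t max_grid_dim) {
  assert(max_grid_dim > 0);
  if (blocks <= 0 || static_cast<uint64_t>(blocks) > max_grid_dim) {
    throw std::out_of_range("grid dimension out of range");
  }
  return static_cast<uint32_t>(blocks);
}

// dx and dy model the optional gradient outputs; nullptr means "not requested".
inline LaunchPlan PlanBroadcastGrad(const float* dx, const float* dy,
                                    int64_t x_blocks, int64_t y_blocks,
                                    uint32_t max_grid_dim) {
  LaunchPlan plan;
  if (dx != nullptr) {
    // Validate x_blocks only on the branch that would launch the dx kernel.
    plan.dx_grid = CheckedGridDim(x_blocks, max_grid_dim);
    plan.launched_dx = true;  // placeholder for the dx kernel launch
  }
  if (dy != nullptr) {
    // Validate y_blocks only on the branch that would launch the dy kernel.
    plan.dy_grid = CheckedGridDim(y_blocks, max_grid_dim);
    plan.launched_dy = true;  // placeholder for the dy kernel launch
  }
  return plan;
}
```

With this layout, a call that requests only dx succeeds even when y_blocks is far above the limit, while a call that requests both gradients still fails before any out-of-range value is cast.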

paddle-bot (bot) added the "contributor" (External developers) label Jun 25, 2026