[WIP][Bugfix] Fix illegal memory access in causal_conv1d Triton kernels with CUDA graphs #34685
haosdent wants to merge 1 commit into vllm-project:main
…th CUDA graphs

Replace unreliable `== pad_slot_id` comparisons with robust `< 0` checks in causal_conv1d Triton kernels to prevent out-of-bounds memory access when CUDA graph padding introduces PAD_SLOT_ID (-1) entries.

Fixes vllm-project#34619

Signed-off-by: haosdent <haosdent@gmail.com>
Code Review
This pull request provides a fix for a critical illegal memory access bug in the causal_conv1d Triton kernels. The root cause is a subtle type comparison issue in Triton, which is well-documented in the description. The change from == pad_slot_id to < 0 is a robust and correct solution to this problem, ensuring that padded slots are handled correctly without risking out-of-bounds memory access. The fix is applied consistently and appears to be a solid improvement.
Curious, where does the unsigned int come from? I see all the types are int64, which is signed...

With this PR I got the same illegal memory access :(
Purpose
Fix illegal memory access (CUDA error) when running hybrid models (e.g., Qwen3.5-397B-A17B) with CUDA graphs enabled, particularly with data parallel + expert parallel configurations.
Root cause: The `causal_conv1d` Triton kernels use `== pad_slot_id` equality checks to detect padded CUDA graph entries. `pad_slot_id` is declared as `tl.constexpr` (a compile-time Python int `-1`), while the values loaded from tensors are cast to `tl.int64` at runtime. This cross-type equality comparison can silently fail in Triton due to type promotion semantics. When the check fails, the kernel proceeds to use `-1` (interpreted as `0xFFFFFFFFFFFFFFFF` in unsigned int64) as a memory offset into `conv_state`, causing an out-of-bounds access.

Fix: Replace `== pad_slot_id` with `< 0` at 3 locations in `causal_conv1d.py`. This is robust because valid slot/state indices are always non-negative, and `PAD_SLOT_ID = -1` is the only negative sentinel value. This matches the pattern already used by the working `fused_recurrent_gated_delta_rule_fwd_kernel`.

Fixes #34619
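
For illustration only, here is a host-side sketch in plain Python (not the Triton kernel from `causal_conv1d.py`) of the two ideas described above: what `-1` looks like when its bits are read as an unsigned 64-bit offset, and how a `< 0` style guard filters padded entries. The helper names `as_uint64` and `valid_slots` and the sample batch are hypothetical and do not appear in the PR.

```python
from __future__ import annotations

# Host-side illustration only: plain Python, not the actual Triton kernel.
PAD_SLOT_ID = -1  # sentinel for padded CUDA graph entries, per the PR description


def as_uint64(value: int) -> int:
    """Reinterpret a signed 64-bit integer's bit pattern as unsigned (hypothetical helper)."""
    return value & 0xFFFFFFFFFFFFFFFF


def valid_slots(slot_indices: list[int]) -> list[int]:
    """Keep only non-negative indices; any negative value is treated as padding."""
    return [idx for idx in slot_indices if idx >= 0]


# Hypothetical batch: two real sequences followed by two padded entries.
batch = [3, 0, PAD_SLOT_ID, PAD_SLOT_ID]

print(hex(as_uint64(PAD_SLOT_ID)))  # 0xffffffffffffffff
print(valid_slots(batch))           # [3, 0]
```

The sign-based guard does not depend on the sentinel comparing equal to one exact value, which is the property the PR description relies on.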
Test Plan
`-dp 8 --enable-expert-parallel` and Qwen3.5-397B-A17B model.
Test Result
All 164 tests pass:
But I don't have an 8-GPU machine to run Qwen3.5-397B-A17B.