[WIP][Bugfix] Fix illegal memory access in causal_conv1d Triton kernels with CUDA graphs #34685

Open

haosdent wants to merge 1 commit into vllm-project:main from haosdent:fix-34619

Conversation

Contributor

@haosdent haosdent commented Feb 17, 2026

Purpose

Fix illegal memory access (CUDA error) when running hybrid models (e.g., Qwen3.5-397B-A17B) with CUDA graphs enabled, particularly with data parallel + expert parallel configurations.

Root cause: The causal_conv1d Triton kernels use == pad_slot_id equality checks to detect padded CUDA graph entries. pad_slot_id is declared as tl.constexpr (a compile-time Python int, -1), while the values loaded from tensors are cast to tl.int64 at runtime. This cross-type equality comparison can silently fail in Triton due to type-promotion semantics. When the check fails, the kernel proceeds to use -1 (0xFFFFFFFFFFFFFFFF when reinterpreted as unsigned 64-bit) as a memory offset into conv_state, causing an out-of-bounds access.
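To make the pattern concrete, here is a minimal, self-contained sketch of the suspect check, not the actual vLLM kernel: the kernel name, the tiny index tensor, and the launch harness are all illustrative, assuming only that slot indices are int64 with -1 marking padded entries and that the sentinel arrives as a tl.constexpr.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _pad_check_eq(indices_ptr, out_ptr, pad_slot_id: tl.constexpr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    # Runtime values: loaded from an int64 tensor and kept as int64.
    idx = tl.load(indices_ptr + offs).to(tl.int64)
    # Suspect check: runtime int64 compared against a compile-time constexpr int.
    is_pad = idx == pad_slot_id
    tl.store(out_ptr + offs, is_pad.to(tl.int32))


indices = torch.tensor([3, -1, 0, 7], dtype=torch.int64, device="cuda")
out = torch.empty(4, dtype=torch.int32, device="cuda")
_pad_check_eq[(1,)](indices, out, pad_slot_id=-1, BLOCK=4)
print(out)  # expected [0, 1, 0, 0] if the comparison behaves as intended
```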

Fix: Replace == pad_slot_id with < 0 at 3 locations in causal_conv1d.py. This is robust because valid slot/state indices are always non-negative, and PAD_SLOT_ID = -1 is the only negative sentinel value. This matches the pattern already used by the working fused_recurrent_gated_delta_rule_fwd_kernel.
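For comparison, a sketch of the replacement check under the same assumptions (again with an illustrative kernel name): padding is detected purely by sign, so no constexpr sentinel takes part in the comparison.

```python
import triton
import triton.language as tl


@triton.jit
def _pad_check_lt(indices_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    idx = tl.load(indices_ptr + offs).to(tl.int64)
    # Replacement check: valid slot/state indices are always >= 0, so any
    # negative value (e.g. PAD_SLOT_ID == -1) is treated as a padded entry.
    is_pad = idx < 0
    tl.store(out_ptr + offs, is_pad.to(tl.int32))
```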

Fixes #34619

Test Plan

  1. Run existing causal_conv1d kernel tests covering update, update_with_batch_gather, and varlen scenarios.
  2. Full reproduction of the original issue requires an 8-GPU setup with -dp 8 --enable-expert-parallel and the Qwen3.5-397B-A17B model.

Test Result

All 164 tests pass:

$ python -m pytest tests/kernels/mamba/test_causal_conv1d.py -v
======================= 164 passed, 3 warnings in 23.53s =======================

But I don't have an 8-GPU setup to run Qwen3.5-397B-A17B.

[Bugfix] Fix illegal memory access in causal_conv1d Triton kernels with CUDA graphs

Replace unreliable `== pad_slot_id` comparisons with robust `< 0` checks
in causal_conv1d Triton kernels to prevent out-of-bounds memory access
when CUDA graph padding introduces PAD_SLOT_ID (-1) entries.

Fixes vllm-project#34619

Signed-off-by: haosdent <haosdent@gmail.com>
@haosdent haosdent requested a review from tdoublep as a code owner February 17, 2026 10:06
@mergify mergify bot added the nvidia and bug (Something isn't working) labels on Feb 17, 2026
@haosdent haosdent changed the title from "[Bugfix] Fix illegal memory access in causal_conv1d Triton kernels with CUDA graphs" to "[WIP][Bugfix] Fix illegal memory access in causal_conv1d Triton kernels with CUDA graphs" on Feb 17, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request provides a fix for a critical illegal memory access bug in the causal_conv1d Triton kernels. The root cause is a subtle type comparison issue in Triton, which is well-documented in the description. The change from == pad_slot_id to < 0 is a robust and correct solution to this problem, ensuring that padded slots are handled correctly without risking out-of-bounds memory access. The fix is applied consistently and appears to be a solid improvement.

@vadiklyutiy
Collaborator

vadiklyutiy commented Feb 17, 2026

When the check fails, the kernel proceeds to use -1 (interpreted as 0xFFFFFFFFFFFFFFFF in unsigned int64) as a memory offset into conv_state, causing an out-of-bounds access.

Curious, where does the unsigned int come from? I see all the types are int64, which is signed...

@vadiklyutiy
Collaborator

With this PR I got the same illegal memory access :(


Labels

bug (Something isn't working), nvidia

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[Bug]: Qwen3.5. illegal memory access

2 participants