Skip to content

Hybrid Attention broken since 0.4.0 (affects Granite 4.0 H, Qwen3-Next, Jamba-3B, Nemotron-H-8b) #1931

@Steven0236

Description

@Steven0236

After upgrading vllm (compiled from source) and flashinfer, I noticed that Qwen3-Next-80b has lost a lot of precision and can't use tools and talks nonsense after the 2nd conversation turn. If I switch the backend to "FLASH_ATTN", the problem goes away. This makes me think that the problem is likely in flashinfer.

Here is the troublesome configuration:

VLLM_ATTENTION_BACKEND="FLASHINFER"
VLLM_USE_FLASHINFER_SAMPLER="0"
MAX_JOBS="16"
OMP_NUM_THREADS="16"
NVCC_THREADS="16"
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"

vllm serve /AI2/vllm_models/Text/2509/Qwen3-Next-80B-AWQ4     \
  --trust-remote-code                                         \
  --max-model-len                  30000                      \
  --max-num-seqs                   1                          \
  --tensor-parallel-size           4                          \
  --pipeline-parallel-size         1                          \
  --enable-auto-tool-choice                                   \
  --tool-call-parser               hermes                     \
  --tokenizer-mode                 auto                       \
  --gpu-memory-utilization         0.89                       \
  --no-enable-chunked-prefill                                 

Note that I turned off flashinfer_sampler in both cases to help isolate the problem.
I also tried to downgrade flashinfer to 0.3.0 to compare, but that generated errors (maybe the latest vllm needs 0.4.0 somehow).

The troublesome versions are:

vllm:  0.11.0rc2.dev408+g55392bc87.d20251011 
Flashinfer: 0.4.0

If I go back to my earlier version setup shown below. The problem doesn't exist.

vllm:  0.11.0rc2.dev59+gfed8a9b10.d20250926
Flashinfer: 0.3.1

Any confirmation or suggestion appreciated.

PS: I will start studying the verification infrastructure and see if I can contribute some effort to help improve it later so as to catch such problems.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions