Description
After upgrading vLLM (compiled from source) and FlashInfer, Qwen3-Next-80B has lost a lot of precision: it can no longer use tools and talks nonsense from the second conversation turn onward. If I switch the attention backend to "FLASH_ATTN", the problem goes away, which makes me think the problem is likely in FlashInfer.
Here is the troublesome configuration:
```bash
VLLM_ATTENTION_BACKEND="FLASHINFER"
VLLM_USE_FLASHINFER_SAMPLER="0"
MAX_JOBS="16"
OMP_NUM_THREADS="16"
NVCC_THREADS="16"
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"

vllm serve /AI2/vllm_models/Text/2509/Qwen3-Next-80B-AWQ4 \
  --trust-remote-code \
  --max-model-len 30000 \
  --max-num-seqs 1 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --tokenizer-mode auto \
  --gpu-memory-utilization 0.89 \
  --no-enable-chunked-prefill
```
Note that I turned off the FlashInfer sampler (VLLM_USE_FLASHINFER_SAMPLER="0") in both setups to help isolate the problem.
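For reference, here is roughly the kind of exchange that triggers it, as a minimal sketch against the running server (the base_url, prompts, and greedy settings are placeholders, not the exact traffic I sent). Run it once per backend and diff the second-turn replies:

```python
# Minimal two-turn probe against the vllm serve OpenAI-compatible endpoint.
# Run once with VLLM_ATTENTION_BACKEND=FLASHINFER and once with FLASH_ATTN,
# then compare the printed second-turn replies. Prompts are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "/AI2/vllm_models/Text/2509/Qwen3-Next-80B-AWQ4"

messages = [{"role": "user", "content": "What is the capital of France?"}]
first = client.chat.completions.create(model=MODEL, messages=messages, temperature=0.0)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# The second turn is where the FLASHINFER run starts to degrade for me.
messages.append({"role": "user", "content": "And what is its population?"})
second = client.chat.completions.create(model=MODEL, messages=messages, temperature=0.0)
print(second.choices[0].message.content)
```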
I also tried downgrading FlashInfer to 0.3.0 for comparison, but that produced errors (perhaps the latest vLLM requires 0.4.0).
The troublesome versions are:

```
vllm: 0.11.0rc2.dev408+g55392bc87.d20251011
flashinfer: 0.4.0
```
If I go back to my earlier setup, shown below, the problem does not occur:

```
vllm: 0.11.0rc2.dev59+gfed8a9b10.d20250926
flashinfer: 0.3.1
```
Any confirmation or suggestion appreciated.
PS: I will start studying the verification infrastructure and see if I can contribute to improving it, so that problems like this get caught earlier.
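For instance, something along these lines could catch this class of regression automatically (a sketch only, with hypothetical file names and tolerance; this is not existing test code): dump the per-token logprobs of the same greedy completion under each backend and flag divergence.

```python
# Sketch of a backend-drift check: compare per-token logprobs of the same
# greedy completion captured under FLASH_ATTN (reference) and FLASHINFER.
# Assumes each run was dumped to JSON as a list of [token, logprob] pairs
# (e.g. from a chat completion requested with logprobs=True); the file
# names and tolerance below are placeholders.
import json

TOL = 0.05  # hypothetical per-token logprob drift tolerance

def load_logprobs(path):
    with open(path) as f:
        return json.load(f)

ref = load_logprobs("flash_attn_logprobs.json")
test = load_logprobs("flashinfer_logprobs.json")

for i, ((tok_a, lp_a), (tok_b, lp_b)) in enumerate(zip(ref, test)):
    if tok_a != tok_b:
        print(f"token {i}: tokens diverge, {tok_a!r} vs {tok_b!r}")
        break
    if abs(lp_a - lp_b) > TOL:
        print(f"token {i}: logprob drift, {lp_a:.4f} vs {lp_b:.4f}")
```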