Description
After upgrading vLLM (compiled from source) and FlashInfer, Qwen3-Next-80B has lost a lot of precision: it can no longer use tools and talks nonsense from the second conversation turn onward. If I switch the attention backend to "FLASH_ATTN", the problem goes away, which makes me think the problem is likely in FlashInfer.
Here is the troublesome configuration:
```bash
VLLM_ATTENTION_BACKEND="FLASHINFER"
VLLM_USE_FLASHINFER_SAMPLER="0"
MAX_JOBS="16"
OMP_NUM_THREADS="16"
NVCC_THREADS="16"
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"

vllm serve /AI2/vllm_models/Text/2509/Qwen3-Next-80B-AWQ4 \
  --trust-remote-code \
  --max-model-len 30000 \
  --max-num-seqs 1 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --tokenizer-mode auto \
  --gpu-memory-utilization 0.89 \
  --no-enable-chunked-prefill
```
Note that I turned off the FlashInfer sampler (VLLM_USE_FLASHINFER_SAMPLER="0") in both setups to help isolate the problem.
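For reference, here is roughly the kind of exchange that triggers it, as a minimal sketch against the running server (the base_url, prompts, and greedy settings are placeholders, not the exact traffic I sent). Run it once per backend and diff the second-turn replies:

```python
# Minimal two-turn probe against the vllm serve OpenAI-compatible endpoint.
# Run once with VLLM_ATTENTION_BACKEND=FLASHINFER and once with FLASH_ATTN,
# then compare the printed second-turn replies. Prompts are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "/AI2/vllm_models/Text/2509/Qwen3-Next-80B-AWQ4"

messages = [{"role": "user", "content": "What is the capital of France?"}]
first = client.chat.completions.create(model=MODEL, messages=messages, temperature=0.0)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# The second turn is where the FLASHINFER run starts to degrade for me.
messages.append({"role": "user", "content": "And what is its population?"})
second = client.chat.completions.create(model=MODEL, messages=messages, temperature=0.0)
print(second.choices[0].message.content)
```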
I also tried downgrading FlashInfer to 0.3.0 for comparison, but that produced errors (perhaps the latest vLLM requires 0.4.0).
The troublesome versions are:

```
vllm: 0.11.0rc2.dev408+g55392bc87.d20251011
flashinfer: 0.4.0
```
If I go back to my earlier setup, shown below, the problem does not occur:

```
vllm: 0.11.0rc2.dev59+gfed8a9b10.d20250926
flashinfer: 0.3.1
```
Any confirmation or suggestion appreciated.
PS: I will start studying the verification infrastructure and see if I can contribute to improving it, so that problems like this get caught earlier.
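For instance, something along these lines could catch this class of regression automatically (a sketch only, with hypothetical file names and tolerance; this is not existing test code): dump the per-token logprobs of the same greedy completion under each backend and flag divergence.

```python
# Sketch of a backend-drift check: compare per-token logprobs of the same
# greedy completion captured under FLASH_ATTN (reference) and FLASHINFER.
# Assumes each run was dumped to JSON as a list of [token, logprob] pairs
# (e.g. from a chat completion requested with logprobs=True); the file
# names and tolerance below are placeholders.
import json

TOL = 0.05  # hypothetical per-token logprob drift tolerance

def load_logprobs(path):
    with open(path) as f:
        return json.load(f)

ref = load_logprobs("flash_attn_logprobs.json")
test = load_logprobs("flashinfer_logprobs.json")

for i, ((tok_a, lp_a), (tok_b, lp_b)) in enumerate(zip(ref, test)):
    if tok_a != tok_b:
        print(f"token {i}: tokens diverge, {tok_a!r} vs {tok_b!r}")
        break
    if abs(lp_a - lp_b) > TOL:
        print(f"token {i}: logprob drift, {lp_a:.4f} vs {lp_b:.4f}")
```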