Revert "[CI] Update FlashInfer to 0.2.6.post1" --- edit: No, better add "12.0" to FlashInfer TORCH_CUDA_ARCH_LIST see PR #19794 #19810

Closed
@cyril23

Description

Pull request #19297 broke SM 120 Blackwell compatibility (RTX 50xx, RTX PRO).

You can no longer use -e VLLM_USE_FLASHINFER_SAMPLER=1 (which is the default) and have to fall back to -e VLLM_USE_FLASHINFER_SAMPLER=0, which costs performance and produces this warning:

WARNING 06-18 08:55:01 [topk_topp_sampler.py:52] FlashInfer is available, but it is not enabled. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please set VLLM_USE_FLASHINFER_SAMPLER=1.
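The fallback described above is just an environment-variable gate. As a minimal sketch (hypothetical helper names; the real logic lives in vLLM's topk_topp_sampler.py and envs.py):

```python
import os
import warnings

def flashinfer_sampler_enabled(flashinfer_available: bool = True) -> bool:
    """Sketch of the VLLM_USE_FLASHINFER_SAMPLER gate (hypothetical helper,
    not vLLM code). Returns whether the FlashInfer sampling path is used."""
    flag = os.environ.get("VLLM_USE_FLASHINFER_SAMPLER", "1")
    if flashinfer_available and flag == "1":
        return True
    if flashinfer_available and flag == "0":
        # This is the situation the warning above describes: FlashInfer is
        # installed but disabled, so the PyTorch-native sampler is used.
        warnings.warn(
            "FlashInfer is available, but it is not enabled. Falling back to "
            "the PyTorch-native implementation of top-p & top-k sampling."
        )
    return False
```

So with the updated FlashInfer build, setting the flag to 0 is the only way to boot on SM 120, at the cost of the slower native sampler.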

I did two builds, both with --build-arg torch_cuda_arch_list='12.0' (SM 120 only), and pushed them to Docker Hub:

  1. wurstdeploy/vllm:azure10thjunesolo120, which is based on the last commit of 10th June (da9b523) and still uses the old FlashInfer version
git checkout -b 10thjune da9b523ce1fd5c27bfd18921ba0388bf2e8e4618
DOCKER_BUILDKIT=1 sudo docker build --build-arg max_jobs=64   --build-arg USE_SCCACHE=0 --build-arg GIT_REPO_CHECK=1   --build-arg CUDA_VERSION=12.8.1   --build-arg torch_cuda_arch_list='12.0'   --build-arg RUN_WHEEL_CHECK=false   --tag wurstdeploy/vllm:azure10thjunesolo120 --target vllm-openai   --progress plain -f docker/Dockerfile .

# this is still SM 120 compatible; you can run it via:
sudo docker run --runtime nvidia --gpus all     -v ~/.cache/huggingface:/root/.cache/huggingface     -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  wurstdeploy/vllm:azure10thjunesolo120    --model Qwen/Qwen3-0.6B
  2. wurstdeploy/vllm:azure11thjunesolo120, which is based on the last commit of 11th June (42f52cc) and already includes commit 497a91e and therefore the updated FlashInfer version
git checkout -b 11thjune 42f52cc95bf34a2e15f4cdbc8474503a9bcc970f
DOCKER_BUILDKIT=1 sudo docker build --build-arg max_jobs=64   --build-arg USE_SCCACHE=0 --build-arg GIT_REPO_CHECK=1   --build-arg CUDA_VERSION=12.8.1   --build-arg torch_cuda_arch_list='12.0'   --build-arg RUN_WHEEL_CHECK=false   --tag wurstdeploy/vllm:azure11thjunesolo120 --target vllm-openai   --progress plain -f docker/Dockerfile .

# this is not fully SM 120 compatible anymore:
sudo docker run --runtime nvidia --gpus all     -v ~/.cache/huggingface:/root/.cache/huggingface     -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  wurstdeploy/vllm:azure11thjunesolo120    --model Qwen/Qwen3-0.6B

INFO 06-18 08:53:41 [monitor.py:34] torch.compile takes 18.01 s in total
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Process EngineCore_0:
ERROR 06-18 08:53:41 [core.py:515] EngineCore failed to start.
ERROR 06-18 08:53:41 [core.py:515] Traceback (most recent call last):
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-18 08:53:41 [core.py:515]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-18 08:53:41 [core.py:515]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in __init__
ERROR 06-18 08:53:41 [core.py:515]     self._initialize_kv_caches(vllm_config)
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
ERROR 06-18 08:53:41 [core.py:515]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-18 08:53:41 [core.py:515]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 06-18 08:53:41 [core.py:515]     output = self.collective_rpc("determine_available_memory")
ERROR 06-18 08:53:41 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-18 08:53:41 [core.py:515]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-18 08:53:41 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2680, in run_method
ERROR 06-18 08:53:41 [core.py:515]     return func(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-18 08:53:41 [core.py:515]     return func(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 205, in determine_available_memory
ERROR 06-18 08:53:41 [core.py:515]     self.model_runner.profile_run()
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2015, in profile_run
ERROR 06-18 08:53:41 [core.py:515]     sampler_output = self._dummy_sampler_run(hidden_states)
ERROR 06-18 08:53:41 [core.py:515]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-18 08:53:41 [core.py:515]     return func(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1913, in _dummy_sampler_run
ERROR 06-18 08:53:41 [core.py:515]     raise e
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1903, in _dummy_sampler_run
ERROR 06-18 08:53:41 [core.py:515]     sampler_output = self.sampler(logits=logits,
ERROR 06-18 08:53:41 [core.py:515]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-18 08:53:41 [core.py:515]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-18 08:53:41 [core.py:515]     return forward_call(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 52, in forward
ERROR 06-18 08:53:41 [core.py:515]     sampled = self.sample(logits, sampling_metadata)
ERROR 06-18 08:53:41 [core.py:515]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 118, in sample
ERROR 06-18 08:53:41 [core.py:515]     random_sampled = self.topk_topp_sampler(
ERROR 06-18 08:53:41 [core.py:515]                      ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-18 08:53:41 [core.py:515]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-18 08:53:41 [core.py:515]     return forward_call(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 104, in forward_cuda
ERROR 06-18 08:53:41 [core.py:515]     return flashinfer_sample(logits, k, p, generators)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 290, in flashinfer_sample
ERROR 06-18 08:53:41 [core.py:515]     next_token_ids = flashinfer.sampling.top_k_top_p_sampling_from_logits(
ERROR 06-18 08:53:41 [core.py:515]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 901, in top_k_top_p_sampling_from_logits
ERROR 06-18 08:53:41 [core.py:515]     masked_logits = top_k_mask_logits(logits, top_k)
ERROR 06-18 08:53:41 [core.py:515]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 1221, in top_k_mask_logits
ERROR 06-18 08:53:41 [core.py:515]     return get_sampling_module().top_k_mask_logits(
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 352, in top_k_mask_logits
ERROR 06-18 08:53:41 [core.py:515]     module.top_k_mask_logits.default(
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
ERROR 06-18 08:53:41 [core.py:515]     return self._op(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] RuntimeError: TopKMaskLogits failed with error code no kernel image is available for execution on the device
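The final "no kernel image is available for execution on the device" means the FlashInfer kernels were compiled without an SM 120 target, so there is no cubin matching the RTX 50xx GPU. Whether a TORCH_CUDA_ARCH_LIST string covers a device can be sketched like this (hypothetical helper, not vLLM or FlashInfer code; PTX forward-compatibility is ignored):

```python
def arch_list_covers(arch_list: str, capability: tuple[int, int]) -> bool:
    """Return True if a TORCH_CUDA_ARCH_LIST-style string (e.g. '7.0 8.9 12.0')
    names the device's (major, minor) compute capability exactly.
    Sketch only: '+PTX' JIT forward-compatibility is not modeled."""
    target = f"{capability[0]}.{capability[1]}"
    return target in (a.removesuffix("+PTX") for a in arch_list.split())

# An arch list without 12.0 cannot serve an RTX 50xx (SM 120) card:
arch_list_covers("7.0 7.5 8.0 8.9 9.0", (12, 0))        # False
arch_list_covers("7.0 7.5 8.0 8.9 9.0 12.0", (12, 0))   # True
```

That is why adding "12.0" to the FlashInfer TORCH_CUDA_ARCH_LIST (as proposed in PR #19794) fixes the crash.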


# you can only run it without FlashInfer sampling, i.e. -e VLLM_USE_FLASHINFER_SAMPLER=0:
sudo docker run --runtime nvidia --gpus all     -v ~/.cache/huggingface:/root/.cache/huggingface     -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 \
  wurstdeploy/vllm:azure11thjunesolo120    --model Qwen/Qwen3-0.6B
> WARNING 06-18 08:55:01 [topk_topp_sampler.py:52] FlashInfer is available, but it is not enabled. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please set VLLM_USE_FLASHINFER_SAMPLER=1.

Originally posted by @cyril23 in #19297 (comment)
