Pull request #19297 broke SM 120 Blackwell compatibility (RTX 50xx, RTX PRO).
You can no longer run with -e VLLM_USE_FLASHINFER_SAMPLER=1
(which is the default) and have to fall back to -e VLLM_USE_FLASHINFER_SAMPLER=0,
which costs performance and produces this warning:
WARNING 06-18 08:55:01 [topk_topp_sampler.py:52] FlashInfer is available, but it is not enabled. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please set VLLM_USE_FLASHINFER_SAMPLER=1.
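For context, the RTX 50xx / RTX PRO Blackwell cards report compute capability 12.0 (SM 120), which is why the images below are built with torch_cuda_arch_list='12.0'. A minimal sketch to confirm what your GPU and your installed torch report (run inside the container; assumes PyTorch with CUDA is available):

# Quick sanity check of the GPU's SM version and the archs torch was built for.
# On RTX 50xx / RTX PRO Blackwell cards the capability should be 12.0 (SM 120).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print("torch built for:", torch.cuda.get_arch_list())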
I did 2 builds, both with --build-arg torch_cuda_arch_list='12.0'
(SM 120 compatible only), and pushed them to Docker Hub:
wurstdeploy/vllm:azure10thjunesolo120
which is based on the last commit of 10th June (da9b523) and which still uses the old FlashInfer version
git checkout -b 10thjune da9b523ce1fd5c27bfd18921ba0388bf2e8e4618
DOCKER_BUILDKIT=1 sudo docker build --build-arg max_jobs=64 --build-arg USE_SCCACHE=0 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --build-arg torch_cuda_arch_list='12.0' --build-arg RUN_WHEEL_CHECK=false --tag wurstdeploy/vllm:azure10thjunesolo120 --target vllm-openai --progress plain -f docker/Dockerfile .
# this is still SM 120 compatible, you can run via
sudo docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 \
-e VLLM_USE_FLASHINFER_SAMPLER=1 \
wurstdeploy/vllm:azure10thjunesolo120 --model Qwen/Qwen3-0.6B
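Once the server is up, a quick way to exercise the top-k & top-p sampling path (and therefore the FlashInfer sampler) is a completion request with top_k/top_p set; a minimal sketch using Python requests against the OpenAI-compatible endpoint (the prompt and sampling values are just illustrative):

# Smoke test against the server started above; top_k/top_p force the
# top-k & top-p sampling path that VLLM_USE_FLASHINFER_SAMPLER=1 accelerates.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "prompt": "Hello, my name is",
        "max_tokens": 16,
        "temperature": 0.8,
        "top_p": 0.9,
        "top_k": 20,  # vLLM accepts top_k as an extension of the OpenAI API
    },
)
print(resp.json()["choices"][0]["text"])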
wurstdeploy/vllm:azure11thjunesolo120
which is based on the last commit of 11th June (42f52cc) and already includes commit 497a91e and therefore the updated FlashInfer version
git checkout -b 11thjune 42f52cc95bf34a2e15f4cdbc8474503a9bcc970f
DOCKER_BUILDKIT=1 sudo docker build --build-arg max_jobs=64 --build-arg USE_SCCACHE=0 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --build-arg torch_cuda_arch_list='12.0' --build-arg RUN_WHEEL_CHECK=false --tag wurstdeploy/vllm:azure11thjunesolo120 --target vllm-openai --progress plain -f docker/Dockerfile .
# this is not fully SM 120 compatible anymore:
sudo docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 \
-e VLLM_USE_FLASHINFER_SAMPLER=1 \
wurstdeploy/vllm:azure11thjunesolo120 --model Qwen/Qwen3-0.6B
INFO 06-18 08:53:41 [monitor.py:34] torch.compile takes 18.01 s in total
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Process EngineCore_0:
ERROR 06-18 08:53:41 [core.py:515] EngineCore failed to start.
ERROR 06-18 08:53:41 [core.py:515] Traceback (most recent call last):
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-18 08:53:41 [core.py:515] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-18 08:53:41 [core.py:515] super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in __init__
ERROR 06-18 08:53:41 [core.py:515] self._initialize_kv_caches(vllm_config)
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
ERROR 06-18 08:53:41 [core.py:515] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 06-18 08:53:41 [core.py:515] output = self.collective_rpc("determine_available_memory")
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-18 08:53:41 [core.py:515] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2680, in run_method
ERROR 06-18 08:53:41 [core.py:515] return func(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-18 08:53:41 [core.py:515] return func(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 205, in determine_available_memory
ERROR 06-18 08:53:41 [core.py:515] self.model_runner.profile_run()
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2015, in profile_run
ERROR 06-18 08:53:41 [core.py:515] sampler_output = self._dummy_sampler_run(hidden_states)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-18 08:53:41 [core.py:515] return func(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1913, in _dummy_sampler_run
ERROR 06-18 08:53:41 [core.py:515] raise e
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1903, in _dummy_sampler_run
ERROR 06-18 08:53:41 [core.py:515] sampler_output = self.sampler(logits=logits,
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-18 08:53:41 [core.py:515] return self._call_impl(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-18 08:53:41 [core.py:515] return forward_call(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 52, in forward
ERROR 06-18 08:53:41 [core.py:515] sampled = self.sample(logits, sampling_metadata)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 118, in sample
ERROR 06-18 08:53:41 [core.py:515] random_sampled = self.topk_topp_sampler(
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-18 08:53:41 [core.py:515] return self._call_impl(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-18 08:53:41 [core.py:515] return forward_call(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 104, in forward_cuda
ERROR 06-18 08:53:41 [core.py:515] return flashinfer_sample(logits, k, p, generators)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 290, in flashinfer_sample
ERROR 06-18 08:53:41 [core.py:515] next_token_ids = flashinfer.sampling.top_k_top_p_sampling_from_logits(
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 901, in top_k_top_p_sampling_from_logits
ERROR 06-18 08:53:41 [core.py:515] masked_logits = top_k_mask_logits(logits, top_k)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 1221, in top_k_mask_logits
ERROR 06-18 08:53:41 [core.py:515] return get_sampling_module().top_k_mask_logits(
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 352, in top_k_mask_logits
ERROR 06-18 08:53:41 [core.py:515] module.top_k_mask_logits.default(
ERROR 06-18 08:53:41 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
ERROR 06-18 08:53:41 [core.py:515] return self._op(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] RuntimeError: TopKMaskLogits failed with error code no kernel image is available for execution on the device
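The failing call can be isolated from vLLM entirely; a minimal sketch (run inside the azure11thjunesolo120 image; the tensor shapes and dtypes are just illustrative) that hits the same "no kernel image is available" error on SM 120:

# Call the FlashInfer op from the traceback above directly.
# On an SM 120 GPU with the updated FlashInfer this raises
# RuntimeError: ... no kernel image is available for execution on the device
import torch
import flashinfer.sampling

logits = torch.randn(4, 32000, device="cuda")                   # dummy batch of logits
top_k = torch.full((4,), 20, dtype=torch.int32, device="cuda")  # per-request top-k
masked = flashinfer.sampling.top_k_mask_logits(logits, top_k)   # fails here on SM 120
print(masked.shape)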
# you can only run it without FlashInfer, i.e. -e VLLM_USE_FLASHINFER_SAMPLER=0:
sudo docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 \
-e VLLM_USE_FLASHINFER_SAMPLER=0 \
wurstdeploy/vllm:azure11thjunesolo120 --model Qwen/Qwen3-0.6B
WARNING 06-18 08:55:01 [topk_topp_sampler.py:52] FlashInfer is available, but it is not enabled. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please set VLLM_USE_FLASHINFER_SAMPLER=1.
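For reference, that warning comes from the sampler selection in topk_topp_sampler.py; roughly, the gating it describes looks like this (a simplified sketch, not the actual vLLM code):

# Simplified sketch of the gating described by the warning above
# (illustrative only, not the actual vLLM implementation).
import os

def flashinfer_available() -> bool:
    try:
        import flashinfer.sampling  # noqa: F401
        return True
    except ImportError:
        return False

use_flashinfer = os.environ.get("VLLM_USE_FLASHINFER_SAMPLER", "1") == "1"
if flashinfer_available() and not use_flashinfer:
    print("FlashInfer is available, but it is not enabled. Falling back to the "
          "PyTorch-native implementation of top-p & top-k sampling.")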
Originally posted by @cyril23 in #19297 (comment)