
[Bug]: flashinfer not in docker build #6221

Closed
pseudotensor opened this issue Jul 8, 2024 · 5 comments
Labels: bug (Something isn't working)

pseudotensor commented Jul 8, 2024

Your current environment

Same env and launch command as #6220, but on the head of main at ddc369f.

For launch, I added this:

export VLLM_ATTENTION_BACKEND=FLASHINFER

because it failed with this error otherwise:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/home/ubuntu/vllm/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 353, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/home/ubuntu/vllm/vllm/executor/gpu_executor.py", line 76, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/worker.py", line 173, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/model_runner.py", line 866, in profile_run
[rank0]:     model_input = self.prepare_model_input(
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/model_runner.py", line 1161, in prepare_model_input
[rank0]:     model_input = self._prepare_model_input_tensors(
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/model_runner.py", line 690, in _prepare_model_input_tensors
[rank0]:     raise ValueError("Please use Flashinfer backend for models with"
[rank0]: ValueError: Please use Flashinfer backend for models withlogits_soft_cap (i.e., Gemma-2). Otherwise, the output might be wrong. Set Flashinfer backend by export VLLM_ATTENTION_BACKEND=FLASHINFER.
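
For reference, the backend can also be forced from Python before the engine is created; a minimal sketch, assuming the offline LLM entrypoint and a placeholder Gemma-2 model name (neither taken from this report):

import os

# The attention backend is chosen at engine init, so set the variable first.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM  # import after the environment variable is set

# Placeholder model; Gemma-2 uses logits_soft_cap, which triggers this check.
llm = LLM(model="google/gemma-2-27b-it")
print(llm.generate(["Hello"])[0].outputs[0].text)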

🐛 Describe the bug

So with that env set, I get:

INFO 07-08 19:57:45 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-08 19:57:54 model_runner.py:255] Loading model weights took 50.8043 GB
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/home/ubuntu/vllm/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 353, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/home/ubuntu/vllm/vllm/executor/gpu_executor.py", line 76, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/worker.py", line 173, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/model_runner.py", line 874, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/model_runner.py", line 1201, in execute_model
[rank0]:     BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable
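
This TypeError is the symptom of flashinfer not being importable at all; a rough illustration of the pattern (not vLLM's exact code), where a guarded import leaves the wrapper name bound to None and the later call fails:

# Illustrative sketch only: vLLM guards the flashinfer import, so a missing
# package silently leaves the name as None until the FLASHINFER path calls it.
try:
    from flashinfer import BatchDecodeWithPagedKVCacheWrapper
except ImportError:
    BatchDecodeWithPagedKVCacheWrapper = None

if BatchDecodeWithPagedKVCacheWrapper is None:
    # This mirrors what model_runner.py effectively does when the FLASHINFER
    # backend is selected but the package is absent:
    BatchDecodeWithPagedKVCacheWrapper()  # TypeError: 'NoneType' object is not callable
else:
    print("flashinfer is installed; the wrapper class is callable")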

pseudotensor added the bug label on Jul 8, 2024
simon-mo (Collaborator) commented Jul 8, 2024

This error suggests the FlashInfer version might not be correct. Can you double-check it is v0.0.8? https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.0.8
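
One quick way to verify, assuming flashinfer was installed as a pip package under the distribution name "flashinfer" (an assumption, not from this thread):

from importlib.metadata import PackageNotFoundError, version

try:
    # Expect "0.0.8" per the comment above.
    print(version("flashinfer"))
except PackageNotFoundError:
    print("flashinfer is not installed in this environment")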

pseudotensor (Author) commented:

I followed this:

https://github.com/flashinfer-ai/flashinfer?tab=readme-ov-file#installation

and it no longer fails. The vLLM Docker image does not seem to include flashinfer.
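
A simple presence check can be run inside the container, e.g. docker run --rm --entrypoint python3 vllm/vllm-openai -c '<snippet>' (the vllm/vllm-openai image name is the usual published one, but treat it as an assumption here):

import importlib.util

# find_spec returns None when the module cannot be located on sys.path.
spec = importlib.util.find_spec("flashinfer")
print("flashinfer found" if spec else "flashinfer missing")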

pseudotensor changed the title from "[Bug]: BatchDecodeWithPagedKVCacheWrapper( [rank0]: TypeError: 'NoneType' object is not callable" to "[Bug]: flashinfer not in docker build" on Jul 8, 2024
pseudotensor (Author) commented:

So I changed the title.

simon-mo (Collaborator) commented Jul 8, 2024

The published image for latest and v0.5.1 should have FlashInfer, following this PR: #6172

pseudotensor (Author) commented:

Ok thanks.
