### Your current environment
Running in a Docker container (Kubernetes) on 4 GH200 nodes, 1 GPU per node.
### Model Input Dumps
```bash
python3 -m vllm.entrypoints.openai.api_server --model /models/my-model \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --served-model-name my-model \
    --trust-remote-code \
    --api-key "NONE" \
    --rope-scaling '{"rope_type":"dynamic","factor":4.0}' \
    --enable-prefix-caching \
    --max-model-len 131072
```
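A quick check inside any one of these containers confirms that only a single device is visible locally, even though the deployment spans 4 GPUs across the 4 nodes (a minimal sketch, assuming torch is importable in the image):

```python
import torch

# Each GH200 node's container sees exactly one local GPU, even though
# the cluster provides 4 GPUs in total across the 4 nodes.
print(torch.cuda.device_count())  # -> 1 on every node in this setup
```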
### 🐛 Describe the bug
@youkaichao, it looks like #11256 forces `--tensor-parallel-size` to be no larger than the per-node GPU count.
https://github.com/vllm-project/vllm/blob/main/vllm/platforms/cuda.py#L156
```python
# Use confusing message for more common TP-only case.
assert tensor_parallel_size <= cuda_device_count, (
    f"please set tensor_parallel_size ({tensor_parallel_size}) "
    f"to less than max local gpu count ({cuda_device_count})")
```
Currently testing main with 4 nodes, 1 GPU per node, which results in the following (the same model/code/execution works perfectly on v0.6.4.post1):
```text
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 382, in run_mp_engine
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 371, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 115, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1244, in create_engine_config
    config = VllmConfig(
             ^^^^^^^^^^^
  File "<string>", line 19, in __init__
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 3204, in __post_init__
    current_platform.check_and_update_config(self)
  File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 156, in check_and_update_config
    assert tensor_parallel_size <= cuda_device_count, (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: please set tensor_parallel_size (4) to less than max local gpu count (1)
```
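For illustration only, one way such a guard could account for multi-node deployments. This is a hedged sketch of the idea, not the actual vLLM code; `multi_node` is a hypothetical parameter standing in for whatever signal the engine has (e.g. a Ray executor backend):

```python
def check_tensor_parallel_size(tensor_parallel_size: int,
                               cuda_device_count: int,
                               multi_node: bool) -> None:
    # Hypothetical guard: a single-node launch can never satisfy
    # TP > local GPU count, so keep the assertion there; a multi-node
    # deployment legitimately can, so skip the local bound and let the
    # distributed executor validate total GPU availability instead.
    if not multi_node:
        assert tensor_parallel_size <= cuda_device_count, (
            f"please set tensor_parallel_size ({tensor_parallel_size}) "
            f"to less than max local gpu count ({cuda_device_count})")

# The 4-node, 1-GPU-per-node case from this report would then pass:
check_tensor_parallel_size(4, 1, multi_node=True)
```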
### Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.