### Your current environment
The vLLM OpenAI Docker image is used: `vllm/vllm-openai:v0.8.5.post1`
### 🐛 Describe the bug
Running more than one vLLM instance on a single GPU with vLLM V1 enabled fails. The same setup with `VLLM_USE_V1=False` works.

The issue is that vLLM V1's `total_allocated_bytes` mistakenly includes memory consumed by other vLLM instances. The first instance starts without an error; the second instance fails with a VRAM OOM.
The documentation on `--gpu-memory-utilization` says:

> The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9. This is a per-instance limit, and only applies to the current vLLM instance. It does not matter if you have another vLLM instance running on the same GPU. For example, if you have two vLLM instances running on the same GPU, you can set the GPU memory utilization to 0.5 for each instance.
>
> Default: 0.9
So V1 memory management should reflect this, or the documentation should be updated to describe V1's new behaviour (budgeting against currently available VRAM rather than total VRAM).
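For concreteness, this is what the documented per-instance semantics imply numerically; the 24 GiB figure for the RTX 4090 below is only illustrative:

```python
# Illustrative arithmetic only -- not vLLM code.
# Per the documentation, the budget is a fraction of *total* GPU memory,
# independent of any other process running on the same GPU.
total_gib = 24.0                 # e.g. an RTX 4090 (illustrative value)
gpu_memory_utilization = 0.3     # value passed to each instance

per_instance_budget_gib = gpu_memory_utilization * total_gib
print(f"Each instance should budget ~{per_instance_budget_gib:.1f} GiB")
# -> ~7.2 GiB per instance, so two instances at 0.3 should fit comfortably.
```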
The error likely stems from how V1 calculates the free VRAM in `vllm/v1/worker/gpu_worker.py` (lines 200 to 210 at commit 3015d56): `total_allocated_bytes` mistakenly includes memory consumed by other vLLM instances. As a result, `non_torch_allocations` includes the redundant memory, which leads to an unnecessarily low, or even negative, `available_kv_cache_memory`.
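To illustrate the failure mode, here is a rough sketch (not vLLM's actual code; it only assumes that non-torch overhead is derived from device-wide free/total numbers) of how the first instance's VRAM ends up being charged to the second instance's profiling pass:

```python
import torch

# Rough sketch of the failure mode described above -- NOT the actual
# gpu_worker.py logic. Assumption: "non-torch" overhead is derived from
# device-wide memory numbers, which also count other processes' memory.

def sketch_available_kv_cache_memory(gpu_memory_utilization: float) -> int:
    free_bytes, total_bytes = torch.cuda.mem_get_info()

    # Everything allocated on the device right now, including the *other*
    # vLLM instance's weights and KV cache, not just this process's usage.
    total_allocated_bytes = total_bytes - free_bytes

    # Memory tracked by this process's own torch allocator (model weights etc.).
    torch_allocated_bytes = torch.cuda.memory_stats().get(
        "allocated_bytes.all.current", 0)

    # The other instance's VRAM is misattributed as "non-torch" overhead here.
    non_torch_allocations = total_allocated_bytes - torch_allocated_bytes

    budget = int(gpu_memory_utilization * total_bytes)
    return budget - torch_allocated_bytes - non_torch_allocations
    # Can be <= 0 when another instance occupies the GPU, triggering
    # "No available memory for the cache blocks."
```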
Hardware: NVIDIA GeForce RTX 4090

To reproduce the issue, use these commands:
```bash
docker run -d --runtime nvidia --gpus 1 \
  --env "HUGGING_FACE_HUB_TOKEN=TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.8.5.post1 \
  --port 8000 \
  --model HuggingFaceTB/SmolLM-135M \
  --gpu-memory-utilization 0.3 \
  --max-num-seqs 256 \
  --tensor-parallel-size 1

docker run -d --runtime nvidia --gpus 1 \
  --env "HUGGING_FACE_HUB_TOKEN=TOKEN" \
  -p 8001:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.8.5.post1 \
  --port 8000 \
  --model HuggingFaceTB/SmolLM-135M \
  --gpu-memory-utilization 0.3 \
  --max-num-seqs 256 \
  --tensor-parallel-size 1
```
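After the first container is up, a quick check of the device-wide numbers shows the memory that the second instance's profiling pass will end up counting (a diagnostic sketch run on the host, not part of the repro itself):

```python
import torch

# Diagnostic sketch: how much of the GPU is already occupied (largely by the
# first vLLM instance) before the second container starts profiling.
free_bytes, total_bytes = torch.cuda.mem_get_info()
used_gib = (total_bytes - free_bytes) / 2**30
print(f"Already in use: {used_gib:.2f} GiB of {total_bytes / 2**30:.2f} GiB")
```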
Error message:

```text
INFO 05-07 02:13:52 [loader.py:458] Loading weights took 0.10 seconds
INFO 05-07 02:13:52 [gpu_model_runner.py:1347] Model loading took 0.2533 GiB and 0.907387 seconds
INFO 05-07 02:13:56 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/3069c682cd/rank_0_0 for vLLM's torch.compile
INFO 05-07 02:13:56 [backends.py:430] Dynamo bytecode transform time: 3.58 s
INFO 05-07 02:13:57 [backends.py:136] Cache the graph of shape None for later use
INFO 05-07 02:14:09 [backends.py:148] Compiling a graph for general shape takes 13.43 s
INFO 05-07 02:14:18 [monitor.py:33] torch.compile takes 17.00 s in total
ERROR 05-07 02:14:18 [core.py:396] EngineCore failed to start.
ERROR 05-07 02:14:18 [core.py:396] Traceback (most recent call last):
ERROR 05-07 02:14:18 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-07 02:14:18 [core.py:396] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-07 02:14:18 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 02:14:18 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-07 02:14:18 [core.py:396] super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-07 02:14:18 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 05-07 02:14:18 [core.py:396] self._initialize_kv_caches(vllm_config)
ERROR 05-07 02:14:18 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 134, in _initialize_kv_caches
ERROR 05-07 02:14:18 [core.py:396] get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
ERROR 05-07 02:14:18 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 699, in get_kv_cache_config
ERROR 05-07 02:14:18 [core.py:396] check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
ERROR 05-07 02:14:18 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 527, in check_enough_kv_cache_memory
ERROR 05-07 02:14:18 [core.py:396] raise ValueError("No available memory for the cache blocks. "
ERROR 05-07 02:14:18 [core.py:396] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
self._initialize_kv_caches(vllm_config)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 134, in _initialize_kv_caches
get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 699, in get_kv_cache_config
check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 527, in check_enough_kv_cache_memory
raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
[rank0]:[W507 02:14:19.323764213 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
self.engine_core = core_client_class(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 642, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 398, in __init__
self._wait_for_engine_startup()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
```
Related issues: #17366, #14376, #10643, #16141
### Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.