[Bug]: available VRAM calculation bug in V1 #17979

Closed
@YanickSchraner

Description

Your current environment

The vLLM OpenAI Docker image is used:
vllm/vllm-openai:v0.8.5.post1

🐛 Describe the bug

Running more than one vLLM instance on a single GPU with vLLM V1 enabled fails. The same setup works with VLLM_USE_V1=False. The issue is that vLLM V1's total_allocated_bytes mistakenly includes memory consumed by other vLLM instances.

The first instance starts without an error. The second instance fails with a VRAM OOM.

The documentation for --gpu-memory-utilization says:

The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9. This is a per-instance limit, and only applies to the current vLLM instance. It does not matter if you have another vLLM instance running on the same GPU. For example, if you have two vLLM instances running on the same GPU, you can set the GPU memory utilization to 0.5 for each instance.

Default: 0.9

So V1's memory management should reflect this, or the documentation should be updated to describe V1's new behaviour (budgeting against the currently available VRAM rather than total VRAM).
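
For concreteness, a quick back-of-the-envelope check of the documented per-instance semantics (assuming the 24 GiB of an RTX 4090 and the 0.3 utilization used in the reproduction below):

GIB = 2**30
total_gpu_memory = 24 * GIB        # RTX 4090
gpu_memory_utilization = 0.3       # value used in the reproduction below

# Per the documentation, each instance's budget is a fraction of *total* GPU
# memory, independent of any other instances on the same device.
per_instance_budget = total_gpu_memory * gpu_memory_utilization
print(per_instance_budget / GIB)   # ~7.2 GiB per instance, ~14.4 GiB for two

Two instances at 0.3 each should therefore need at most about 14.4 GiB combined, which fits comfortably in 24 GiB, so the second instance's failure comes from the accounting rather than from actual memory pressure.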

The error is likely in how V1 calculates the free VRAM:

torch.cuda.empty_cache()
torch_allocated_bytes = torch.cuda.memory_stats(
)["allocated_bytes.all.current"]
total_allocated_bytes = torch.cuda.mem_get_info(
)[1] - torch.cuda.mem_get_info()[0]
non_torch_allocations = total_allocated_bytes - torch_allocated_bytes
if non_torch_allocations > 0:
    peak_memory += non_torch_allocations
available_kv_cache_memory = (
    total_gpu_memory * self.cache_config.gpu_memory_utilization -
    peak_memory)

total_allocated_bytes mistakenly includes memory consumed by other vLLM instances. As a result, non_torch_allocations includes that foreign memory, which leads to an unnecessarily low, or even negative, available_kv_cache_memory.
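
A minimal standalone sketch (not vLLM code) of why this happens: torch.cuda.mem_get_info() wraps cudaMemGetInfo and reports device-wide free/total memory, so total - free counts the allocations of every process on the GPU, while torch.cuda.memory_stats() only covers this process's PyTorch caching allocator:

import torch

torch.cuda.init()
free_bytes, total_bytes = torch.cuda.mem_get_info()

# Device-wide usage: includes CUDA contexts and allocations of *other* processes.
device_wide_allocated = total_bytes - free_bytes

# Usage as seen by this process's PyTorch caching allocator only.
torch_allocated = torch.cuda.memory_stats().get("allocated_bytes.all.current", 0)

# With another vLLM instance already running, this "non-torch" figure is
# inflated by that instance's weights and KV cache, which is the over-count
# described above.
non_torch = device_wide_allocated - torch_allocated
print(f"device-wide allocated:     {device_wide_allocated / 2**30:.2f} GiB")
print(f"this process (torch):      {torch_allocated / 2**30:.2f} GiB")
print(f"attributed to 'non-torch': {non_torch / 2**30:.2f} GiB")

If I read the V0 code correctly, it avoided this by snapshotting free memory at startup and attributing only the delta to the current instance; a similar per-process baseline in V1 would restore the documented per-instance semantics.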

Hardware: NVIDIA RTX 4090
To reproduce the issue, you can use these commands:

docker run -d --runtime nvidia --gpus 1 \
    --env "HUGGING_FACE_HUB_TOKEN=TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:v0.8.5.post1 \
    --port 8000 \
    --model HuggingFaceTB/SmolLM-135M \
    --gpu-memory-utilization 0.3 \
    --max-num-seqs 256 \
    --tensor-parallel-size 1

docker run -d --runtime nvidia --gpus 1 \
    --env "HUGGING_FACE_HUB_TOKEN=TOKEN" \
    -p 8001:8000 \
    --ipc=host \
    vllm/vllm-openai:v0.8.5.post1 \
    --port 8000 \
    --model HuggingFaceTB/SmolLM-135M \
    --gpu-memory-utilization 0.3 \
    --max-num-seqs 256 \
    --tensor-parallel-size 1

Error message:

INFO 05-07 02:13:52 [loader.py:458] Loading weights took 0.10 seconds
INFO 05-07 02:13:52 [gpu_model_runner.py:1347] Model loading took 0.2533 GiB and 0.907387 seconds
INFO 05-07 02:13:56 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/3069c682cd/rank_0_0 for vLLM's torch.compile
INFO 05-07 02:13:56 [backends.py:430] Dynamo bytecode transform time: 3.58 s
INFO 05-07 02:13:57 [backends.py:136] Cache the graph of shape None for later use
INFO 05-07 02:14:09 [backends.py:148] Compiling a graph for general shape takes 13.43 s
INFO 05-07 02:14:18 [monitor.py:33] torch.compile takes 17.00 s in total
ERROR 05-07 02:14:18 [core.py:396] EngineCore failed to start.
ERROR 05-07 02:14:18 [core.py:396] Traceback (most recent call last):
ERROR 05-07 02:14:18 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-07 02:14:18 [core.py:396]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-07 02:14:18 [core.py:396]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-07 02:14:18 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-07 02:14:18 [core.py:396]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-07 02:14:18 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 05-07 02:14:18 [core.py:396]     self._initialize_kv_caches(vllm_config)
ERROR 05-07 02:14:18 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 134, in _initialize_kv_caches
ERROR 05-07 02:14:18 [core.py:396]     get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
ERROR 05-07 02:14:18 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 699, in get_kv_cache_config
ERROR 05-07 02:14:18 [core.py:396]     check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
ERROR 05-07 02:14:18 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 527, in check_enough_kv_cache_memory
ERROR 05-07 02:14:18 [core.py:396]     raise ValueError("No available memory for the cache blocks. "
ERROR 05-07 02:14:18 [core.py:396] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 134, in _initialize_kv_caches
    get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 699, in get_kv_cache_config
    check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 527, in check_enough_kv_cache_memory
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
[rank0]:[W507 02:14:19.323764213 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 642, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 398, in __init__
    self._wait_for_engine_startup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.

Related issues:
#17366 #14376 #10643 #16141
