[Bug]: Cannot run two containers on one card when using VLLM_USE_V1=1 #17366

Closed

@feifan-sun

Description

Your current environment

I started two Docker containers on a single GPU of an A100 (80 GB) machine to run the w8a8-quantized Qwen2.5 14B model, setting gpu-memory-utilization to 0.3 for each. The first service loads normally, but the second service fails to start, even though the GPU has enough free memory.

This is reproducible on v0.8.* versions. With VLLM_USE_V1=0, the same setup runs normally.
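
As a sanity check on the free-memory claim, here is a minimal sketch (assuming torch, which ships in the vLLM image) that prints the free/total memory reported for device 0 before the second container is launched:

import torch

# torch.cuda.mem_get_info returns (free_bytes, total_bytes) for the given device.
free, total = torch.cuda.mem_get_info(0)
print(f"free: {free / 2**30:.1f} GiB / total: {total / 2**30:.1f} GiB")
# With only the first instance running at --gpu-memory-utilization 0.3,
# well over half of the 80 GiB card is still reported as free.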

🐛 Describe the bug

The first container:

docker run -itd --gpus='"device=0"' --name vllm_qwen_first -e VLLM_USE_V1=1 \
    --shm-size 10g -v /data/model/qwen2_5_14b_w8a8/:/data/model/ --network=host \
    vllm/vllm-openai:v0.8.5  \
    --model /data/model/ --max-model-len 4096 --gpu-memory-utilization 0.3 --port 8900

This service starts and runs normally; no other services were running on the GPU beforehand.

The second container:

docker run -itd --gpus='"device=0"' --name vllm_qwen_second -e VLLM_USE_V1=1 \
    --shm-size 10g -v /data/model/qwen2_5_14b_w8a8/:/data/model/ --network=host \
    vllm/vllm-openai:v0.8.5  \
    --model /data/model/ --max-model-len 4096 --gpu-memory-utilization 0.3 --port 8901

The startup log is as follows:

INFO 04-29 01:34:32 [api_server.py:1043] vLLM API server version 0.8.5
INFO 04-29 01:34:32 [api_server.py:1044] args: Namespace(host=None, port=8901, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/model/', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='slow', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=4096, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=32, gpu_memory_utilization=0.3, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=True, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 04-29 01:34:40 [config.py:717] This model supports multiple tasks: {'embed', 'reward', 'classify', 'score', 'generate'}. Defaulting to 'generate'.
INFO 04-29 01:34:41 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-29 01:34:41 [tokenizer.py:251] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 04-29 01:34:45 [__init__.py:239] Automatically detected platform cuda.
INFO 04-29 01:34:48 [core.py:58] Initializing a V1 LLM engine (v0.8.5) with config: model='/data/model/', speculative_config=None, tokenizer='/data/model/', skip_tokenizer_init=False, tokenizer_mode=slow, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/data/model/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-29 01:34:49 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fcc25ea7a40>
INFO 04-29 01:34:49 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-29 01:34:49 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 04-29 01:34:49 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
INFO 04-29 01:34:49 [gpu_model_runner.py:1329] Starting to load model /data/model/...
INFO 04-29 01:34:50 [compressed_tensors_w8a8_int8.py:51] Using CutlassScaledMMLinearKernel for CompressedTensorsW8A8Int8
Loading safetensors checkpoint shards: 100% 4/4 [00:06<00:00,  1.68s/it]
INFO 04-29 01:34:57 [loader.py:458] Loading weights took 6.79 seconds
WARNING 04-29 01:34:57 [kv_cache.py:128] Using Q scale 1.0 and prob scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure Q/prob scaling factors are available in the fp8 checkpoint.
INFO 04-29 01:34:57 [gpu_model_runner.py:1347] Model loading took 15.4004 GiB and 7.120341 seconds
INFO 04-29 01:35:14 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/a7fca1718d/rank_0_0 for vLLM's torch.compile
INFO 04-29 01:35:14 [backends.py:430] Dynamo bytecode transform time: 16.79 s
INFO 04-29 01:35:18 [backends.py:136] Cache the graph of shape None for later use
INFO 04-29 01:36:12 [backends.py:148] Compiling a graph for general shape takes 56.67 s
INFO 04-29 01:36:33 [monitor.py:33] torch.compile takes 73.46 s in total
ERROR 04-29 01:36:35 [core.py:396] EngineCore failed to start.
ERROR 04-29 01:36:35 [core.py:396] Traceback (most recent call last):
ERROR 04-29 01:36:35 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 04-29 01:36:35 [core.py:396]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-29 01:36:35 [core.py:396]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 01:36:35 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 04-29 01:36:35 [core.py:396]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 04-29 01:36:35 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 04-29 01:36:35 [core.py:396]     self._initialize_kv_caches(vllm_config)
ERROR 04-29 01:36:35 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 134, in _initialize_kv_caches
ERROR 04-29 01:36:35 [core.py:396]     get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
ERROR 04-29 01:36:35 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 699, in get_kv_cache_config
ERROR 04-29 01:36:35 [core.py:396]     check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
ERROR 04-29 01:36:35 [core.py:396]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 527, in check_enough_kv_cache_memory
ERROR 04-29 01:36:35 [core.py:396]     raise ValueError("No available memory for the cache blocks. "
ERROR 04-29 01:36:35 [core.py:396] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 134, in _initialize_kv_caches
    get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 699, in get_kv_cache_config
    check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 527, in check_enough_kv_cache_memory
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
[rank0]:[W429 01:36:35.780245019 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 642, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 398, in __init__
    self._wait_for_engine_startup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
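
For reference, a back-of-the-envelope budget from the numbers in the log above (a simplification for illustration only, not vLLM's actual accounting): with --gpu-memory-utilization 0.3 on an 80 GiB card, each instance should get roughly a 24 GiB budget, and the log reports 15.4 GiB of weights, so several GiB should nominally remain for the KV cache:

# Rough per-instance budget, illustrative only (not vLLM's internal logic).
total_gib = 80.0                 # A100 80 GB card
budget_gib = 0.3 * total_gib     # --gpu-memory-utilization 0.3 -> ~24 GiB
weights_gib = 15.4               # "Model loading took 15.4004 GiB" from the log
headroom_gib = budget_gib - weights_gib
print(f"nominal headroom for KV cache and overheads: {headroom_gib:.1f} GiB")
# ~8.6 GiB, yet the V1 engine raises "No available memory for the cache blocks"
# only when another vLLM instance is already running on the same GPU.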

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Labels: bug (Something isn't working)