Your current environment
The output of `python collect_env.py`
Your output of `python collect_env.py` here
🐛 Describe the bug
GPU: RTX 2080 Max-Q
vLLM API server version: 0.10.2rc2.dev104
Model: unsloth/gpt-oss-20b-unsloth-bnb-4bit
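A minimal repro sketch of the same load path (the offline `LLM` entrypoint here stands in for the `python -m vllm.entrypoints.openai.api_server` launch visible in the traceback below; the model path and quantization are taken from the engine config line in the log, everything else is an assumption):

```python
# Hypothetical offline reproduction: exercises the same engine-core
# model-load path that the API server fails in. Model path and
# bitsandbytes quantization match the engine config line in the log.
from vllm import LLM

llm = LLM(
    model="/root/cache/gpt-oss-20b",
    quantization="bitsandbytes",
)
```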
Error log:
INFO 09-05 20:50:59 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=1) INFO 09-05 20:51:01 [api_server.py:1894] vLLM API server version 0.10.2rc2.dev104+gc954c6629
(APIServer pid=1) INFO 09-05 20:51:01 [utils.py:328] non-default args: {'host': '0.0.0.0', 'model': '/root/cache/gpt-oss-20b'}
(APIServer pid=1) INFO 09-05 20:51:08 [__init__.py:748] Resolved architecture: GptOssForCausalLM
(APIServer pid=1) INFO 09-05 20:51:08 [__init__.py:1786] Using max model len 131072
(APIServer pid=1) WARNING 09-05 20:51:10 [_ipex_ops.py:16] Import error msg: No module named 'intel_extension_for_pytorch'
(APIServer pid=1) WARNING 09-05 20:51:10 [__init__.py:1222] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
(APIServer pid=1) INFO 09-05 20:51:10 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 09-05 20:51:10 [config.py:276] Overriding max cuda graph capture size to 1024 for performance.
INFO 09-05 20:51:16 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=62) INFO 09-05 20:51:18 [core.py:654] Waiting for init message from front-end.
(EngineCore_0 pid=62) INFO 09-05 20:51:18 [core.py:76] Initializing a V1 LLM engine (v0.10.2rc2.dev104+gc954c6629) with config: model='/root/cache/gpt-oss-20b', speculative_config=None, tokenizer='/root/cache/gpt-oss-20b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=bitsandbytes, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='openai_gptoss'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/cache/gpt-oss-20b, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":1024,"local_cache_dir":null}
(EngineCore_0 pid=62) WARNING 09-05 20:51:18 [interface.py:393] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore_0 pid=62) W0905 20:51:18.571000 62 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=62) W0905 20:51:18.571000 62 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(EngineCore_0 pid=62) ERROR 09-05 20:51:18 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
[W905 20:51:20.845275539 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_0 pid=62) INFO 09-05 20:51:20 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=62) INFO 09-05 20:51:20 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=62) INFO 09-05 20:51:20 [gpu_model_runner.py:1922] Starting to load model /root/cache/gpt-oss-20b...
(EngineCore_0 pid=62) INFO 09-05 20:51:20 [gpu_model_runner.py:1954] Loading model from scratch...
(EngineCore_0 pid=62) INFO 09-05 20:51:20 [cuda.py:346] Using FlexAttention backend on V1 engine.
(EngineCore_0 pid=62) Process EngineCore_0:
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 505, in init
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] super().init(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in init
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in init
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] self._init_executor()
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] self.collective_rpc("load_model")
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/init.py", line 3045, in run_method
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1955, in load_model
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] self.model = model_loader.load_model(
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] model = initialize_model(vllm_config=vllm_config,
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 651, in init
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] self.model = GptOssModel(
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 199, in init
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 230, in init
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 643, in make_layers
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 232, in
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] lambda prefix: TransformerBlock(
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 180, in init
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] self.attn = OAIAttention(config,
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 109, in init
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] self.attn = Attention(
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 182, in init
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flex_attention.py", line 644, in init
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] raise NotImplementedError(
(EngineCore_0 pid=62) ERROR 09-05 20:51:21 [core.py:718] NotImplementedError: FlexAttention does not support sliding window yet.
(EngineCore_0 pid=62) Traceback (most recent call last):
(EngineCore_0 pid=62) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=62) self.run()
(EngineCore_0 pid=62) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=62) self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_0 pid=62) raise e
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=62) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 505, in init
(EngineCore_0 pid=62) super().init(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in init
(EngineCore_0 pid=62) self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in init
(EngineCore_0 pid=62) self._init_executor()
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_0 pid=62) self.collective_rpc("load_model")
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=62) answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/utils/init.py", line 3045, in run_method
(EngineCore_0 pid=62) return func(*args, **kwargs)
(EngineCore_0 pid=62) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(EngineCore_0 pid=62) self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1955, in load_model
(EngineCore_0 pid=62) self.model = model_loader.load_model(
(EngineCore_0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
(EngineCore_0 pid=62) model = initialize_model(vllm_config=vllm_config,
(EngineCore_0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(EngineCore_0 pid=62) return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 651, in init
(EngineCore_0 pid=62) self.model = GptOssModel(
(EngineCore_0 pid=62) ^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 199, in init
(EngineCore_0 pid=62) old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 230, in init
(EngineCore_0 pid=62) self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_0 pid=62) ^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 643, in make_layers
(EngineCore_0 pid=62) maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 232, in
(EngineCore_0 pid=62) lambda prefix: TransformerBlock(
(EngineCore_0 pid=62) ^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 180, in init
(EngineCore_0 pid=62) self.attn = OAIAttention(config,
(EngineCore_0 pid=62) ^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 109, in init
(EngineCore_0 pid=62) self.attn = Attention(
(EngineCore_0 pid=62) ^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 182, in init
(EngineCore_0 pid=62) self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(EngineCore_0 pid=62) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=62) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flex_attention.py", line 644, in init
(EngineCore_0 pid=62) raise NotImplementedError(
(EngineCore_0 pid=62) NotImplementedError: FlexAttention does not support sliding window yet.
[rank0]:[W905 20:51:21.138490629 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "", line 198, in _run_module_as_main
(APIServer pid=1) File "", line 88, in _run_code
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 2009, in
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 109, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 61, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1939, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1959, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/utils/init.py", line 1587, in inner
(APIServer pid=1) return fn(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 204, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 129, in init
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 767, in init
(APIServer pid=1) super().init(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 446, in init
(APIServer pid=1) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in exit
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 729, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
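For context: the 2080 Max-Q is a Turing GPU with compute capability 7.5, below the >= 8 that FA2 requires (see the fa_utils.py error above), so vLLM falls back to FlexAttention, which then rejects the model's sliding-window attention layers. A quick way to confirm the capability (plain PyTorch, nothing vLLM-specific):

```python
# Print the CUDA compute capability of the first visible GPU; a Turing
# card such as the RTX 2080 Max-Q reports (7, 5), which is why the FA2
# backend is rejected and the engine falls back to FlexAttention.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")  # expected 7.5 on a 2080 Max-Q
```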
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.