[Bug]: TimeoutError During Benchmark Profiling with Torch Profiler on vLLM v0.6.0 #8326
Closed
Description
Your current environment
The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.27.7
Libc version: glibc-2.35
Python version: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.119-19-0009.11-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A800-SXM4-80GB
Nvidia driver version: 470.182.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 232
On-line CPU(s) list: 0-231
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7K83 64-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 58
Socket(s): 2
Stepping: 1
BogoMIPS: 4890.80
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke vaes vpclmulqdq rdpid fsrm
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3.6 MiB (116 instances)
L1i cache: 3.6 MiB (116 instances)
L2 cache: 58 MiB (116 instances)
L3 cache: 512 MiB (16 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-115
NUMA node1 CPU(s): 116-231
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-cublas-cu11==11.10.3.66
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu11==11.7.101
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu11==11.7.99
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu11==11.7.99
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu11==8.5.0.96
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu11==10.9.0.58
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu11==10.2.10.91
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu11==11.4.0.1
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu11==11.7.4.91
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-dali-cuda120==1.36.0
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu11==2.14.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvimgcodec-cu12==0.2.0.7
[pip3] nvidia-nvjitlink-cu12==12.4.99
[pip3] nvidia-nvtx-cu11==11.7.91
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==3.0.0
[pip3] tritonclient==2.43.0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu11 10.2.10.91 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu11 2.14.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.99 pypi_0 pypi
[conda] nvidia-nvtx-cu11 11.7.91 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pynvml 11.5.0 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.44.2 pypi_0 pypi
[conda] transformers-stream-generator 0.0.4 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
[conda] tritonclient 2.43.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.0@32e7db25365415841ebc7c4215851743fbb1bad1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity
GPU0 X 116-231 1
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
I am trying to profile vLLM v0.6.0 by following the vLLM profiling documentation.
Here is the process I followed:
- Set up the profiler environment:
export VLLM_TORCH_PROFILER_DIR=/app/vllm_profile
- Launched the OpenAI server:
python -m vllm.entrypoints.openai.api_server --tensor-parallel-size 1 \
--model /mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 \
--trust-remote-code --max-model-len 8192
- Launched the benchmark process:
python benchmark_serving.py --backend vllm --model /mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 \
--dataset-name sharegpt --dataset-path sharegpt.json --num-prompts 2 --profile
- Error encountered:
When benchmark_serving.py issued the POST /start_profile request, it came back as 500 Internal Server Error, and the vLLM OpenAI server log shows TimeoutError: Server didn't reply within 5000 ms (a reproduction sketch without benchmark_serving.py follows the log below). The relevant section of the log is as follows:
WARNING 09-10 16:07:55 api_server.py:327] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
INFO 09-10 16:07:55 api_server.py:459] vLLM API server version 0.6.0
INFO 09-10 16:07:55 api_server.py:460] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], api_key=None, block_size=16, chat_template=None, code_revision=None, collect_detailed_traces=None, cpu_offload_gb=0, device='auto', disable_async_output_proc=False, disable_custom_all_reduce=False, disable_frontend_multiprocessing=False, disable_log_requests=False, disable_log_stats=False, disable_logprobs_during_spec_decoding=None, disable_sliding_window=False, distributed_executor_backend=None, download_dir=None, dtype='auto', enable_auto_tool_choice=False, enable_chunked_prefill=None, enable_lora=False, enable_prefix_caching=False, enable_prompt_adapter=False, enforce_eager=False, engine_use_ray=False, fully_sharded_loras=False, gpu_memory_utilization=0.9, guided_decoding_backend='outlines', host=None, ignore_patterns=[], kv_cache_dtype='auto', limit_mm_per_prompt=None, load_format='auto', long_lora_scaling_factors=None, lora_dtype='auto', lora_extra_vocab_size=256, lora_modules=None, max_context_len_to_capture=None, max_cpu_loras=None, max_log_len=None, max_logprobs=20, max_lora_rank=16, max_loras=1, max_model_len=8192, max_num_batched_tokens=None, max_num_seqs=256, max_parallel_loading_workers=None, max_prompt_adapter_token=0, max_prompt_adapters=1, max_seq_len_to_capture=8192, middleware=[], model='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', model_loader_extra_config=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, num_gpu_blocks_override=None, num_lookahead_slots=0, num_scheduler_steps=1, num_speculative_tokens=None, otlp_traces_endpoint=None, override_neuron_config=None, pipeline_parallel_size=1, port=8000, preemption_mode=None, prompt_adapters=None, qlora_adapter_name_or_path=None, quantization=None, quantization_param_path=None, ray_workers_use_nsight=False, response_role='assistant', return_tokens_as_token_ids=False, revision=None, root_path=None, rope_scaling=None, rope_theta=None, scheduler_delay_factor=0.0, seed=0, served_model_name=None, skip_tokenizer_init=False, spec_decoding_acceptance_method='rejection_sampler', speculative_disable_by_batch_size=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_model=None, speculative_model_quantization=None, ssl_ca_certs=None, ssl_cert_reqs=0, ssl_certfile=None, ssl_keyfile=None, swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_pool_extra_config=None, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_revision=None, tool_call_parser=None, trust_remote_code=True, typical_acceptance_sampler_posterior_alpha=None, typical_acceptance_sampler_posterior_threshold=None, use_v2_block_manager=False, uvicorn_log_level='info', worker_use_ray=False)
INFO 09-10 16:07:55 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/961fbd3d-8728-4690-b8fe-0026ccbf61dd for RPC Path.
INFO 09-10 16:07:55 api_server.py:176] Started engine process with PID 12240
WARNING 09-10 16:07:58 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 09-10 16:08:01 api_server.py:327] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
INFO 09-10 16:08:01 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', speculative_config=None, tokenizer='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
INFO 09-10 16:08:02 worker.py:124] Profiling enabled. Traces will be saved to: /app/vllm_profiler
INFO 09-10 16:08:02 model_runner.py:915] Starting to load model /mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8...
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:01<00:01, 1.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00, 1.66s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00, 1.62s/it]
INFO 09-10 16:08:06 model_runner.py:926] Loading model weights took 8.4939 GB
INFO 09-10 16:08:07 gpu_executor.py:122] # GPU blocks: 31260, # CPU blocks: 2048
INFO 09-10 16:08:09 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-10 16:08:09 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-10 16:08:21 model_runner.py:1335] Graph capturing finished in 12 secs.
INFO 09-10 16:08:22 api_server.py:224] vLLM to use /tmp/tmp_i4t0w55 as PROMETHEUS_MULTIPROC_DIR
WARNING 09-10 16:08:22 serving_embedding.py:190] embedding_mode is False. Embedding API will not work.
INFO 09-10 16:08:22 launcher.py:20] Available routes are:
INFO 09-10 16:08:22 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 09-10 16:08:22 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 09-10 16:08:22 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 09-10 16:08:22 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 09-10 16:08:22 launcher.py:28] Route: /health, Methods: GET
INFO 09-10 16:08:22 launcher.py:28] Route: /tokenize, Methods: POST
INFO 09-10 16:08:22 launcher.py:28] Route: /detokenize, Methods: POST
INFO 09-10 16:08:22 launcher.py:28] Route: /v1/models, Methods: GET
INFO 09-10 16:08:22 launcher.py:28] Route: /version, Methods: GET
INFO 09-10 16:08:22 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 09-10 16:08:22 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 09-10 16:08:22 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 09-10 16:08:22 launcher.py:28] Route: /start_profile, Methods: POST
INFO 09-10 16:08:22 launcher.py:28] Route: /stop_profile, Methods: POST
INFO 09-10 16:08:22 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO: Started server process [12171]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 09-10 16:08:32 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-10 16:10:21 logger.py:36] Received request cmpl-11174501eb824056b9a481a9ca45afad-0: prompt: 'Do you know the book Traction by Gino Wickman', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=119, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [5519, 499, 1440, 279, 2363, 350, 16597, 555, 480, 3394, 75206, 1543], lora_request: None, prompt_adapter_request: None.
INFO: 127.0.0.1:45336 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 09-10 16:10:21 async_llm_engine.py:206] Added request cmpl-11174501eb824056b9a481a9ca45afad-0.
INFO 09-10 16:10:21 metrics.py:351] Avg prompt throughput: 1.3 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-10 16:10:22 async_llm_engine.py:174] Finished request cmpl-11174501eb824056b9a481a9ca45afad-0.
INFO 09-10 16:10:22 async_llm_engine.py:174] Finished request cmpl-11174501eb824056b9a481a9ca45afad-0.
INFO 09-10 16:10:22 api_server.py:333] Starting profiler...
INFO 09-10 16:10:22 server.py:134] Starting profiler...
INFO: 127.0.0.1:45346 - "POST /start_profile HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/routing.py", line 758, in __call__
await self.middleware_stack(scope, receive, send)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/routing.py", line 778, in app
await route.handle(scope, receive, send)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/routing.py", line 299, in handle
await self.app(scope, receive, send)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/routing.py", line 79, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/starlette/routing.py", line 74, in app
response = await func(request)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/entrypoints/openai/api_server.py", line 334, in start_profile
await async_engine_client.start_profile()
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/entrypoints/openai/rpc/client.py", line 442, in start_profile
await self._send_one_way_rpc_request(
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/entrypoints/openai/rpc/client.py", line 258, in _send_one_way_rpc_request
response = await do_rpc_call(socket, request)
File "/mnt/llm_dataset/willhe/miniconda3/envs/vllm/lib/python3.8/site-packages/vllm/entrypoints/openai/rpc/client.py", line 248, in do_rpc_call
raise TimeoutError("Server didn't reply within "
TimeoutError: Server didn't reply within 5000 ms
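For reference, the failure can be checked without benchmark_serving.py by calling the profiler endpoints directly. This is a minimal sketch, assuming the server is still running on the default port 8000 from the startup log; the /start_profile, /stop_profile, and /v1/completions routes are taken from the route list printed at startup, and the model path matches the launch command above:

```bash
# Reproduce the profiler-start failure without benchmark_serving.py.
# Routes and port come from the server startup log above.
curl -sS -X POST http://localhost:8000/start_profile

# Send one completion so there is something to trace.
curl -sS http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8", "prompt": "Hello", "max_tokens": 16}'

# Stop the profiler. If tracing actually started on the engine side despite the
# frontend timeout, a trace file should appear under $VLLM_TORCH_PROFILER_DIR.
curl -sS -X POST http://localhost:8000/stop_profile
ls -lh /app/vllm_profile
```

If /start_profile still returns 500 but a trace file eventually appears in the output directory, that would suggest the profiler itself starts fine and only the 5000 ms RPC acknowledgement deadline is being exceeded.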
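The traceback shows the timeout is raised in the multiprocessing frontend's RPC client (vllm/entrypoints/openai/rpc/client.py), not in the profiler itself. A possible diagnostic, assuming the in-process frontend exposes the same start_profile path without going through that RPC socket, is to relaunch the server with --disable-frontend-multiprocessing (the flag already appears in the startup log) and rerun the same benchmark:

```bash
# Diagnostic sketch: run the API server and engine in one process so the
# profiler start does not go through the 5000 ms RPC deadline raised in
# vllm/entrypoints/openai/rpc/client.py (see traceback above).
export VLLM_TORCH_PROFILER_DIR=/app/vllm_profile
python -m vllm.entrypoints.openai.api_server --tensor-parallel-size 1 \
    --model /mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 \
    --trust-remote-code --max-model-len 8192 \
    --disable-frontend-multiprocessing
```

If profiling succeeds in this mode, the problem is likely that starting torch.profiler takes longer than the frontend's 5000 ms RPC reply deadline rather than the profiler being broken.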