Description
Your current environment
The output of `python collect_env.py`
INFO 03-18 19:50:09 [__init__.py:256] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-4.18.0-477.27.1.el8_8.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 NVL
GPU 1: NVIDIA H100 NVL
GPU 2: NVIDIA H100 NVL
GPU 3: NVIDIA H100 NVL
Nvidia driver version: 535.183.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 45 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6448Y
CPU family: 6
Model: 143
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 96
Stepping: 8
BogoMIPS: 4199.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear flush_l1d arch_capabilities
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 4.5 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 192 MiB (96 instances)
L3 cache: 5.6 GiB (96 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer-python==0.2.1.post2+cu124torch2.6
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.3.0
[pip3] torch==2.6.0+cu124
[pip3] torchaudio==2.6.0+cu124
[pip3] torchvision==0.21.0+cu124
[pip3] transformers==4.48.3
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.0rc2.dev9+g6eaf1e5c
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB PHB NV12 0-95 0-1 N/A
GPU1 PHB X NV12 PHB 0-95 0-1 N/A
GPU2 PHB NV12 X PHB 0-95 0-1 N/A
GPU3 NV12 PHB PHB X 0-95 0-1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_VISIBLE_DEVICES=/var/run/nvidia-container-devices
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_ALLOW_RUNTIME_LORA_UPDATING=true
LD_LIBRARY_PATH=/opt/intel/oneapi/tbb/latest/lib/intel64/gcc4.8:/opt/intel/oneapi/mkl/latest/lib/intel64:/opt/intel/oneapi/compiler/latest/linux/lib:/opt/intel/oneapi/compiler/latest/linux/lib/x64:/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin:/usr/local/cuda-12.4/lib64
VLLM_LOGGING_LEVEL=INFO
VLLM_USE_V1=0
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
When the model server is launched with --lora-extra-vocab-size 0
(to optimize for LoRA adapters that have not been trained with extra special tokens), the engine fails during the profile run stage:
Traceback:
INFO 03-17 21:31:59 [__init__.py:256] Automatically detected platform cuda.
WARNING 03-17 21:32:01 [api_server.py:750] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
INFO 03-17 21:32:01 [api_server.py:972] vLLM API server version 0.8.0rc2.dev9+g6eaf1e5c
INFO 03-17 21:32:01 [api_server.py:973] args: Namespace(subparser='serve', model_tag='/app/model/Llama-3.3-70B-Instruct/', config='', host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, tool_call_parser='llama3_json', tool_parser_plugin='', model='/app/model/Llama-3.3-70B-Instruct/', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=131072, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=8192, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=True, enable_lora_bias=False, max_loras=1, max_lora_rank=8, lora_extra_vocab_size=0, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=20, fully_sharded_loras=True, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['meta-llama/Llama-3.3-70B-Instruct'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, 
worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f12b6c36950>)
INFO 03-17 21:32:07 [config.py:583] This model supports multiple tasks: {'classify', 'score', 'reward', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 03-17 21:32:07 [config.py:1499] Defaulting to use mp for distributed inference
INFO 03-17 21:32:07 [config.py:1677] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 03-17 21:32:07 [config.py:2350] LoRA with chunked prefill is still experimental and may be unstable.
INFO 03-17 21:32:07 [api_server.py:236] Started engine process with PID 289
INFO 03-17 21:32:09 [__init__.py:256] Automatically detected platform cuda.
WARNING 03-17 21:32:11 [api_server.py:750] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
INFO 03-17 21:32:11 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.0rc2.dev9+g6eaf1e5c) with config: model='/app/model/Llama-3.3-70B-Instruct/', speculative_config=None, tokenizer='/app/model/Llama-3.3-70B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=meta-llama/Llama-3.3-70B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 03-17 21:32:11 [multiproc_worker_utils.py:310] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-17 21:32:11 [custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 03-17 21:32:12 [cuda.py:285] Using Flash Attention backend.
INFO 03-17 21:32:14 [__init__.py:256] Automatically detected platform cuda.
INFO 03-17 21:32:14 [__init__.py:256] Automatically detected platform cuda.
INFO 03-17 21:32:14 [__init__.py:256] Automatically detected platform cuda.
WARNING 03-17 21:32:15 [api_server.py:750] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
WARNING 03-17 21:32:15 [api_server.py:750] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
(VllmWorkerProcess pid=363) INFO 03-17 21:32:15 [multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=362) INFO 03-17 21:32:15 [multiproc_worker_utils.py:229] Worker ready; awaiting tasks
WARNING 03-17 21:32:15 [api_server.py:750] LoRA dynamic loading & unloading is enabled in the API server. This should ONLY be used for local development!
(VllmWorkerProcess pid=361) INFO 03-17 21:32:15 [multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=361) INFO 03-17 21:32:16 [cuda.py:285] Using Flash Attention backend.
(VllmWorkerProcess pid=362) INFO 03-17 21:32:16 [cuda.py:285] Using Flash Attention backend.
(VllmWorkerProcess pid=363) INFO 03-17 21:32:16 [cuda.py:285] Using Flash Attention backend.
(VllmWorkerProcess pid=362) INFO 03-17 21:32:17 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=361) INFO 03-17 21:32:17 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=363) INFO 03-17 21:32:17 [utils.py:925] Found nccl from library libnccl.so.2
INFO 03-17 21:32:17 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=362) INFO 03-17 21:32:17 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=361) INFO 03-17 21:32:17 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=363) INFO 03-17 21:32:17 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 03-17 21:32:17 [pynccl.py:69] vLLM is using nccl==2.21.5
WARNING 03-17 21:32:18 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=363) WARNING 03-17 21:32:18 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=362) WARNING 03-17 21:32:18 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=361) WARNING 03-17 21:32:18 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 03-17 21:32:18 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_a8ef6225'), local_subscribe_addr='ipc:///tmp/2eb49f57-c5ad-460a-a484-5388d5ee459e', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=361) INFO 03-17 21:32:18 [parallel_state.py:948] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 03-17 21:32:18 [parallel_state.py:948] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorkerProcess pid=363) INFO 03-17 21:32:18 [parallel_state.py:948] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3
(VllmWorkerProcess pid=362) INFO 03-17 21:32:18 [parallel_state.py:948] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2
INFO 03-17 21:32:18 [model_runner.py:1110] Starting to load model /app/model/Llama-3.3-70B-Instruct/...
(VllmWorkerProcess pid=362) INFO 03-17 21:32:18 [model_runner.py:1110] Starting to load model /app/model/Llama-3.3-70B-Instruct/...
(VllmWorkerProcess pid=361) INFO 03-17 21:32:18 [model_runner.py:1110] Starting to load model /app/model/Llama-3.3-70B-Instruct/...
(VllmWorkerProcess pid=363) INFO 03-17 21:32:18 [model_runner.py:1110] Starting to load model /app/model/Llama-3.3-70B-Instruct/...
INFO 03-17 21:37:05 [loader.py:429] Loading weights took 286.19 seconds
INFO 03-17 21:37:05 [punica_selector.py:18] Using PunicaWrapperGPU.
(VllmWorkerProcess pid=363) INFO 03-17 21:37:05 [loader.py:429] Loading weights took 286.41 seconds
(VllmWorkerProcess pid=362) INFO 03-17 21:37:05 [loader.py:429] Loading weights took 286.41 seconds
(VllmWorkerProcess pid=361) INFO 03-17 21:37:05 [loader.py:429] Loading weights took 286.41 seconds
(VllmWorkerProcess pid=362) INFO 03-17 21:37:05 [punica_selector.py:18] Using PunicaWrapperGPU.
(VllmWorkerProcess pid=363) INFO 03-17 21:37:05 [punica_selector.py:18] Using PunicaWrapperGPU.
(VllmWorkerProcess pid=361) INFO 03-17 21:37:05 [punica_selector.py:18] Using PunicaWrapperGPU.
(VllmWorkerProcess pid=362) INFO 03-17 21:37:05 [model_runner.py:1146] Model loading took 32.9429 GB and 286.704138 seconds
(VllmWorkerProcess pid=363) INFO 03-17 21:37:05 [model_runner.py:1146] Model loading took 32.9429 GB and 286.708337 seconds
INFO 03-17 21:37:05 [model_runner.py:1146] Model loading took 32.9429 GB and 286.507457 seconds
(VllmWorkerProcess pid=361) INFO 03-17 21:37:05 [model_runner.py:1146] Model loading took 32.9429 GB and 286.720625 seconds
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] Traceback (most recent call last):
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 236, in _run_worker_process
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/utils.py", line 2216, in run_method
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] return func(*args, **kwargs)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] return func(*args, **kwargs)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] self.model_runner.profile_run()
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] return func(*args, **kwargs)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1354, in _dummy_run
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] return func(*args, **kwargs)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1669, in execute_model
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] self.set_active_loras(model_input.lora_requests,
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1371, in set_active_loras
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] self.lora_manager.set_active_adapters(lora_requests, lora_mapping)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 167, in set_active_adapters
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] set_active_adapters_worker(requests, mapping, self._apply_adapters,
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/adapter_commons/utils.py", line 54, in set_active_adapters_worker
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] apply_adapters_func(requests)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 227, in _apply_adapters
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] self.add_adapter(lora)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 250, in add_adapter
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] self._adapter_manager.activate_adapter(lora_request.lora_int_id)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/models.py", line 720, in activate_adapter
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] result = super().activate_adapter(lora_id)
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/models.py", line 405, in activate_adapter
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] module.set_lora(index, module_lora.lora_a, module_lora.lora_b,
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/layers.py", line 223, in set_lora
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] self.embeddings_tensors[
(VllmWorkerProcess pid=361) ERROR 03-17 21:37:05 [multiproc_worker_utils.py:242] RuntimeError: The size of tensor a (0) must match the size of tensor b (10) at non-singleton dimension 0
(Workers with pid=362 and pid=363 fail with the same traceback.)
ERROR 03-17 21:37:06 [engine.py:443] The size of tensor a (0) must match the size of tensor b (10) at non-singleton dimension 0
ERROR 03-17 21:37:06 [engine.py:443] Traceback (most recent call last):
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 431, in run_mp_engine
ERROR 03-17 21:37:06 [engine.py:443] engine = MQLLMEngine.from_vllm_config(
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 126, in from_vllm_config
ERROR 03-17 21:37:06 [engine.py:443] return cls(
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 80, in __init__
ERROR 03-17 21:37:06 [engine.py:443] self.engine = LLMEngine(*args, **kwargs)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 283, in __init__
ERROR 03-17 21:37:06 [engine.py:443] self._initialize_kv_caches()
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 432, in _initialize_kv_caches
ERROR 03-17 21:37:06 [engine.py:443] self.model_executor.determine_num_available_blocks())
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 102, in determine_num_available_blocks
ERROR 03-17 21:37:06 [engine.py:443] results = self.collective_rpc("determine_num_available_blocks")
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 316, in collective_rpc
ERROR 03-17 21:37:06 [engine.py:443] return self._run_workers(method, *args, **(kwargs or {}))
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 03-17 21:37:06 [engine.py:443] driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/utils.py", line 2216, in run_method
ERROR 03-17 21:37:06 [engine.py:443] return func(*args, **kwargs)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-17 21:37:06 [engine.py:443] return func(*args, **kwargs)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
ERROR 03-17 21:37:06 [engine.py:443] self.model_runner.profile_run()
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-17 21:37:06 [engine.py:443] return func(*args, **kwargs)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
ERROR 03-17 21:37:06 [engine.py:443] self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1354, in _dummy_run
ERROR 03-17 21:37:06 [engine.py:443] self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-17 21:37:06 [engine.py:443] return func(*args, **kwargs)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1669, in execute_model
ERROR 03-17 21:37:06 [engine.py:443] self.set_active_loras(model_input.lora_requests,
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1371, in set_active_loras
ERROR 03-17 21:37:06 [engine.py:443] self.lora_manager.set_active_adapters(lora_requests, lora_mapping)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 167, in set_active_adapters
ERROR 03-17 21:37:06 [engine.py:443] set_active_adapters_worker(requests, mapping, self._apply_adapters,
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/adapter_commons/utils.py", line 54, in set_active_adapters_worker
ERROR 03-17 21:37:06 [engine.py:443] apply_adapters_func(requests)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 227, in _apply_adapters
ERROR 03-17 21:37:06 [engine.py:443] self.add_adapter(lora)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 250, in add_adapter
ERROR 03-17 21:37:06 [engine.py:443] self._adapter_manager.activate_adapter(lora_request.lora_int_id)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/models.py", line 720, in activate_adapter
ERROR 03-17 21:37:06 [engine.py:443] result = super().activate_adapter(lora_id)
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/models.py", line 405, in activate_adapter
ERROR 03-17 21:37:06 [engine.py:443] module.set_lora(index, module_lora.lora_a, module_lora.lora_b,
ERROR 03-17 21:37:06 [engine.py:443] File "/app/.venv/lib/python3.10/site-packages/vllm/lora/layers.py", line 223, in set_lora
ERROR 03-17 21:37:06 [engine.py:443] self.embeddings_tensors[
ERROR 03-17 21:37:06 [engine.py:443] RuntimeError: The size of tensor a (0) must match the size of tensor b (10) at non-singleton dimension 0
If --lora-extra-vocab-size is left unset or is set to a nonzero value (with no other arguments changed), the model server starts fine.
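For reference, the size mismatch can be reproduced in isolation with plain PyTorch. The snippet below is only a hypothetical sketch of what the failing assignment in vllm/lora/layers.py appears to do; the tensor names and the hidden size are illustrative, and the sizes 0 and 10 come from the error message above:

```python
import torch

# --lora-extra-vocab-size 0 makes the per-adapter extra-vocab embeddings buffer
# zero-sized, while the dummy adapter built for the profile run apparently still
# carries 10 extra-token rows (tensor b in the reported RuntimeError).
hidden_size = 8192       # illustrative
extra_vocab_size = 0     # --lora-extra-vocab-size 0
dummy_extra_rows = 10    # size of tensor b in the error message

embeddings_tensors = torch.zeros(1, extra_vocab_size, hidden_size)  # (max_loras, 0, hidden)
embeddings = torch.zeros(dummy_extra_rows, hidden_size)             # (10, hidden)

try:
    # Slicing a zero-sized dimension yields a zero-sized view, so assigning a
    # (10, hidden) tensor into it raises a size-mismatch RuntimeError analogous
    # to the one in the tracebacks above.
    embeddings_tensors[0, : embeddings.shape[0]] = embeddings
except RuntimeError as e:
    print(e)
```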
Minimal reproducible example
VLLM_USE_V1=0 vllm serve /app/model/Llama-3.3-70B-Instruct/ --served-model-name meta-llama/Llama-3.3-70B-Instruct --gpu-memory-utilization 0.96 --tensor-parallel-size 4 --max-model-len 131072 --enable-chunked-prefill --max-num-batched-tokens 8192 --enable-auto-tool-choice --tool-call-parser llama3_json --enable-lora --fully-sharded-lora --max-loras 1 --max-lora-rank 8 --lora-extra-vocab-size 0
(No pretrained LoRA adapter even needs to be loaded; the server fails during startup regardless.)
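For completeness, the same configuration should be expressible through the offline Python API as well. This is an untested sketch; the keyword arguments simply mirror the CLI flags above and are assumed to be forwarded to EngineArgs:

```python
import os

os.environ["VLLM_USE_V1"] = "0"  # same V0 engine selection as in the command above

from vllm import LLM

# Mirrors the `vllm serve` invocation above. No LoRA adapter is loaded;
# the failure is expected during engine initialization (profile run).
llm = LLM(
    model="/app/model/Llama-3.3-70B-Instruct/",
    served_model_name="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
    max_model_len=131072,
    gpu_memory_utilization=0.96,
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
    enable_lora=True,
    fully_sharded_loras=True,
    max_loras=1,
    max_lora_rank=8,
    lora_extra_vocab_size=0,  # the flag that triggers the crash
)
```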