Labels: bug (Something isn't working), stale (over 90 days of inactivity)
Description
Your current environment
The output of `python collect_env.py`
INFO 04-20 16:18:53 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35
Python version: 3.12.10 (main, Apr 9 2025, 08:55:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.119-19.0009.40-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L40
GPU 1: NVIDIA L40
GPU 2: NVIDIA L40
GPU 3: NVIDIA L40
Nvidia driver version: 535.161.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9K84 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Stepping: 0
BogoMIPS: 5200.02
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 avx512_bf16 clzero xsaveerptr wbnoinvd arat avx512vbmi umip avx512_vbmi2 vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 6 MiB (192 instances)
L1i cache: 6 MiB (192 instances)
L2 cache: 192 MiB (192 instances)
L3 cache: 768 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-191
NUMA node1 CPU(s): 192-383
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer-python==0.2.1.post2+cu124torch2.6
[pip3] numpy==2.2.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PIX NODE SYS 0-191 0 N/A
GPU1 PIX X NODE SYS 0-191 0 N/A
GPU2 NODE NODE X SYS 0-191 0 N/A
GPU3 SYS SYS SYS X 192-383 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_VISIBLE_DEVICES=GPU-8b2915b7-6fa3-7a6e-f265-1739493bb256,GPU-790d8ecb-b336-1bb8-a062-3d41c02266db,GPU-69a60cf4-58ad-f20b-55bd-f4f61a3f91f9,GPU-8064c094-9d29-79ce-ff31-3a85ba7d9db4
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NCCL_VERSION=2.20.5-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.4.0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
vLLM 0.8.3/0.8.4:
- In v0 mode, ngram speculative decoding works as expected.
- In v1 mode, ngram speculative decoding does not work.
vLLM 0.7.3:
- Both v0 and v1 modes work as expected.
Let me know if I missed anything.
Using v0:
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve /data/code/Qwen2.5-Coder-32B-Instruct \
--host 0.0.0.0 \
--port 8080 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--dtype auto \
--tensor-parallel-size 4 \
--max-num-batched-tokens 131072 \
--max-model-len 131072 \
--max-num-seqs 8 \
--enable-prefix-caching \
--speculative-config '{"method": "ngram", "prompt_lookup_min": 10, "prompt_lookup_max": 50, "num_speculative_tokens": 300}' \
--rope-scaling '{ "factor": 4.0, "original_max_position_embeddings": 32768, "rope_type": "yarn" }' \
--enforce-eager
I can see in the logs that SpecDecodeWorker is active, and completions are extremely fast.
INFO 04-20 16:25:45 [spec_decode_worker.py:221] [Speculative Decoding] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
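A quick way to exercise the server and get a rough feel for completion latency is a plain request against the OpenAI-compatible /v1/completions endpoint. This is only a sketch; the prompt and max_tokens below are placeholders, not the exact requests used:
# Placeholder request; `time` gives a rough end-to-end latency for comparison.
time curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/data/code/Qwen2.5-Coder-32B-Instruct",
        "prompt": "def fibonacci(n):",
        "max_tokens": 256,
        "temperature": 0
      }' | head -c 300
Running the same request against the v0 and v1 configurations makes the difference in completion speed easy to compare.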
The output of `vllm serve`
INFO 04-20 16:25:33 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 16:25:34 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-20 16:25:34 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='/data/code/Qwen2.5-Coder-32B-Instruct', config='', host='0.0.0.0', port=8080, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/code/Qwen2.5-Coder-32B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=131072, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=True, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=131072, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=8, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling={'factor': 4.0, 'original_max_position_embeddings': 32768, 'rope_type': 'yarn'}, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config={'method': 'ngram', 'prompt_lookup_min': 10, 'prompt_lookup_max': 50, 'num_speculative_tokens': 300}, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, 
disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f84b41f8360>)
INFO 04-20 16:25:34 [config.py:428] Overriding HF config with {'rope_scaling': {'factor': 4.0, 'original_max_position_embeddings': 32768, 'rope_type': 'yarn'}}
INFO 04-20 16:25:39 [config.py:689] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 04-20 16:25:40 [arg_utils.py:1742] ngram is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
WARNING 04-20 16:25:40 [arg_utils.py:1603] The model has a long context length (131072). This may cause OOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 04-20 16:25:40 [config.py:1713] Defaulting to use mp for distributed inference
WARNING 04-20 16:25:40 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-20 16:25:40 [api_server.py:246] Started engine process with PID 1793
INFO 04-20 16:25:42 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 16:25:43 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/data/code/Qwen2.5-Coder-32B-Instruct', speculative_config=SpeculativeConfig(method='ngram', model=None, num_spec_tokens=300), tokenizer='/data/code/Qwen2.5-Coder-32B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/data/code/Qwen2.5-Coder-32B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
WARNING 04-20 16:25:43 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 192 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-20 16:25:45 [cuda.py:292] Using Flash Attention backend.
WARNING 04-20 16:25:45 [utils.py:2444] Methods determine_num_available_blocks,device_config not implemented in <vllm.spec_decode.ngram_worker.NGramWorker object at 0x7f2c144ecf20>
INFO 04-20 16:25:45 [spec_decode_worker.py:209] Configuring SpecDecodeWorker with proposer=<class 'vllm.spec_decode.ngram_worker.NGramWorker'>
INFO 04-20 16:25:45 [rejection_sampler.py:60] Use pytorch for rejection sampling.
INFO 04-20 16:25:45 [spec_decode_worker.py:221] [Speculative Decoding] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
INFO 04-20 16:25:46 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 16:25:46 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 16:25:46 [__init__.py:239] Automatically detected platform cuda.
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:47 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:47 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:47 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:48 [cuda.py:292] Using Flash Attention backend.
(VllmWorkerProcess pid=1866) WARNING 04-20 16:25:48 [utils.py:2444] Methods determine_num_available_blocks,device_config not implemented in <vllm.spec_decode.ngram_worker.NGramWorker object at 0x7f82f0c4acc0>
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:48 [spec_decode_worker.py:209] Configuring SpecDecodeWorker with proposer=<class 'vllm.spec_decode.ngram_worker.NGramWorker'>
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:48 [rejection_sampler.py:60] Use pytorch for rejection sampling.
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:48 [spec_decode_worker.py:221] [Speculative Decoding] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:48 [cuda.py:292] Using Flash Attention backend.
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:48 [cuda.py:292] Using Flash Attention backend.
(VllmWorkerProcess pid=1867) WARNING 04-20 16:25:48 [utils.py:2444] Methods determine_num_available_blocks,device_config not implemented in <vllm.spec_decode.ngram_worker.NGramWorker object at 0x7f9ac0b92660>
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:48 [spec_decode_worker.py:209] Configuring SpecDecodeWorker with proposer=<class 'vllm.spec_decode.ngram_worker.NGramWorker'>
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:48 [rejection_sampler.py:60] Use pytorch for rejection sampling.
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:48 [spec_decode_worker.py:221] [Speculative Decoding] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
(VllmWorkerProcess pid=1865) WARNING 04-20 16:25:48 [utils.py:2444] Methods determine_num_available_blocks,device_config not implemented in <vllm.spec_decode.ngram_worker.NGramWorker object at 0x7f1e649469c0>
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:48 [spec_decode_worker.py:209] Configuring SpecDecodeWorker with proposer=<class 'vllm.spec_decode.ngram_worker.NGramWorker'>
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:48 [rejection_sampler.py:60] Use pytorch for rejection sampling.
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:48 [spec_decode_worker.py:221] [Speculative Decoding] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:49 [utils.py:993] Found nccl from library libnccl.so.2
INFO 04-20 16:25:49 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:49 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:49 [utils.py:993] Found nccl from library libnccl.so.2
INFO 04-20 16:25:49 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:49 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:49 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:49 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=1866) WARNING 04-20 16:25:49 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 04-20 16:25:49 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1867) WARNING 04-20 16:25:49 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1865) WARNING 04-20 16:25:49 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 04-20 16:25:49 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_326ecaed'), local_subscribe_addr='ipc:///tmp/a176a251-764c-438a-9721-a72d80d8024a', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:49 [parallel_state.py:959] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:49 [parallel_state.py:959] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:49 [parallel_state.py:959] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:49 [model_runner.py:1110] Starting to load model /data/code/Qwen2.5-Coder-32B-Instruct...
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:49 [model_runner.py:1110] Starting to load model /data/code/Qwen2.5-Coder-32B-Instruct...
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:49 [model_runner.py:1110] Starting to load model /data/code/Qwen2.5-Coder-32B-Instruct...
INFO 04-20 16:25:49 [parallel_state.py:959] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-20 16:25:49 [model_runner.py:1110] Starting to load model /data/code/Qwen2.5-Coder-32B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:00<00:04, 2.78it/s]
Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:00<00:05, 2.27it/s]
Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:01<00:05, 2.19it/s]
Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:01<00:04, 2.22it/s]
Loading safetensors checkpoint shards: 36% Completed | 5/14 [00:02<00:04, 2.21it/s]
Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:02<00:03, 2.17it/s]
Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:03<00:03, 2.23it/s]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:03<00:02, 2.27it/s]
Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:04<00:02, 2.23it/s]
Loading safetensors checkpoint shards: 71% Completed | 10/14 [00:04<00:01, 2.17it/s]
Loading safetensors checkpoint shards: 79% Completed | 11/14 [00:05<00:01, 2.06it/s]
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:55 [loader.py:458] Loading weights took 5.57 seconds
Loading safetensors checkpoint shards: 86% Completed | 12/14 [00:05<00:01, 1.98it/s]
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:55 [loader.py:458] Loading weights took 5.72 seconds
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:55 [model_runner.py:1146] Model loading took 15.4132 GiB and 5.785068 seconds
(VllmWorkerProcess pid=1865) INFO 04-20 16:25:55 [spec_decode_worker.py:382] [Speculative Decoding] Use MQA scorer for scoring proposals.
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:56 [model_runner.py:1146] Model loading took 15.4132 GiB and 5.909193 seconds
(VllmWorkerProcess pid=1867) INFO 04-20 16:25:56 [spec_decode_worker.py:382] [Speculative Decoding] Use MQA scorer for scoring proposals.
Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:06<00:00, 2.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:06<00:00, 2.49it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:06<00:00, 2.24it/s]
INFO 04-20 16:25:56 [loader.py:458] Loading weights took 6.28 seconds
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:56 [loader.py:458] Loading weights took 6.50 seconds
INFO 04-20 16:25:56 [model_runner.py:1146] Model loading took 15.4132 GiB and 6.468552 seconds
INFO 04-20 16:25:56 [spec_decode_worker.py:382] [Speculative Decoding] Use MQA scorer for scoring proposals.
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:56 [model_runner.py:1146] Model loading took 15.4132 GiB and 6.686331 seconds
(VllmWorkerProcess pid=1866) INFO 04-20 16:25:56 [spec_decode_worker.py:382] [Speculative Decoding] Use MQA scorer for scoring proposals.
(VllmWorkerProcess pid=1866) INFO 04-20 16:26:30 [worker.py:267] Memory profiling takes 33.55 seconds
(VllmWorkerProcess pid=1866) INFO 04-20 16:26:30 [worker.py:267] the current vLLM instance can use total_gpu_memory (44.32GiB) x gpu_memory_utilization (0.90) = 39.89GiB
(VllmWorkerProcess pid=1866) INFO 04-20 16:26:30 [worker.py:267] model weights take 15.41GiB; non_torch_memory takes 0.37GiB; PyTorch activation peak memory takes 11.32GiB; the rest of the memory reserved for KV Cache is 12.78GiB.
(VllmWorkerProcess pid=1867) INFO 04-20 16:26:30 [worker.py:267] Memory profiling takes 33.59 seconds
(VllmWorkerProcess pid=1867) INFO 04-20 16:26:30 [worker.py:267] the current vLLM instance can use total_gpu_memory (44.32GiB) x gpu_memory_utilization (0.90) = 39.89GiB
(VllmWorkerProcess pid=1867) INFO 04-20 16:26:30 [worker.py:267] model weights take 15.41GiB; non_torch_memory takes 0.33GiB; PyTorch activation peak memory takes 11.32GiB; the rest of the memory reserved for KV Cache is 12.82GiB.
(VllmWorkerProcess pid=1865) INFO 04-20 16:26:30 [worker.py:267] Memory profiling takes 33.60 seconds
(VllmWorkerProcess pid=1865) INFO 04-20 16:26:30 [worker.py:267] the current vLLM instance can use total_gpu_memory (44.32GiB) x gpu_memory_utilization (0.90) = 39.89GiB
(VllmWorkerProcess pid=1865) INFO 04-20 16:26:30 [worker.py:267] model weights take 15.41GiB; non_torch_memory takes 0.37GiB; PyTorch activation peak memory takes 11.32GiB; the rest of the memory reserved for KV Cache is 12.78GiB.
INFO 04-20 16:26:30 [worker.py:267] Memory profiling takes 33.63 seconds
INFO 04-20 16:26:30 [worker.py:267] the current vLLM instance can use total_gpu_memory (44.32GiB) x gpu_memory_utilization (0.90) = 39.89GiB
INFO 04-20 16:26:30 [worker.py:267] model weights take 15.41GiB; non_torch_memory takes 0.42GiB; PyTorch activation peak memory takes 11.32GiB; the rest of the memory reserved for KV Cache is 12.74GiB.
INFO 04-20 16:26:31 [executor_base.py:112] # cuda blocks: 13042, # CPU blocks: 4096
INFO 04-20 16:26:31 [executor_base.py:117] Maximum concurrency for 131072 tokens per request: 1.59x
INFO 04-20 16:26:32 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 36.00 seconds
WARNING 04-20 16:26:33 [config.py:1177] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 04-20 16:26:33 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 04-20 16:26:33 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 04-20 16:26:33 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8080
Using v1:
export VLLM_USE_V1=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve /data/code/Qwen2.5-Coder-32B-Instruct \
--host 0.0.0.0 \
--port 8080 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--dtype auto \
--tensor-parallel-size 4 \
--max-num-batched-tokens 131072 \
--max-model-len 131072 \
--max-num-seqs 8 \
--enable-prefix-caching \
--speculative-config '{"method": "ngram", "prompt_lookup_min": 10, "prompt_lookup_max": 50, "num_speculative_tokens": 300}' \
--rope-scaling '{ "factor": 4.0, "original_max_position_embeddings": 32768, "rope_type": "yarn" }' \
--enforce-eager
However, in v1 I do not see any indication in the logs that SpecDecodeWorker is running, and completions are slow.
WARNING 04-20 16:37:32 [arg_utils.py:1736] Detected VLLM_USE_V1=1 with ngram. Usage should be considered experimental. Please report any issues on Github.
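Besides the log output, another way to check whether speculative decoding is actually running is the Prometheus metrics endpoint of the OpenAI-compatible server. This is only a sketch, and the spec-decode metric names vary between engine versions (they may not be exported by the V1 engine at all, which could itself be related to this issue):
# Assumes the default Prometheus /metrics endpoint of `vllm serve`.
# With the v0 engine, spec-decode counters such as draft/accepted token totals
# are normally listed here; an empty result suggests the drafter is not active.
curl -s http://localhost:8080/metrics | grep -i spec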
The output of `vllm serve`
INFO 04-20 16:37:24 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 16:37:26 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-20 16:37:26 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='/data/code/Qwen2.5-Coder-32B-Instruct', config='', host='0.0.0.0', port=8080, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/code/Qwen2.5-Coder-32B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=131072, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=True, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=131072, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=8, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling={'factor': 4.0, 'original_max_position_embeddings': 32768, 'rope_type': 'yarn'}, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config={'method': 'ngram', 'prompt_lookup_min': 10, 'prompt_lookup_max': 50, 'num_speculative_tokens': 300}, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, 
disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7fbb35c46340>)
INFO 04-20 16:37:26 [config.py:428] Overriding HF config with {'rope_scaling': {'factor': 4.0, 'original_max_position_embeddings': 32768, 'rope_type': 'yarn'}}
INFO 04-20 16:37:32 [config.py:689] This model supports multiple tasks: {'classify', 'embed', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
WARNING 04-20 16:37:32 [arg_utils.py:1736] Detected VLLM_USE_V1=1 with ngram. Usage should be considered experimental. Please report any issues on Github.
INFO 04-20 16:37:32 [config.py:1713] Defaulting to use mp for distributed inference
INFO 04-20 16:37:32 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=131072.
WARNING 04-20 16:37:32 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-20 16:37:35 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 16:37:37 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='/data/code/Qwen2.5-Coder-32B-Instruct', speculative_config=SpeculativeConfig(method='ngram', model=None, num_spec_tokens=300), tokenizer='/data/code/Qwen2.5-Coder-32B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/data/code/Qwen2.5-Coder-32B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 04-20 16:37:37 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 192 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-20 16:37:37 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 10485760, 10, 'psm_89d32e6f'), local_subscribe_addr='ipc:///tmp/bdbae752-653a-4063-a590-241a8c0e9e0c', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-20 16:37:39 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-20 16:37:41 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7feb4e3b8530>
(VllmWorker rank=0 pid=2396) INFO 04-20 16:37:41 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_ba1d3739'), local_subscribe_addr='ipc:///tmp/5a1c9e61-4fef-40c8-9034-6363b9345c95', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-20 16:37:44 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-20 16:37:46 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ff5b76523f0>
(VllmWorker rank=1 pid=2413) INFO 04-20 16:37:46 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1896b6d4'), local_subscribe_addr='ipc:///tmp/39f57a7d-ebde-4c20-a188-5687c77310e6', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-20 16:37:49 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-20 16:37:51 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f227f73deb0>
(VllmWorker rank=2 pid=2432) INFO 04-20 16:37:51 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_d7d6af87'), local_subscribe_addr='ipc:///tmp/7c97333a-e12c-475b-8c46-f9864c5ed8dc', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-20 16:37:53 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-20 16:37:56 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f6e57c035f0>
(VllmWorker rank=3 pid=2456) INFO 04-20 16:37:56 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_672cbaee'), local_subscribe_addr='ipc:///tmp/baa2a9ab-eb5a-45b2-8fe8-0a6dede602e4', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=2396) INFO 04-20 16:37:56 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=2413) INFO 04-20 16:37:56 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=2432) INFO 04-20 16:37:56 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=2456) INFO 04-20 16:37:56 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=2396) INFO 04-20 16:37:56 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=2413) INFO 04-20 16:37:56 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=2 pid=2432) INFO 04-20 16:37:56 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=3 pid=2456) INFO 04-20 16:37:56 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=3 pid=2456) WARNING 04-20 16:37:56 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=1 pid=2413) WARNING 04-20 16:37:56 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=2396) WARNING 04-20 16:37:56 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=2 pid=2432) WARNING 04-20 16:37:56 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=2396) INFO 04-20 16:37:56 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_1b8ac1ef'), local_subscribe_addr='ipc:///tmp/b8dd842e-3ced-4003-9ec6-4025c9c8b787', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=2456) INFO 04-20 16:37:56 [parallel_state.py:959] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3
(VllmWorker rank=0 pid=2396) INFO 04-20 16:37:56 [parallel_state.py:959] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=2 pid=2432) INFO 04-20 16:37:56 [parallel_state.py:959] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2
(VllmWorker rank=1 pid=2413) INFO 04-20 16:37:56 [parallel_state.py:959] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=3 pid=2456) INFO 04-20 16:37:56 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=2396) INFO 04-20 16:37:56 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=2413) INFO 04-20 16:37:56 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=2 pid=2432) INFO 04-20 16:37:56 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=2413) INFO 04-20 16:37:57 [gpu_model_runner.py:1276] Starting to load model /data/code/Qwen2.5-Coder-32B-Instruct...
(VllmWorker rank=3 pid=2456) INFO 04-20 16:37:57 [gpu_model_runner.py:1276] Starting to load model /data/code/Qwen2.5-Coder-32B-Instruct...
(VllmWorker rank=0 pid=2396) INFO 04-20 16:37:57 [gpu_model_runner.py:1276] Starting to load model /data/code/Qwen2.5-Coder-32B-Instruct...
(VllmWorker rank=2 pid=2432) INFO 04-20 16:37:57 [gpu_model_runner.py:1276] Starting to load model /data/code/Qwen2.5-Coder-32B-Instruct...
(VllmWorker rank=1 pid=2413) INFO 04-20 16:37:57 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=3 pid=2456) INFO 04-20 16:37:57 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=0 pid=2396) INFO 04-20 16:37:57 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=2 pid=2432) INFO 04-20 16:37:57 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:00<00:04, 2.81it/s]
Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:00<00:05, 2.24it/s]
Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:01<00:05, 2.19it/s]
Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:01<00:04, 2.20it/s]
Loading safetensors checkpoint shards: 36% Completed | 5/14 [00:02<00:04, 2.18it/s]
Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:02<00:03, 2.14it/s]
Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:03<00:03, 2.15it/s]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:03<00:02, 2.22it/s]
Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:04<00:02, 2.16it/s]
Loading safetensors checkpoint shards: 71% Completed | 10/14 [00:04<00:01, 2.11it/s]
Loading safetensors checkpoint shards: 79% Completed | 11/14 [00:05<00:01, 2.02it/s]
(VllmWorker rank=3 pid=2456) INFO 04-20 16:38:02 [loader.py:458] Loading weights took 5.45 seconds
(VllmWorker rank=3 pid=2456) INFO 04-20 16:38:02 [gpu_model_runner.py:1287] Loading drafter model...
Loading safetensors checkpoint shards: 86% Completed | 12/14 [00:05<00:01, 1.93it/s]
(VllmWorker rank=3 pid=2456) INFO 04-20 16:38:03 [gpu_model_runner.py:1291] Model loading took 15.4132 GiB and 5.634231 seconds
(VllmWorker rank=1 pid=2413) INFO 04-20 16:38:03 [loader.py:458] Loading weights took 6.02 seconds
(VllmWorker rank=1 pid=2413) INFO 04-20 16:38:03 [gpu_model_runner.py:1287] Loading drafter model...
(VllmWorker rank=2 pid=2432) INFO 04-20 16:38:03 [loader.py:458] Loading weights took 6.07 seconds
(VllmWorker rank=2 pid=2432) INFO 04-20 16:38:03 [gpu_model_runner.py:1287] Loading drafter model...
Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:06<00:00, 1.95it/s]
(VllmWorker rank=1 pid=2413) INFO 04-20 16:38:03 [gpu_model_runner.py:1291] Model loading took 15.4132 GiB and 6.194564 seconds
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:06<00:00, 2.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:06<00:00, 2.18it/s]
(VllmWorker rank=0 pid=2396)
(VllmWorker rank=2 pid=2432) INFO 04-20 16:38:03 [gpu_model_runner.py:1291] Model loading took 15.4132 GiB and 6.251333 seconds
(VllmWorker rank=0 pid=2396) INFO 04-20 16:38:03 [loader.py:458] Loading weights took 6.44 seconds
(VllmWorker rank=0 pid=2396) INFO 04-20 16:38:03 [gpu_model_runner.py:1287] Loading drafter model...
(VllmWorker rank=0 pid=2396) INFO 04-20 16:38:04 [gpu_model_runner.py:1291] Model loading took 15.4132 GiB and 6.620071 seconds
INFO 04-20 16:38:34 [kv_cache_utils.py:634] GPU KV cache size: 163,456 tokens
INFO 04-20 16:38:34 [kv_cache_utils.py:637] Maximum concurrency for 131,072 tokens per request: 1.25x
INFO 04-20 16:38:34 [kv_cache_utils.py:634] GPU KV cache size: 163,328 tokens
INFO 04-20 16:38:34 [kv_cache_utils.py:637] Maximum concurrency for 131,072 tokens per request: 1.25x
INFO 04-20 16:38:34 [kv_cache_utils.py:634] GPU KV cache size: 163,328 tokens
INFO 04-20 16:38:34 [kv_cache_utils.py:637] Maximum concurrency for 131,072 tokens per request: 1.25x
INFO 04-20 16:38:34 [kv_cache_utils.py:634] GPU KV cache size: 163,968 tokens
INFO 04-20 16:38:34 [kv_cache_utils.py:637] Maximum concurrency for 131,072 tokens per request: 1.25x
INFO 04-20 16:38:35 [core.py:163] init engine (profile, create kv cache, warmup model) took 30.65 seconds
INFO 04-20 16:38:35 [core_client.py:435] Core engine process 0 ready.
WARNING 04-20 16:38:35 [config.py:1177] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 04-20 16:38:35 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 04-20 16:38:35 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 04-20 16:38:35 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8080
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.