Closed as not planned
Labels: bug (Something isn't working), stale (Over 90 days of inactivity)
Description
Your current environment
The output of `python collect_env.py`
INFO 03-01 00:48:13 [__init__.py:207] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB
Nvidia driver version: 550.127.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 240
On-line CPU(s) list: 0-239
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7J13 64-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 120
Socket(s): 2
Stepping: 1
BogoMIPS: 4899.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 15 MiB (240 instances)
L1i cache: 15 MiB (240 instances)
L2 cache: 120 MiB (240 instances)
L3 cache: 3.8 GiB (240 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-119
NUMA node1 CPU(s): 120-239
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1
[pip3] torchac_cuda==0.2.5
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.49.0
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.7.4.dev160+g28943d36
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PHB 0-239 0-1 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PHB 0-239 0-1 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 PHB 0-239 0-1 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 PHB 0-239 0-1 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 PHB 0-239 0-1 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 PHB 0-239 0-1 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 PHB 0-239 0-1 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X PHB 0-239 0-1 N/A
NIC0 PHB PHB PHB PHB PHB PHB PHB PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
LD_LIBRARY_PATH=/root/vllm/lib/python3.10/site-packages/cv2/../../lib64:
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
TL;DR: This bug occurs when running MLA models (e.g., DeepSeek R1) with:
- prefix caching enabled
- chunked prefill disabled
Detailed explanation:
In #12639, a few new variables related to chunked prefill and a new function, `_compute_prefill_context`, were introduced in the class `MLACommonBackend`.
The `_compute_prefill_context` function is called when prefix caching is enabled, and it uses the new chunked-prefill variables (e.g., `context_chunk_cu_seq_lens`, `context_chunk_starts`, `context_chunk_seq_tot`, and `context_chunk_max_seq_lens`).
However, if chunked prefill is not enabled, those variables remain None, which triggers the assertion error shown in the logs below.
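The failure mode can be illustrated with a small self-contained sketch (hypothetical code; the field and function names mirror the ones in the traceback, but this is not the actual vLLM implementation):

```python
# Hypothetical sketch of the failing code path (not the actual vLLM code).
# When chunked prefill is disabled, the chunk-related metadata fields are
# never populated, so they remain None.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrefillMetadata:
    # Populated only when chunked prefill is enabled (introduced in #12639).
    context_chunk_cu_seq_lens: Optional[list] = None
    context_chunk_starts: Optional[list] = None
    context_chunk_seq_tot: Optional[list] = None
    context_chunk_max_seq_lens: Optional[list] = None


def _compute_prefill_context(prefill_metadata: PrefillMetadata):
    # The assertion that fires in common.py: with chunked prefill disabled,
    # context_chunk_seq_tot was never filled in.
    assert prefill_metadata.context_chunk_seq_tot is not None


# Prefix caching enabled + chunked prefill disabled leaves the fields None:
try:
    _compute_prefill_context(PrefillMetadata())
except AssertionError:
    print("AssertionError: context_chunk_seq_tot is None")
```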
Example script:
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
--trust-remote-code \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.9 \
--dtype float16 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--enforce-eager \
--disable-log-requests \
--enable-chunked-prefill false \
--enable-prefix-caching
To reproduce the bug, send two requests to the serving engine with the same prefix.
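For instance, a hypothetical client like the following exercises that path: the second request hits the prefix cache and enters `_compute_prefill_context` (the endpoint, model name, and prompts here are illustrative, matching the serve command above):

```python
# Hypothetical reproduction client: two requests sharing a long common prefix.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A prefix long enough to span several cached KV blocks.
shared_prefix = "Lorem ipsum dolor sit amet. " * 200

for suffix in ("What is 1 + 1?", "What is 2 + 2?"):
    # The second request reuses the cached prefix and triggers the assertion.
    resp = client.chat.completions.create(
        model="cognitivecomputations/DeepSeek-R1-AWQ",
        messages=[{"role": "user", "content": shared_prefix + suffix}],
    )
    print(resp.choices[0].message.content)
```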
Error logs:
Key part:
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1568, in forward
ERROR 03-01 00:31:53 [engine.py:141] output[:num_prefill_tokens] = self._forward_prefill(
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1462, in _forward_prefill
ERROR 03-01 00:31:53 [engine.py:141] context_output, context_lse = self._compute_prefill_context( \
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1280, in _compute_prefill_context
ERROR 03-01 00:31:53 [engine.py:141] assert prefill_metadata.context_chunk_seq_tot is not None
ERROR 03-01 00:31:53 [engine.py:141] AssertionError
**Full error log** (showing just one of the workers; the remaining workers are identical):
ERROR 03-01 00:31:53 [engine.py:141] AssertionError()
ERROR 03-01 00:31:53 [engine.py:141] Traceback (most recent call last):
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 139, in start
ERROR 03-01 00:31:53 [engine.py:141] self.run_engine_loop()
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 202, in run_engine_loop
ERROR 03-01 00:31:53 [engine.py:141] request_outputs = self.engine_step()
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 228, in engine_step
ERROR 03-01 00:31:53 [engine.py:141] raise e
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 211, in engine_step
ERROR 03-01 00:31:53 [engine.py:141] return self.engine.step()
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1401, in step
ERROR 03-01 00:31:53 [engine.py:141] outputs = self.model_executor.execute_model(
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 284, in execute_model
ERROR 03-01 00:31:53 [engine.py:141] driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/executor/mp_distributed_executor.py", line 144, in _driver_execute_model
ERROR 03-01 00:31:53 [engine.py:141] return self.driver_worker.execute_model(execute_model_req)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 03-01 00:31:53 [engine.py:141] output = self.model_runner.execute_model(
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-01 00:31:53 [engine.py:141] return func(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1742, in execute_model
ERROR 03-01 00:31:53 [engine.py:141] hidden_or_intermediate_states = model_executable(
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141] return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141] return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 669, in forward
ERROR 03-01 00:31:53 [engine.py:141] hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 03-01 00:31:53 [engine.py:141] return self.forward(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 626, in forward
ERROR 03-01 00:31:53 [engine.py:141] hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141] return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141] return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 548, in forward
ERROR 03-01 00:31:53 [engine.py:141] hidden_states = self.self_attn(
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141] return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141] return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 467, in forward
ERROR 03-01 00:31:53 [engine.py:141] return self.mla_attn(hidden_states_or_q_c,
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141] return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141] return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/attention/layer.py", line 223, in forward
ERROR 03-01 00:31:53 [engine.py:141] return torch.ops.vllm.unified_attention(
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
ERROR 03-01 00:31:53 [engine.py:141] return self._op(*args, **(kwargs or {}))
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/attention/layer.py", line 329, in unified_attention
ERROR 03-01 00:31:53 [engine.py:141] return self.impl.forward(self, query, key, value, kv_cache, attn_metadata)
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1568, in forward
ERROR 03-01 00:31:53 [engine.py:141] output[:num_prefill_tokens] = self._forward_prefill(
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1462, in _forward_prefill
ERROR 03-01 00:31:53 [engine.py:141] context_output, context_lse = self._compute_prefill_context( \
ERROR 03-01 00:31:53 [engine.py:141] File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1280, in _compute_prefill_context
ERROR 03-01 00:31:53 [engine.py:141] assert prefill_metadata.context_chunk_seq_tot is not None
ERROR 03-01 00:31:53 [engine.py:141] AssertionError
ERROR 03-01 00:31:53 [serving_chat.py:665] Error in chat completion stream generator.
ERROR 03-01 00:31:53 [serving_chat.py:665] Traceback (most recent call last):
ERROR 03-01 00:31:53 [serving_chat.py:665] File "/root/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 363, in chat_completion_stream_generator
ERROR 03-01 00:31:53 [serving_chat.py:665] async for res in result_generator:
ERROR 03-01 00:31:53 [serving_chat.py:665] File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 659, in _process_request
ERROR 03-01 00:31:53 [serving_chat.py:665] raise request_output
ERROR 03-01 00:31:53 [serving_chat.py:665] vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: AssertionError().
@LucasWilkinson @pathorn @simon-mo @tlrmchlsmth PTAL 🙏
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.