Skip to content

[Bug]: Runtime error when running MLA models with "prefix caching enabled" and "chunked prefill disabled" #14069

@ApostaC

Description

@ApostaC

Your current environment

The output of `python collect_env.py`
INFO 03-01 00:48:13 [__init__.py:207] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 550.127.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               240
On-line CPU(s) list:                  0-239
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7J13 64-Core Processor
CPU family:                           25
Model:                                1
Thread(s) per core:                   1
Core(s) per socket:                   120
Socket(s):                            2
Stepping:                             1
BogoMIPS:                             4899.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities
Virtualization:                       AMD-V
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            15 MiB (240 instances)
L1i cache:                            15 MiB (240 instances)
L2 cache:                             120 MiB (240 instances)
L3 cache:                             3.8 GiB (240 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-119
NUMA node1 CPU(s):                    120-239
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1
[pip3] torchac_cuda==0.2.5
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.49.0
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.7.4.dev160+g28943d36
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PHB     0-239   0-1             N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PHB     0-239   0-1             N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PHB     0-239   0-1             N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PHB     0-239   0-1             N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    PHB     0-239   0-1             N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    PHB     0-239   0-1             N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    PHB     0-239   0-1             N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      PHB     0-239   0-1             N/A
NIC0    PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

LD_LIBRARY_PATH=/root/vllm/lib/python3.10/site-packages/cv2/../../lib64:
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY


🐛 Describe the bug

TL;DR: This bug will happen when running MLA models (i.e., deepseek R1) with

  • enable prefixing caching
  • disable chunked prefill

Detailed explanation:

In #12639 , a few new variables related to chunked prefill and a new function called _compute_prefill_context are introduced in class MLACommonBackend.
The _compute_prefill_context function will be called when prefix caching is enabled. It uses the new variables related to chunked prefill (e.g., context_chunk_cu_seq_lens, context_chunk_starts, context_chunk_seq_tot, and context_chunk_max_seq_lens).
However, if chunked prefill is not enabled, those variables will be None and cause the above assertion error.

Example script:

vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --dtype float16 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --enforce-eager \
    --disable-log-requests \
    --enable-chunked-prefill false \
    --enable-prefix-caching

To reproduce the bug, send two requests to the serving engine with the same prefix.

Error logs:

Key part:

ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1568, in forward
ERROR 03-01 00:31:53 [engine.py:141]     output[:num_prefill_tokens] = self._forward_prefill(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1462, in _forward_prefill
ERROR 03-01 00:31:53 [engine.py:141]     context_output, context_lse = self._compute_prefill_context( \
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1280, in _compute_prefill_context
ERROR 03-01 00:31:53 [engine.py:141]     assert prefill_metadata.context_chunk_seq_tot is not None
ERROR 03-01 00:31:53 [engine.py:141] AssertionError
**Full error log** (just showing one of the workers, the remaining workers are the same):
ERROR 03-01 00:31:53 [engine.py:141] AssertionError()
ERROR 03-01 00:31:53 [engine.py:141] Traceback (most recent call last):
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 139, in start
ERROR 03-01 00:31:53 [engine.py:141]     self.run_engine_loop()
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 202, in run_engine_loop
ERROR 03-01 00:31:53 [engine.py:141]     request_outputs = self.engine_step()
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 228, in engine_step
ERROR 03-01 00:31:53 [engine.py:141]     raise e
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 211, in engine_step
ERROR 03-01 00:31:53 [engine.py:141]     return self.engine.step()
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1401, in step
ERROR 03-01 00:31:53 [engine.py:141]     outputs = self.model_executor.execute_model(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 284, in execute_model
ERROR 03-01 00:31:53 [engine.py:141]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/executor/mp_distributed_executor.py", line 144, in _driver_execute_model
ERROR 03-01 00:31:53 [engine.py:141]     return self.driver_worker.execute_model(execute_model_req)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 03-01 00:31:53 [engine.py:141]     output = self.model_runner.execute_model(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-01 00:31:53 [engine.py:141]     return func(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1742, in execute_model
ERROR 03-01 00:31:53 [engine.py:141]     hidden_or_intermediate_states = model_executable(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 669, in forward
ERROR 03-01 00:31:53 [engine.py:141]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 03-01 00:31:53 [engine.py:141]     return self.forward(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 626, in forward
ERROR 03-01 00:31:53 [engine.py:141]     hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 548, in forward
ERROR 03-01 00:31:53 [engine.py:141]     hidden_states = self.self_attn(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 467, in forward
ERROR 03-01 00:31:53 [engine.py:141]     return self.mla_attn(hidden_states_or_q_c,
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/layer.py", line 223, in forward
ERROR 03-01 00:31:53 [engine.py:141]     return torch.ops.vllm.unified_attention(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
ERROR 03-01 00:31:53 [engine.py:141]     return self._op(*args, **(kwargs or {}))
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/layer.py", line 329, in unified_attention
ERROR 03-01 00:31:53 [engine.py:141]     return self.impl.forward(self, query, key, value, kv_cache, attn_metadata)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1568, in forward
ERROR 03-01 00:31:53 [engine.py:141]     output[:num_prefill_tokens] = self._forward_prefill(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1462, in _forward_prefill
ERROR 03-01 00:31:53 [engine.py:141]     context_output, context_lse = self._compute_prefill_context( \
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1280, in _compute_prefill_context
ERROR 03-01 00:31:53 [engine.py:141]     assert prefill_metadata.context_chunk_seq_tot is not None
ERROR 03-01 00:31:53 [engine.py:141] AssertionError
ERROR 03-01 00:31:53 [serving_chat.py:665] Error in chat completion stream generator.
ERROR 03-01 00:31:53 [serving_chat.py:665] Traceback (most recent call last):
ERROR 03-01 00:31:53 [serving_chat.py:665]   File "/root/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 363, in chat_completion_stream_generator
ERROR 03-01 00:31:53 [serving_chat.py:665]     async for res in result_generator:
ERROR 03-01 00:31:53 [serving_chat.py:665]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 659, in _process_request
ERROR 03-01 00:31:53 [serving_chat.py:665]     raise request_output
ERROR 03-01 00:31:53 [serving_chat.py:665] vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: AssertionError().

@LucasWilkinson @pathorn @simon-mo @tlrmchlsmth PTAL 🙏

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingstaleOver 90 days of inactivity

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions