[Bug]: Runtime error when running MLA models with "prefix caching enabled" and "chunked prefill disabled"

### Your current environment

<details>
<summary>The output of `python collect_env.py`</summary>

```text
INFO 03-01 00:48:13 [__init__.py:207] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 550.127.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               240
On-line CPU(s) list:                  0-239
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7J13 64-Core Processor
CPU family:                           25
Model:                                1
Thread(s) per core:                   1
Core(s) per socket:                   120
Socket(s):                            2
Stepping:                             1
BogoMIPS:                             4899.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities
Virtualization:                       AMD-V
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            15 MiB (240 instances)
L1i cache:                            15 MiB (240 instances)
L2 cache:                             120 MiB (240 instances)
L3 cache:                             3.8 GiB (240 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-119
NUMA node1 CPU(s):                    120-239
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1
[pip3] torchac_cuda==0.2.5
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.49.0
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.7.4.dev160+g28943d36
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PHB     0-239   0-1             N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PHB     0-239   0-1             N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PHB     0-239   0-1             N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PHB     0-239   0-1             N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    PHB     0-239   0-1             N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    PHB     0-239   0-1             N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    PHB     0-239   0-1             N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      PHB     0-239   0-1             N/A
NIC0    PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

LD_LIBRARY_PATH=/root/vllm/lib/python3.10/site-packages/cv2/../../lib64:
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY


```

</details>


### 🐛 Describe the bug

**TL;DR:** This bug will happen when running MLA models (i.e., deepseek R1) with 
- enable prefixing caching
- disable chunked prefill


### Detailed explanation:

In #12639 , a few new variables related to chunked prefill and a new function called `_compute_prefill_context` are introduced in class `MLACommonBackend`. 
The `_compute_prefill_context` function will be called when prefix caching is enabled. It uses the new variables related to chunked prefill (e.g., `context_chunk_cu_seq_lens`, `context_chunk_starts`, `context_chunk_seq_tot`, and `context_chunk_max_seq_lens`).
However, if chunked prefill is **not** enabled, those variables will be None and cause the above assertion error.

### Example script:
```bash
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --dtype float16 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --enforce-eager \
    --disable-log-requests \
    --enable-chunked-prefill false \
    --enable-prefix-caching
```

To reproduce the bug, send two requests to the serving engine with the same prefix.

### Error logs:

**Key part:**
```plaintext
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1568, in forward
ERROR 03-01 00:31:53 [engine.py:141]     output[:num_prefill_tokens] = self._forward_prefill(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1462, in _forward_prefill
ERROR 03-01 00:31:53 [engine.py:141]     context_output, context_lse = self._compute_prefill_context( \
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1280, in _compute_prefill_context
ERROR 03-01 00:31:53 [engine.py:141]     assert prefill_metadata.context_chunk_seq_tot is not None
ERROR 03-01 00:31:53 [engine.py:141] AssertionError
```

<details>
<summary>**Full error log** (just showing one of the workers, the remaining workers are the same):</summary>

```plaintext
ERROR 03-01 00:31:53 [engine.py:141] AssertionError()
ERROR 03-01 00:31:53 [engine.py:141] Traceback (most recent call last):
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 139, in start
ERROR 03-01 00:31:53 [engine.py:141]     self.run_engine_loop()
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 202, in run_engine_loop
ERROR 03-01 00:31:53 [engine.py:141]     request_outputs = self.engine_step()
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 228, in engine_step
ERROR 03-01 00:31:53 [engine.py:141]     raise e
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 211, in engine_step
ERROR 03-01 00:31:53 [engine.py:141]     return self.engine.step()
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1401, in step
ERROR 03-01 00:31:53 [engine.py:141]     outputs = self.model_executor.execute_model(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 284, in execute_model
ERROR 03-01 00:31:53 [engine.py:141]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/executor/mp_distributed_executor.py", line 144, in _driver_execute_model
ERROR 03-01 00:31:53 [engine.py:141]     return self.driver_worker.execute_model(execute_model_req)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 03-01 00:31:53 [engine.py:141]     output = self.model_runner.execute_model(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-01 00:31:53 [engine.py:141]     return func(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1742, in execute_model
ERROR 03-01 00:31:53 [engine.py:141]     hidden_or_intermediate_states = model_executable(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 669, in forward
ERROR 03-01 00:31:53 [engine.py:141]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 03-01 00:31:53 [engine.py:141]     return self.forward(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 626, in forward
ERROR 03-01 00:31:53 [engine.py:141]     hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 548, in forward
ERROR 03-01 00:31:53 [engine.py:141]     hidden_states = self.self_attn(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 467, in forward
ERROR 03-01 00:31:53 [engine.py:141]     return self.mla_attn(hidden_states_or_q_c,
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return self._call_impl(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 03-01 00:31:53 [engine.py:141]     return forward_call(*args, **kwargs)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/layer.py", line 223, in forward
ERROR 03-01 00:31:53 [engine.py:141]     return torch.ops.vllm.unified_attention(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
ERROR 03-01 00:31:53 [engine.py:141]     return self._op(*args, **(kwargs or {}))
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/layer.py", line 329, in unified_attention
ERROR 03-01 00:31:53 [engine.py:141]     return self.impl.forward(self, query, key, value, kv_cache, attn_metadata)
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1568, in forward
ERROR 03-01 00:31:53 [engine.py:141]     output[:num_prefill_tokens] = self._forward_prefill(
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1462, in _forward_prefill
ERROR 03-01 00:31:53 [engine.py:141]     context_output, context_lse = self._compute_prefill_context( \
ERROR 03-01 00:31:53 [engine.py:141]   File "/root/vllm/lib/python3.10/site-packages/vllm/attention/backends/mla/common.py", line 1280, in _compute_prefill_context
ERROR 03-01 00:31:53 [engine.py:141]     assert prefill_metadata.context_chunk_seq_tot is not None
ERROR 03-01 00:31:53 [engine.py:141] AssertionError
ERROR 03-01 00:31:53 [serving_chat.py:665] Error in chat completion stream generator.
ERROR 03-01 00:31:53 [serving_chat.py:665] Traceback (most recent call last):
ERROR 03-01 00:31:53 [serving_chat.py:665]   File "/root/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 363, in chat_completion_stream_generator
ERROR 03-01 00:31:53 [serving_chat.py:665]     async for res in result_generator:
ERROR 03-01 00:31:53 [serving_chat.py:665]   File "/root/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 659, in _process_request
ERROR 03-01 00:31:53 [serving_chat.py:665]     raise request_output
ERROR 03-01 00:31:53 [serving_chat.py:665] vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: AssertionError().
```

</details>


@LucasWilkinson @pathorn @simon-mo @tlrmchlsmth PTAL 🙏

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bug]: Runtime error when running MLA models with "prefix caching enabled" and "chunked prefill disabled" #14069

Your current environment

🐛 Describe the bug

Detailed explanation:

Example script:

Error logs:

Before submitting a new issue...

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: Runtime error when running MLA models with "prefix caching enabled" and "chunked prefill disabled" #14069

Description

Your current environment

🐛 Describe the bug

Detailed explanation:

Example script:

Error logs:

Before submitting a new issue...

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions