Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support #11844
Conversation
I see that you have
This pull request has merge conflicts that must be resolved before it can be merged.
All conflicts fixed, could you please take another look? Thanks!
st] = decode_metadata.block_tables[i, st:ed]
decode_metadata.block_tables_intra = block_tables_intra
seq_lens_succ = (chunk_num_curr -
When I try the needle-in-a-haystack test with Qwen-7B and Llama-8B (code modified to support Llama), there is a bug that produces a negative number once the context goes beyond roughly 13k-15k tokens.
I modified the code as below and confirmed that it works:
seq_lens_succ = ((chunk_num_curr - (chunk_num_curr - 1).clip(min=0)) * chunk_len)
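For context, a toy sketch of what the proposed .clip(min=0) does (chunk length and counter values here are hypothetical, not taken from the test): it keeps the intermediate (chunk_num_curr - 1) term non-negative for sequences that have not yet filled a full chunk.

import torch

chunk_len = 8192                           # hypothetical chunk length
chunk_num_curr = torch.tensor([0, 1, 3])   # hypothetical per-sequence chunk counters

without_clip = chunk_num_curr - 1                # tensor([-1, 0, 2]) -> negative term
with_clip = (chunk_num_curr - 1).clip(min=0)     # tensor([ 0, 0, 2])

seq_lens_succ = (chunk_num_curr - with_clip) * chunk_len
print(seq_lens_succ)                             # tensor([   0, 8192, 8192])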
This pull request has merge conflicts that must be resolved before it can be merged.
I tested it because I thought it was fixed, but I still have the same problem as below.
Dual chunk attention doesn't support CUDA graph, and I have added an assertion in
It is indeed a bug introduced while preparing this PR; fixed. Thanks!
Rebased against main. Hi @youkaichao @simon-mo @WoosukKwon, do you folks think there are still things that need to be improved in this pull request? Thanks!
Spotted a few bits of commented-out code that look like debug cruft or are otherwise mysterious. Could you clean those up, and any other similar spots?
This pull request has merge conflicts that must be resolved before it can be merged.
qc_freqs = torch.einsum("i,j -> ij", qc_t, inv_freq)
k_freqs = torch.einsum("i,j -> ij", k_t, inv_freq)
qc_no_clamp_freqs = torch.einsum("i,j -> ij", qc_no_clamp_t, inv_freq)
q_inter_freqs = torch.einsum("i,j -> ij", q_inter_t, inv_freq)
nit: I think these einsums are still slower on CUDA than (a * b).sum(-1); not on the hot path though, so not critical.
Ran bench_einsum.py from that issue on an H100 and got:

python einsum_bench.py
                                  |  mul/sum  |  torch.einsum  |  numpy.einsum
1 threads: -------------------------------------------------------------------
  Nc,Nc->N cpu  (1048576, 2)      |    5000   |      3100      |     4000
  Nc,Nc->N cuda (1048576, 2)      |      20   |       747      |     3300

Times are in microseconds (us).
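For reference, the "i,j -> ij" pattern in the snippet above is just an outer product, so it can also be written as a broadcasted multiply; a minimal sketch (shapes are illustrative, not the rotary embedding's real sizes):

import torch

t = torch.arange(4096, dtype=torch.float32)   # illustrative positions
inv_freq = torch.rand(64)                     # illustrative inverse frequencies

freqs_einsum = torch.einsum("i,j -> ij", t, inv_freq)
freqs_broadcast = t[:, None] * inv_freq[None, :]   # same result, no einsum dispatch

assert torch.allclose(freqs_einsum, freqs_broadcast)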
vllm/attention/layer.py
Outdated
logits_soft_cap, attn_type, **{
    "dual_chunk_attention_config": dual_chunk_attention_config,
    "prefix": prefix,
} if dual_chunk_attention_config is not None else {})
I feel like this is messy; I think we should maybe do something like:
def __init__(..., **extra_attn_kwargs):
    self.impl = impl_cls(..., **extra_attn_kwargs)
The challenge here is that prefix would not be captured by extra_attn_kwargs but is only (currently) used by DualChunkFlashAttentionImpl. I do think it would be less messy to do this and make prefix a standard arg for attention impls, given that it is pretty generic. Thoughts @WoosukKwon?
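A self-contained toy of that shape (class and argument names here are illustrative, not the actual vLLM signatures):

from typing import Optional

class DummyImpl:
    """Stand-in for a backend impl such as DualChunkFlashAttentionImpl."""

    def __init__(self, num_heads: int, prefix: str = "",
                 dual_chunk_attention_config: Optional[dict] = None):
        self.num_heads = num_heads
        self.prefix = prefix
        self.dual_chunk_attention_config = dual_chunk_attention_config

class DummyAttention:
    """Treats prefix as a standard arg and forwards backend-specific options
    untouched via **extra_impl_args, instead of building a conditional kwargs
    dict at the call site."""

    def __init__(self, num_heads: int, prefix: str = "", **extra_impl_args):
        self.impl = DummyImpl(num_heads, prefix=prefix, **extra_impl_args)

attn = DummyAttention(
    8,
    prefix="model.layers.0.self_attn",
    dual_chunk_attention_config={"chunk_size": 8192},
)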
vllm/attention/layer.py
Outdated
if self.dual_chunk_attention_config:
    assert query_succ_and_inter is not None
    dca_kwargs = {
        "query_succ": query_succ_and_inter[0],
        "query_inter": query_succ_and_inter[1],
        "query_succ_critical": query_succ_and_inter[2],
        "query_inter_critical": query_succ_and_inter[3],
    } if query_succ_and_inter else {}
else:
    dca_kwargs = {}
I think we should try hard to see if there is a cleaner way of passing these; maybe they can be bundled into a single q tensor that gets reinterpreted into its components via a combination of slicing and .view calls in the attn impl?
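A sketch of that idea (shapes and names are illustrative): pack the four query variants into one tensor at the layer boundary and recover the components with a view/unbind inside the impl.

import torch

num_tokens, num_heads, head_dim = 16, 8, 128

# the four query variants currently passed via dca_kwargs
query_succ = torch.randn(num_tokens, num_heads, head_dim)
query_inter = torch.randn(num_tokens, num_heads, head_dim)
query_succ_critical = torch.randn(num_tokens, num_heads, head_dim)
query_inter_critical = torch.randn(num_tokens, num_heads, head_dim)

# pack along the last dim so a single tensor crosses the Attention interface
packed = torch.cat(
    [query_succ, query_inter, query_succ_critical, query_inter_critical],
    dim=-1)

# inside the attn impl: reinterpret the packed tensor back into its components
succ, inter, succ_crit, inter_crit = packed.view(
    num_tokens, num_heads, 4, head_dim).unbind(dim=2)
assert torch.equal(succ, query_succ)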
I will give it a try and see if it can be simplified.
So if I understand correctly, Qwen2.5-1M now actually uses the correct attention mechanism, so VRAM usage should be lower and prompt processing faster, right?
I tested Qwen/Qwen2.5-7B-Instruct-1M using the DualChunkFlashAttention backend.
ubuntu-vllm-openai-1 | INFO 05-31 19:13:07 [logger.py:42] Received request cmpl-77d91882816c4f748e2023c93449f62d-0: prompt: 'Once upon a time', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=1, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [12522, 5193, 264, 882], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
Exact same issue as above.
PR #19084 fixes this issue. When working with contexts of 70k, with the model loaded plus the context it uses something like 30 GB of VRAM, but during inference it goes up to 35-37 GB and then back down to 30 GB. I'm guessing that's expected, but is there some kind of way to preallocate this memory? Because if you let vLLM allocate 80% of the VRAM and it then tries to "eat" more, it will obviously OOM. Edit:
The qk estimate softmax has high memory overhead: https://github.com/vllm-project/vllm/blob/main/vllm/attention/backends/dual_chunk_flash_attn.py#L834. During start-up profiling, DCA specifically routes to flash-attention instead of the DCA sparse prefill function. In principle, there's no reason to use flash-attention during profiling from what I can see, so having that branch call the sparse attention branch instead should at least surface the OOM during profiling.
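Until the profiling path accounts for that transient allocation, one blunt workaround (a sketch, not a fix; it only leaves headroom for the spike reported above) is to lower gpu_memory_utilization when creating the engine:

from vllm import LLM

# leave spare VRAM for the transient qk-estimate buffers instead of letting
# the KV cache claim it; 0.75 and max_model_len are arbitrary illustrative values
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-1M",
    max_model_len=131072,
    gpu_memory_utilization=0.75,
)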
Not a "blog" but it can help people working with it, so far we got much better results with Qwen 2.5 7b 1m than with nemotron 4M from nvidia. However beside the issues states before with quantization and gpu splitting, we did not manage either to do batching/parallel processing |
Quantization support has been added in #19420. Could not test KV cache quantization because this attention mechanism is based on flash attention.
Thanks for reporting @ExtReMLapin @exceedzhang. Will investigate this week.
It’s already fixed and a PR has been merged.
@sighingnow @exceedzhang thank you for your contributions; it's mostly these PRs that need a review: #19084 (priority, crash fix) and #19420 (FP8 quantization support).
Thank you for your development work; I've tested it, and the feature functions correctly. However, I've noticed a performance drop after enabling FP8 quantization. Here are the performance test results using four RTX 4090 24GB GPUs.
I agree with you; we expect better performance with FP8 because of the lower memory bottleneck. I also have another update waiting under the hood on this branch which should improve performance (packed torch operations). Considering the slowdowns... isn't that the fault of the flash attention implementation, given the very small changes I made?
@ExtReMLapin [image attachment]
Got it, not merging this performance branch into the FP8 branch then; it's not worth the risk of breaking something! Again, at the office we really appreciate the effort spent on releasing those models. We ran a lot of tests, including other models claiming to have long context support, but this is the ONLY model that actually follows instructions on very long contexts and can be run easily (without insane resources). Looking forward to seeing more models like this in the future!
Well, I'm not sure exactly what happened, but reading my PR code again and again, it should only affect KV cache quantization and not model quantization. Now, checking again without my PR, quantization seems to work without my changes, which makes this comment sound like I'm insane: #19084 (comment)
Ran more tests:
This PR implements dual-chunk flash attention, a training-free method to extend model context length (see also #6139), with sparse attention support (https://github.com/microsoft/MInference).
This PR requires the sparse attention kernel from vllm-flash-attention. Qwen models with 1M context length support will be open-sourced in the next one or two weeks, and unit tests will be added later.
FIX #12452
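For readers new to the method: dual chunk attention splits the key sequence into fixed-length chunks and decomposes causal attention into intra-chunk, successive-chunk, and inter-chunk components (which is why the backend carries separate query_succ/query_inter variants). A toy illustration of that partition, not the actual kernel:

import torch

seq_len, chunk_len = 12, 4  # toy sizes
pos = torch.arange(seq_len)
chunk_id = pos // chunk_len

q_chunk, k_chunk = chunk_id[:, None], chunk_id[None, :]
causal = pos[None, :] <= pos[:, None]

intra = causal & (k_chunk == q_chunk)      # keys in the query's own chunk
succ = causal & (k_chunk == q_chunk - 1)   # keys in the immediately preceding chunk
inter = causal & (k_chunk < q_chunk - 1)   # keys in all earlier chunks

# the three components tile the causal mask exactly, with no overlap
assert torch.equal(intra | succ | inter, causal)
assert not (intra & succ).any() and not (succ & inter).any()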