
[V1] Implement Cascade Attention #11635

Merged
merged 23 commits on Jan 1, 2025
Minor
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
WoosukKwon committed Dec 30, 2024
commit 4faac41e4f61b1e3292b71c75d1fd5fad2ab4668
vllm/v1/attention/backends/flash_attn.py: 2 changes (1 addition, 1 deletion)
@@ -65,7 +65,7 @@ class FlashAttentionMetadata:
     block_table: torch.Tensor
     slot_mapping: torch.Tensor
 
-    # For cascade inference.
+    # For cascade attention.
     use_cascade: bool
     common_prefix_len: int
     cu_prefix_query_lens: Optional[torch.Tensor]
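
For context, the cu_prefix_* fields describe how cascade attention splits the batch's KV cache into one shared prefix plus per-request suffixes. The sketch below is not vLLM's actual metadata builder; build_cascade_lens, query_lens, and kv_lens are illustrative names. It shows one plausible way such cumulative-length tensors could be assembled for a batch whose requests share a common prefix:

import torch

def build_cascade_lens(query_lens, kv_lens, common_prefix_len):
    # One "virtual" sequence covering all query tokens attends to the shared prefix.
    cu_prefix_query_lens = torch.tensor([0, sum(query_lens)], dtype=torch.int32)
    cu_prefix_kv_lens = torch.tensor([0, common_prefix_len], dtype=torch.int32)
    # Each request then attends only to its own suffix: the KV entries past the prefix.
    suffix_kv_lens = [kv_len - common_prefix_len for kv_len in kv_lens]
    cu_suffix_kv_lens = torch.zeros(len(kv_lens) + 1, dtype=torch.int32)
    cu_suffix_kv_lens[1:] = torch.cumsum(
        torch.tensor(suffix_kv_lens, dtype=torch.int32), dim=0)
    return cu_prefix_query_lens, cu_prefix_kv_lens, cu_suffix_kv_lens

# Example: two decode requests (1 query token each) with 4 and 6 KV tokens
# sharing a 3-token prefix.
print(build_cascade_lens(query_lens=[1, 1], kv_lens=[4, 6], common_prefix_len=3))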
vllm/v1/core/kv_cache_manager.py: 2 changes (2 additions, 0 deletions)
@@ -279,6 +279,8 @@ def get_num_common_prefix_blocks(
         blocks = self.req_to_blocks[request.request_id]
         num_common_blocks = 0
         for block in blocks:
+            # FIXME(woosuk): For some reason, sometimes the ref_cnt is greater
+            # than the number of running requests. DEBUG this.
             if block.ref_cnt >= num_requests:
                 num_common_blocks += 1
             else:
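
To make the scan above concrete, here is a small self-contained sketch of the same common-prefix counting logic. The Block dataclass and count_common_prefix_blocks name are illustrative stand-ins, not vLLM's actual KVCacheBlock API: blocks are walked in allocation order, and counting stops at the first block whose reference count shows it is not shared by every running request.

from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    ref_cnt: int  # number of running requests currently referencing this block

def count_common_prefix_blocks(blocks: List[Block], num_running_requests: int) -> int:
    num_common_blocks = 0
    for block in blocks:
        # A block referenced by all running requests belongs to the shared prefix;
        # the first block that is not fully shared ends the common prefix.
        if block.ref_cnt >= num_running_requests:
            num_common_blocks += 1
        else:
            break
    return num_common_blocks

# Example: with 3 running requests, only the first two blocks are fully shared.
blocks = [Block(ref_cnt=3), Block(ref_cnt=3), Block(ref_cnt=1)]
assert count_common_prefix_blocks(blocks, num_running_requests=3) == 2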