fix(v1/kv_cache): resolve async KV transfer bug in cascade attention #23485
Conversation
Code Review
This pull request correctly addresses a critical bug in cascade attention related to asynchronous KV transfer by replacing the unreliable ref_cnt-based logic with explicit tracking of running requests. The changes are well-contained and logically sound. My review includes one suggestion to optimize the performance of the new common prefix block calculation, which could be a bottleneck in scenarios with many concurrent requests.
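To make the contrast concrete, here is a minimal sketch of the two strategies; the names and data shapes are hypothetical, not the actual vLLM code:

```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    ref_cnt: int  # includes references held by pending async KV transfers

def common_prefix_by_ref_cnt(blocks: list[Block], num_running: int) -> int:
    """Old-style check: a prefix block is treated as common when its ref_cnt
    equals the number of running requests. A finished request whose async KV
    transfer is still in flight keeps a reference, so ref_cnt can reach
    num_running even when not every running request shares the block."""
    n = 0
    for block in blocks:
        if block.ref_cnt != num_running:
            break
        n += 1
    return n

def common_prefix_by_tracking(
    req_to_blocks: dict[str, list[Block]],
    running_request_ids: set[str],
) -> int:
    """New-style check: count only leading positions where every *running*
    request holds the same block; transfer-held references are ignored."""
    if not running_request_ids:
        return 0
    block_lists = [req_to_blocks[rid] for rid in running_request_ids]
    n = 0
    for i, block in enumerate(block_lists[0]):
        if all(len(bl) > i and bl[i] is block for bl in block_lists[1:]):
            n += 1
        else:
            break
    return n

# Demo: two running requests share blocks b0 and b1; a finished request's
# pending transfer keeps an extra reference alive on b0.
b0, b1 = Block(0, ref_cnt=3), Block(1, ref_cnt=2)
b2, b3 = Block(2, ref_cnt=1), Block(3, ref_cnt=1)
req_to_blocks = {"req-a": [b0, b1, b2], "req-b": [b0, b1, b3]}
print(common_prefix_by_ref_cnt(req_to_blocks["req-a"], 2))           # -> 0
print(common_prefix_by_tracking(req_to_blocks, {"req-a", "req-b"}))  # -> 2
```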
Force-pushed f3f7105 to 91e3694
Even after `block in self.req_to_blocks[req_id]` is fixed, I'm still concerned about performance when all requests share a very long prefix. The time complexity is num_requests × num_blocks_per_request. What about passing in the requests that are not running but are still in KV transfer?
Force-pushed 542d108 to 8adac43
Hi @heheda12345 @njhill, the time complexity of the new code is now O(R×B), where it was O(R×B²) in the previous iteration. I also have a caching-based implementation in mind that would bring the complexity down to O(1) in the best case and O(R×B) in the worst case, but it makes the code a little complex for this module, so I didn't want to push that version without approval. PTAL and let me know if this is fine or if we need to improve it further. Thanks!
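The caching-based variant was never pushed, so the following is purely a speculative sketch of what such memoization could look like (all names hypothetical):

```python
class CachedCommonPrefixCounter:
    """Hypothetical memoization: recompute the common-prefix count only when
    the set of running requests (or their block lists) changes; otherwise
    return the cached value in O(1)."""

    def __init__(self):
        self._cache_key = None
        self._cached_count = 0

    def count(self, req_to_block_ids: dict[str, list[int]],
              running_request_ids: set[str]) -> int:
        # Key on the running set and each request's block-list length, which
        # is enough to detect prefix growth in this toy model.
        key = tuple(sorted(
            (rid, len(req_to_block_ids[rid])) for rid in running_request_ids
        ))
        if key == self._cache_key:
            return self._cached_count  # O(1) best case
        # O(R x B) worst case: recompute from scratch.
        block_lists = [req_to_block_ids[rid] for rid in running_request_ids]
        count = 0
        if block_lists:
            for i, block_id in enumerate(min(block_lists, key=len)):
                if all(bl[i] == block_id for bl in block_lists):
                    count += 1
                else:
                    break
        self._cache_key, self._cached_count = key, count
        return count

counter = CachedCommonPrefixCounter()
reqs = {"a": [0, 1, 2], "b": [0, 1, 3]}
print(counter.count(reqs, {"a", "b"}))  # computed: 2
print(counter.count(reqs, {"a", "b"}))  # cached: 2
```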
    
My example code is O((num_transferring_requests + 1) × num_common_blocks). It should be much faster than num_running_requests × num_common_blocks for short requests.
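A sketch of what that suggestion could look like, assuming the block-id sets of the transferring requests are available (names hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    ref_cnt: int  # inflated by finished requests that are still transferring

def common_prefix_discounting_transfers(
    blocks: list[Block],                     # block list of one running request
    num_running: int,
    transferring_block_ids: list[set[int]],  # one id-set per transferring request
) -> int:
    """Walk the candidate common-prefix blocks once and, for each block,
    subtract the references held by transferring (no longer running) requests
    before comparing against the running-request count. With set lookups this
    is O((num_transferring + 1) * num_common_blocks)."""
    n = 0
    for block in blocks:
        held_by_transfers = sum(
            block.block_id in ids for ids in transferring_block_ids
        )
        if block.ref_cnt - held_by_transfers != num_running:
            break
        n += 1
    return n

# Usage: two requests are running; one transferring request still holds block 0.
blocks = [Block(0, ref_cnt=3), Block(1, ref_cnt=2), Block(2, ref_cnt=1)]
print(common_prefix_discounting_transfers(blocks, 2, [{0}]))  # -> 2
```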
    
Force-pushed 8adac43 to 4d368d4
Hi @heheda12345, I made the changes your way this time and have pushed them as well. Please take a look. Thanks!
Force-pushed 64ed09d to 0d66b57
Hi @heheda12345, just a gentle reminder to take a look and approve if everything looks right. Thanks!
@ayushsatyam146 Hi, can you help update this PR?
Hi @heheda12345, sorry, I got sick this week and couldn't work on this. I'm well now and will update it soon. Thanks for your patience.
Force-pushed 0d66b57 to b11852c
Force-pushed b11852c to 606c471
@heheda12345, I've tried to address all your concerns. Can you please take a look now? Thanks!
Force-pushed 4d031f2 to fd1c710
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed fd1c710 to bd0e85c
Hi @heheda12345, I resolved the merge conflicts and also included the changes you suggested. Please take a look. Thanks!
This implementation looks great!
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed bd0e85c to e105a65
Force-pushed e105a65 to 912667d
* Replace ref_cnt-based common prefix detection with running-request tracking
* Update get_num_common_prefix_blocks() to accept a running_request_ids set
* Fix FullAttentionManager to count actual references from running requests
* Prevent incorrect cascade attention when async KV offloading delays cleanup

This resolves a bug where completed requests with pending async transfers still contributed to ref_cnt, causing incorrect cascade attention decisions.

Signed-off-by: Ayush Satyam <ayushsatyam146@gmail.com>
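A condensed sketch of the shape those bullets describe; this is a simplified stand-in, not the actual FullAttentionManager source:

```python
class FullAttentionManagerSketch:
    """Simplified stand-in: maps each request id to its ordered block ids."""

    def __init__(self, req_to_block_ids: dict[str, list[int]]):
        self.req_to_block_ids = req_to_block_ids

    def get_num_common_prefix_blocks(self, running_request_ids: set[str]) -> int:
        # Count leading block positions shared by *every running* request;
        # blocks held only by completed-but-still-transferring requests no
        # longer influence the result.
        if not running_request_ids:
            return 0
        block_lists = [self.req_to_block_ids[rid] for rid in running_request_ids]
        reference = min(block_lists, key=len)
        num_common = 0
        for i, block_id in enumerate(reference):
            if all(bl[i] == block_id for bl in block_lists):
                num_common += 1
            else:
                break
        return num_common

# Example: req-a and req-b are running and share blocks 0 and 1; req-c has
# finished but its blocks are still alive for an async transfer, and the new
# check simply never consults them.
manager = FullAttentionManagerSketch({
    "req-a": [0, 1, 2],
    "req-b": [0, 1, 3],
    "req-c": [0, 1, 2],
})
assert manager.get_num_common_prefix_blocks({"req-a", "req-b"}) == 2
```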
Force-pushed 912667d to d3405fd
Hi @heheda12345, I went through this approach, and apart from some occasionally conservative handling of cascade attention, it looks good overall. I've implemented it as well; please take a look when you get a chance. Thanks!
LGTM! I think this solution is clean.
### What this PR does / why we need it?

This is step 1 of refactoring the code to adapt to vLLM main; this PR is aligned with vllm-project/vllm@17c540a:

1. Refactor deepseek to the latest code arch as of vllm-project/vllm@17c540a
2. Bunches of fixes due to vLLM changes:
   - Fix `AscendScheduler` `__post_init__`, caused by vllm-project/vllm#25075
   - Fix `AscendScheduler` init getting an unexpected arg `block_size`, caused by vllm-project/vllm#26296
   - Fix the `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by vllm-project/vllm#23485
   - Fix `MLAAttention` import, caused by vllm-project/vllm#25103
   - Fix `SharedFusedMoE` import, caused by vllm-project/vllm#26145
   - Fix `LazyLoader` import, caused by vllm-project/vllm#27022
   - Fix `vllm.utils.swap_dict_values` import, caused by vllm-project/vllm#26990
   - Fix `Backend` enum import, caused by vllm-project/vllm#25893
   - Fix the `CompilationLevel` to `CompilationMode` renaming introduced by vllm-project/vllm#26355
   - Fix fused_moe ops, caused by vllm-project/vllm#24097
   - Fix the bert model because of `inputs_embeds`, caused by vllm-project/vllm#25922
   - Fix MRope because of the `get_input_positions_tensor` to `get_mrope_input_positions` renaming, caused by vllm-project/vllm#24172
   - Fix `splitting_ops` changes introduced by vllm-project/vllm#25845
   - Fix multi-modality changes introduced by vllm-project/vllm#16229
   - Fix the lora bias dropping issue introduced by vllm-project/vllm#25807
   - Fix the structured output break introduced by vllm-project/vllm#26737

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

CI passed with existing tests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
Purpose
Solves #23130. This change fixes a critical bug in vLLM's cascade attention optimization in the V1 arch. The bug is in get_num_common_prefix_blocks(), which determines how many KV cache blocks are shared among all currently running requests in order to enable the cascade attention optimization.

Changes made
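The failure mode at the heart of the change, as a toy example with invented numbers:

```python
# Toy numbers, invented for illustration: 2 requests are running, and 1
# finished request still holds its prefix block while its KV cache is being
# asynchronously transferred out.
num_running_requests = 2
# ref_cnt on the first prefix block: 1 running request + 1 pending transfer.
prefix_block_ref_cnt = 2

# Old ref_cnt-based test: "is this block referenced by every running request?"
old_says_common = (prefix_block_ref_cnt == num_running_requests)

# Ground truth: only 1 of the 2 running requests actually shares the block.
num_running_refs = 1
truly_common = (num_running_refs == num_running_requests)

assert old_says_common and not truly_common  # cascade attention misfires
```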