Open
Labels
bug (Something isn't working)
Description
🐛 Describe the bug
KVCacheManager.get_num_common_prefix_blocks is used to identify the longest common prefix of the requests in the current batch (in practice, all RUNNING requests). The result is then used to decide whether to use cascade attention.
The logic currently assumes that each ref count on a block corresponds to a running request. However, with async KV offloading, ref counts can still be held after a request has completed and is no longer in the RUNNING state.
This means the logic could incorrectly identify a common prefix that isn't actually shared by all requests in the batch.
With forthcoming changes to the KVConnector API, there may be other situations in which connectors hold their own references to blocks.
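
To make the failure mode concrete, here is a minimal, self-contained sketch. It is a toy model, not vLLM's actual code: the `Block` class, the helper names, and the scenario are all hypothetical, and the check simply assumes that a block is "common" when its ref count equals the number of running requests. Requests A and B are running and share no blocks, while the blocks of A are also still referenced on behalf of a completed request C whose offload has not finished.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a KV cache block (not vLLM's class)."""
    block_id: int
    ref_cnt: int = 0


def num_common_prefix_blocks(blocks, num_running_requests):
    """Ref-count heuristic: count leading blocks whose ref count equals
    the number of running requests (assumed model of the current logic)."""
    count = 0
    for block in blocks:
        if block.ref_cnt != num_running_requests:
            break
        count += 1
    return count


def true_common_prefix_blocks(running):
    """Ground truth: leading positions where every running request
    holds the very same block object."""
    count = 0
    for column in zip(*running.values()):
        if any(block is not column[0] for block in column):
            break
        count += 1
    return count


def build_scenario():
    b0, b1, b2, b3 = (Block(i) for i in range(4))
    # Two running requests with disjoint blocks (no shared prefix).
    running = {"A": [b0, b1], "B": [b2, b3]}
    for blocks in running.values():
        for block in blocks:
            block.ref_cnt += 1
    # Completed request C shared A's prefix; its references are still
    # held while an async offload is in flight (assumed scenario).
    for block in (b0, b1):
        block.ref_cnt += 1
    return running


running = build_scenario()
reported = num_common_prefix_blocks(running["A"], len(running))
actual = true_common_prefix_blocks(running)
print(f"reported={reported} actual={actual}")  # reported=2 actual=0
```

In this toy model the heuristic reports a two-block common prefix even though B shares nothing with A, because the extra references on A's blocks do not belong to a running request.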