[Core] Refactor Attention Take 2 #3462
Conversation
Really excited to see this PR being merged! Btw, if you have limited bandwidth, I am willing to help write tests that combine the attention backend + prepare_inputs!
Great to see this merged! Thanks for the work here.
Hi @WoosukKwon, after integrating the FlashInfer backend, how much improvement in overall throughput is expected? Thanks.
For example, in the current Llama 2 7B or 13B models, attention accounts for about 1/3 of the total time. If FlashInfer is integrated and we assume a 30% improvement in the kernel over the current implementation, would overall throughput increase by close to 10%?
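As a quick sanity check of that estimate, here is a back-of-the-envelope calculation. The 1/3 attention share and the 30% kernel improvement are the assumptions from the comment above, with "improvement" read as a 30% reduction in attention time:

```python
# Back-of-the-envelope estimate using the numbers assumed above.
attention_share = 1 / 3    # assumed fraction of step time spent in attention
kernel_improvement = 0.30  # assumed reduction in attention kernel time

new_total_time = (1 - attention_share) + attention_share * (1 - kernel_improvement)
throughput_gain = 1 / new_total_time - 1
print(f"Estimated throughput gain: {throughput_gain:.1%}")  # ~11.1%
```

If "30% improvement" instead means the kernel runs 1.3x faster, the same formula gives roughly an 8% gain, so either reading lands close to the 10% figure in the question.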
Looks like this directory was moved in vllm-project#3462, but the old directory has been hanging around for a while in our repo.
This PR is the second attempt to modularize the attention backends. The main goal of this PR is to hide any backend-specific attention implementation details from the main logic. This refactoring will greatly help introduce new backends, particularly the FlashInfer backend, which requires a different KV cache layout and input data structures from our current attention backends. NOTE: Since this PR just re-organizes the code, it shouldn't affect the functionality or performance of vLLM.
This PR defines three main classes for each attention backend: `AttentionBackend`, `AttentionMetadata`, and `AttentionImpl`. `AttentionBackend` (which can be queried via `get_attn_backend`) is a static class that defines the KV cache layout and swapping ops, and also works as a dispatcher for `AttentionMetadata` and `AttentionImpl`. `AttentionMetadata` is the same as the current `InputMetadata`, but it can differ for other backends added in the future. Finally, `AttentionImpl` is the actual implementation of the attention operator.

While ultimately I'd like to move part of `ModelRunner`'s `prepare_inputs` into `AttentionBackend`, I didn't do so here to keep the PR small.
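To make the split concrete, here is a minimal sketch of the three-class structure, under simplified assumptions: only the names `AttentionBackend`, `AttentionMetadata`, `AttentionImpl`, and the `get_attn_backend` query come from the PR description; every method name and field below is illustrative rather than vLLM's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Type


@dataclass
class AttentionMetadata:
    """Per-batch inputs for one backend (the role of today's InputMetadata)."""
    is_prompt: bool
    slot_mapping: List[int] = field(default_factory=list)  # hypothetical field


class AttentionImpl(ABC):
    """The actual attention operator for one backend."""

    @abstractmethod
    def forward(self, query, key, value, kv_cache,
                attn_metadata: AttentionMetadata):
        ...


class AttentionBackend(ABC):
    """Static class: owns the KV cache layout and swap ops, and dispatches to
    the backend's metadata and implementation classes."""

    @staticmethod
    @abstractmethod
    def get_kv_cache_shape(num_blocks: int, block_size: int,
                           num_kv_heads: int, head_size: int) -> Tuple[int, ...]:
        ...

    @staticmethod
    @abstractmethod
    def make_metadata(*args, **kwargs) -> AttentionMetadata:
        ...

    @staticmethod
    @abstractmethod
    def get_impl_cls() -> Type[AttentionImpl]:
        ...


# Hypothetical registry-based dispatch; the real get_attn_backend selection
# depends on hardware and the kernels that are installed.
_BACKEND_REGISTRY: Dict[str, Type[AttentionBackend]] = {}


def get_attn_backend(backend_name: str) -> Type[AttentionBackend]:
    try:
        return _BACKEND_REGISTRY[backend_name]
    except KeyError as e:
        raise ValueError(f"Unknown attention backend: {backend_name}") from e
```

With this kind of split, the model code only sees whichever `AttentionBackend` it was handed, so a backend like FlashInfer can define its own metadata and KV cache layout without touching the main logic.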