
[Core] Refactor Attention Take 2 #3462

Merged · 88 commits merged into main from flashinfer-take3 on Mar 25, 2024

Conversation

@WoosukKwon (Collaborator) commented Mar 18, 2024

This PR is the second attempt to modularize the attention backends. The main goal of this PR is to hide any backend-specific attention implementation details from the main logic. This refactoring will greatly help introduce new backends, particularly the FlashInfer backend, which requires a different KV cache layout and input data structures from our current attention backends. NOTE: Since this PR just re-organizes the code, it shouldn't affect the functionality or performance of vLLM.

This PR defines three main classes for each attention backend: AttentionBackend, AttentionMetadata, and AttentionImpl. AttentionBackend (which can be queried via get_attn_backend) is a static class that defines the KV cache layout and swapping ops, and also acts as a dispatcher for AttentionMetadata and AttentionImpl. AttentionMetadata is the same as the current InputMetadata, but it can differ for backends added in the future. Finally, AttentionImpl is the actual implementation of the attention operator.

While I'd ultimately like to move part of ModelRunner's prepare_inputs into AttentionBackend, I didn't do it here to keep the PR small.
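To make the division of responsibilities concrete, here is a minimal sketch (not code from the PR) of how the three classes described above might fit together. The specific method names, signatures, and metadata fields beyond AttentionBackend, AttentionMetadata, AttentionImpl, and get_attn_backend are illustrative assumptions, not the actual vLLM API.

```python
# Sketch only: illustrates the backend/metadata/impl split described in the PR.
# Method names and fields are assumptions, not the real vLLM interfaces.
from dataclasses import dataclass
from typing import Tuple, Type

import torch


class AttentionBackend:
    """Static entry point for one backend: defines the KV cache layout and
    swapping ops, and dispatches to the backend's metadata and impl classes."""

    @staticmethod
    def get_kv_cache_shape(num_blocks: int, block_size: int,
                           num_kv_heads: int, head_size: int) -> Tuple[int, ...]:
        # Backend-specific KV cache layout (e.g. FlashInfer needs a different one).
        raise NotImplementedError

    @staticmethod
    def make_metadata(*args, **kwargs) -> "AttentionMetadata":
        raise NotImplementedError

    @staticmethod
    def get_impl_cls() -> Type["AttentionImpl"]:
        raise NotImplementedError


@dataclass
class AttentionMetadata:
    """Per-batch inputs to the attention op (the role InputMetadata plays
    today); other backends may define different fields."""
    is_prompt: bool
    slot_mapping: torch.Tensor
    context_lens: torch.Tensor


class AttentionImpl:
    """The actual attention operator for one backend."""

    def __init__(self, num_heads: int, head_size: int, scale: float) -> None:
        self.num_heads = num_heads
        self.head_size = head_size
        self.scale = scale

    def forward(self, query: torch.Tensor, key: torch.Tensor,
                value: torch.Tensor, kv_cache: torch.Tensor,
                attn_metadata: AttentionMetadata) -> torch.Tensor:
        raise NotImplementedError


def get_attn_backend(dtype: torch.dtype) -> Type[AttentionBackend]:
    # Selects the backend class, e.g. based on the hardware and dtype in use.
    raise NotImplementedError
```

With this split, ModelRunner only ever touches the backend-agnostic surface: it asks get_attn_backend for a backend, builds that backend's metadata, and calls the impl's forward, so adding a backend like FlashInfer doesn't leak layout details into the main logic.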

@WoosukKwon WoosukKwon enabled auto-merge (squash) March 25, 2024 04:30
@WoosukKwon WoosukKwon merged commit 925f333 into main Mar 25, 2024
31 of 32 checks passed
@WoosukKwon WoosukKwon deleted the flashinfer-take3 branch March 25, 2024 05:01
@rkooo567 (Collaborator)

Really excited to see this PR merged! Btw, if you have limited bandwidth, I'm willing to help write tests that combine the attention backend and prepare_inputs!

@richardliaw (Collaborator)

Great to see this merged! Thanks for the work here.

@zhyncs (Contributor) commented Mar 27, 2024


Hi @WoosukKwon, after integrating the FlashInfer backend, how much improvement in overall throughput is expected? Thanks.

@zhyncs (Contributor) commented Mar 27, 2024

For example, with the current Llama 2 7B and 13B models, attention computation accounts for roughly 1/3 of the total time. If FlashInfer is integrated and its kernel is about 30% faster than the current implementation, would overall throughput increase by close to 10%?
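A quick back-of-the-envelope check of that estimate (my own arithmetic, not from the thread), treating it as an Amdahl's-law calculation with the fractions assumed above:

```python
# Amdahl's-law style estimate using the assumed numbers:
# attention is ~1/3 of total time and the kernel improves by ~30%.
attention_fraction = 1 / 3

# If "30% improvement" means the kernel is 1.3x faster:
t_faster = (1 - attention_fraction) + attention_fraction / 1.3
# If it means the kernel takes 30% less time:
t_shorter = (1 - attention_fraction) + attention_fraction * 0.7

print(f"{1 / t_faster - 1:.1%}")   # ~8.3% overall throughput gain
print(f"{1 / t_shorter - 1:.1%}")  # ~11.1% overall throughput gain
```

Either reading lands in the neighborhood of the ~10% figure, assuming attention remains about a third of end-to-end time.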

tlrmchlsmth added a commit to neuralmagic/nm-vllm that referenced this pull request on Apr 11, 2024:

Looks like this directory was moved in vllm-project#3462, but the old directory has been hanging around for a while in our repo.
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request on Sep 6, 2024.