
Import FlashInfer: 3x faster PagedAttention than vLLM #2767

Closed
casper-hansen opened this issue Feb 5, 2024 · 2 comments

Comments

@casper-hansen
Contributor

It looks like vLLM could directly import the PagedAttention kernels from FlashInfer to support GQA. From the FlashInfer blog post: "For batch GQA decoding attention, FlashInfer w/ Tensor Cores is 3x faster than vLLM PagedAttention when batch_size=64." @WoosukKwon

https://github.com/flashinfer-ai/flashinfer/
https://flashinfer.ai/2024/02/02/introduce-flashinfer.html

[Figure: FlashInfer vs. vLLM batch GQA decoding attention benchmark, from the FlashInfer blog post]
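For reference, a minimal sketch of what calling FlashInfer's batched paged-KV decode kernel looks like from Python, based on the wrapper API documented around the time of this issue (`begin_forward`/`forward`/`end_forward`; later versions rename these). This is not vLLM's actual integration, and exact argument names may differ across FlashInfer versions; the shapes and page-table values below are made up for illustration.

```python
# Sketch: batched GQA decode over a paged KV cache with FlashInfer.
# Assumes the early-2024 wrapper API; check the docs for your installed version.
import torch
import flashinfer

num_kv_heads, num_qo_heads, head_dim = 8, 64, 128   # GQA: 8 KV heads shared by 64 query heads
page_size, max_num_pages, batch_size = 16, 2048, 64

# Workspace buffer the wrapper uses for its internal scheduling metadata.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, kv_layout="NHD")

# Paged KV cache: [max_num_pages, 2 (K/V), page_size, num_kv_heads, head_dim] for "NHD".
kv_cache = torch.randn(
    max_num_pages, 2, page_size, num_kv_heads, head_dim,
    dtype=torch.float16, device="cuda",
)

# Toy page table: every request in the batch owns 4 full pages.
pages_per_req = 4
kv_indptr = torch.arange(0, (batch_size + 1) * pages_per_req, pages_per_req,
                         dtype=torch.int32, device="cuda")
kv_indices = torch.arange(batch_size * pages_per_req, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

# Build the per-request decode schedule, run one decode step, then release it.
wrapper.begin_forward(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    data_type=torch.float16,
)
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
out = wrapper.forward(q, kv_cache)   # [batch_size, num_qo_heads, head_dim]
wrapper.end_forward()
```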

@zhuohan123
Member

We are talking to the FlashInfer team and working on merging it with vLLM!

@sumo43

sumo43 commented Feb 5, 2024

Made a draft PR implementing FlashInfer; would love to help merge it. #2772

@linear linear bot closed this as not planned on Aug 6, 2024
@simon-mo simon-mo reopened this Aug 6, 2024
@simon-mo simon-mo closed this as completed Aug 6, 2024