Closed
Description
It looks like vLLM could directly import the PagedAttention kernels from FlashInfer to support GQA. From the FlashInfer announcement: "For batch GQA decoding attention, FlashInfer w/ Tensor Cores is 3x faster than vLLM PagedAttention when batch_size=64." cc @WoosukKwon
https://github.com/flashinfer-ai/flashinfer/
https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
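For reference, here is a rough sketch of how FlashInfer's batched paged-KV decode wrapper could be driven from vLLM's decode path. This is not vLLM code; the wrapper and method names follow FlashInfer's documented Python API from around this release (signatures may have changed since), and the head counts, page size, workspace size, and tensor shapes are illustrative assumptions:

```python
# Illustrative sketch only: drives FlashInfer's batched paged-KV decode kernel
# for GQA (num_qo_heads > num_kv_heads). Shapes, page size, and workspace size
# are assumptions for the example, not values taken from vLLM.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128   # GQA: 4 query heads per KV head
page_size, batch_size, max_pages = 16, 64, 64

# FlashInfer uses a scratch workspace buffer for its planning/execution phases.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Paged KV cache in "NHD" layout: (num_pages, 2, page_size, num_kv_heads, head_dim),
# where index 0/1 along dim 1 holds the K/V pages respectively.
kv_cache = torch.randn(
    max_pages, 2, page_size, num_kv_heads, head_dim,
    dtype=torch.float16, device="cuda",
)

# Block-table style metadata: one full page per sequence in this toy example.
kv_page_indices = torch.arange(batch_size, dtype=torch.int32, device="cuda")
kv_page_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full(
    (batch_size,), page_size, dtype=torch.int32, device="cuda"
)

# Plan the decode kernel for this batch's page layout, then run it.
decode_wrapper.begin_forward(
    kv_page_indptr, kv_page_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    data_type=torch.float16,
)
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
out = decode_wrapper.forward(q, kv_cache)  # (batch_size, num_qo_heads, head_dim)
decode_wrapper.end_forward()
```

The appeal is that FlashInfer's page-index/indptr metadata maps fairly directly onto vLLM's existing block tables, so the kernel could plausibly be swapped in behind the current attention backend interface.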