It looks like vLLM could directly import the PagedAttention kernels from FlashInfer to support GQA. "For batch GQA decoding attention, FlashInfer w/ Tensor Cores is 3x faster than vLLM PagedAttention when batch_size=64." @WoosukKwon
https://github.com/flashinfer-ai/flashinfer/
https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
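For context, here is a rough sketch of how FlashInfer's batch decode kernel with a paged KV cache could be invoked for a GQA configuration (more query heads than KV heads). This is not vLLM integration code: the `BatchDecodeWithPagedKVCacheWrapper` wrapper and the tensor layout follow the FlashInfer docs/blog post linked above, but the exact signatures (e.g. `begin_forward`/`forward` vs. newer `plan`/`run`) vary across FlashInfer versions, so treat it as illustrative only.

```python
# Illustrative sketch, not the actual vLLM integration. Shapes and call names
# follow the FlashInfer docs around the linked blog post and may have changed
# in later releases.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128   # GQA: 4 query heads per KV head
page_size, max_num_pages, batch_size = 16, 128, 64

# Workspace buffer used internally by the wrapper for scheduling metadata.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Paged KV cache: [max_num_pages, 2 (K/V), page_size, num_kv_heads, head_dim] in NHD layout.
kv_cache = torch.randn(
    max_num_pages, 2, page_size, num_kv_heads, head_dim,
    dtype=torch.float16, device="cuda",
)

# Page table in CSR form: here each request owns 2 pages and its last page is full.
kv_indptr = torch.arange(0, 2 * batch_size + 1, 2, dtype=torch.int32, device="cuda")
kv_indices = torch.arange(0, 2 * batch_size, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

wrapper.begin_forward(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
)

# One new query token per request during decoding.
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
out = wrapper.forward(q, kv_cache)   # [batch_size, num_qo_heads, head_dim]
wrapper.end_forward()
```

The key point for GQA is that `num_qo_heads > num_kv_heads` lets FlashInfer batch multiple query heads against the same KV head and use Tensor Cores for the decode GEMMs, which is where the reported 3x speedup over vLLM's PagedAttention at batch_size=64 comes from.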