Description
In theory, MQA/GQA can reduce the memory bandwidth needed to read the KV cache and enable the use of Tensor Cores for the dot products in the attention mechanism. However, these benefits can only be realized with optimized kernels, which vLLM does not have at the moment.
- For prefill, vLLM explicitly expands the incoming keys and values to the full number of query heads before running the attention op (see `vllm/vllm/model_executor/layers/attention.py`, lines 121 to 128 at commit `e5452dd`); a toy sketch of this expansion follows this list.
- For decode, vLLM's current paged attention kernel also does not leverage the benefits of MQA/GQA. To realize the benefit, we need to either significantly rewrite the paged attention kernel or modify the FlashAttention kernel to support the paged KV cache; the second sketch below illustrates the per-group KV reuse an optimized decode kernel could exploit.
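For context, here is a minimal sketch of the prefill-side expansion. The function name `expand_kv_for_mha` and the tensor shapes are illustrative assumptions rather than vLLM's actual code, but the `torch.repeat_interleave` pattern matches the kind of explicit expansion described above, which is exactly the extra memory traffic an MQA/GQA-aware kernel would avoid:

```python
import torch

def expand_kv_for_mha(key: torch.Tensor, value: torch.Tensor,
                      num_heads: int, num_kv_heads: int):
    """Replicate each KV head so a plain MHA kernel can consume GQA inputs.

    key/value: [num_tokens, num_kv_heads, head_size]
    Returns tensors of shape [num_tokens, num_heads, head_size].
    """
    num_queries_per_kv = num_heads // num_kv_heads
    # Each KV head is duplicated for every query head in its group.
    key = torch.repeat_interleave(key, num_queries_per_kv, dim=1)
    value = torch.repeat_interleave(value, num_queries_per_kv, dim=1)
    return key, value

# Example: 32 query heads sharing 8 KV heads (GQA group size 4).
key = torch.randn(16, 8, 128)
value = torch.randn(16, 8, 128)
k, v = expand_kv_for_mha(key, value, num_heads=32, num_kv_heads=8)
assert k.shape == (16, 32, 128) and v.shape == (16, 32, 128)
```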
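And a toy, non-paged reference for the decode case, showing why an MQA/GQA-aware kernel can read each KV head once per group of query heads instead of once per query head. This is a hypothetical illustration (the shapes, names, and single-new-token assumption are mine), not the paged attention kernel itself:

```python
import torch

def gqa_decode_reference(query, key_cache, value_cache, num_kv_heads):
    """Reference decode attention that reads each KV head only once per group.

    query:       [num_heads, head_size]             (single new token)
    key_cache:   [num_kv_heads, context_len, head_size]
    value_cache: [num_kv_heads, context_len, head_size]
    """
    num_heads, head_size = query.shape
    num_queries_per_kv = num_heads // num_kv_heads
    scale = head_size ** -0.5
    out = torch.empty_like(query)
    for kv_head in range(num_kv_heads):
        # All query heads in this group attend against the same KV head,
        # so its cached keys/values are loaded once, not group-size times.
        start = kv_head * num_queries_per_kv
        end = start + num_queries_per_kv
        q_group = query[start:end]                    # [group, head_size]
        k = key_cache[kv_head]                        # [context_len, head_size]
        v = value_cache[kv_head]                      # [context_len, head_size]
        attn = torch.softmax(q_group @ k.T * scale, dim=-1)  # [group, context_len]
        out[start:end] = attn @ v                     # [group, head_size]
    return out

# Example: 32 query heads, 8 KV heads, 1024 cached tokens.
q = torch.randn(32, 128)
k_cache = torch.randn(8, 1024, 128)
v_cache = torch.randn(8, 1024, 128)
out = gqa_decode_reference(q, k_cache, v_cache, num_kv_heads=8)
assert out.shape == (32, 128)
```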