
[Performance] Support MQA/GQA in prefill stage by using FlashAttention #2401

Closed
wants to merge 4 commits

Conversation

@zhaoyang-star (Contributor) commented Jan 10, 2024

As shown in #1880, xformers does not support MQA/GQA yet, so key and value need to be expanded before computing softmax(Q @ K^T * softmax_scale) @ V. FlashAttention, on the other hand, supports MQA/GQA natively and runs on Turing, Ampere, Ada, and Hopper GPUs. Note, however, that FA has some limits: head size up to 256, and only fp16 and bf16 dtypes.

So for prefill, I replaced xformers with FlashAttention whenever FA can handle the configuration; it falls back to xformers when the head size is larger than 256 or the dtype is float32. The benchmark shows a speedup of about 3x.
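A minimal sketch of this dispatch logic, assuming flash-attn's `flash_attn_varlen_func`; the function name `prefill_attention` and the `xformers_prefill_attention` fallback are illustrative, not the PR's actual code:

```python
import torch
from flash_attn import flash_attn_varlen_func

def prefill_attention(query, key, value, cu_seqlens, max_seqlen, scale):
    """Dispatch prefill attention to FlashAttention when supported, else xformers.

    query: (total_tokens, num_query_heads, head_size)
    key, value: (total_tokens, num_kv_heads, head_size); FA handles MQA/GQA
    natively, so no head expansion is needed on this path.
    """
    head_size = query.shape[-1]
    if head_size <= 256 and query.dtype in (torch.float16, torch.bfloat16):
        return flash_attn_varlen_func(
            query, key, value,
            cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
            max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
            softmax_scale=scale, causal=True,
        )
    # Fallback: the xformers path, which expands key/value across query heads.
    return xformers_prefill_attention(query, key, value, cu_seqlens, scale)
```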

  • Using CodeLLaMA-34B config (num_query_heads=64, num_key_value_heads=8, head_size=128)
  • Tested on A100-40GB
  • The latency is the time for calculating softmax(Q @ K^T * softmax_scale) @ V
  • The benchmark below can be reproduced by running benchmark_multi_query_kv_attention.py

Note: the benchmark below is invalid, as I misused torch.repeat_interleave when benchmarking the original case (see the updated results further down).

| Test id | Batch size | Prompt length | Original xformers (us) | FA (us) | Speedup (Original / FA) |
|---|---|---|---|---|---|
| 1 | 1 | 1024 | 6.875 | 1.625 | 4.2 |
| 2 | 10 | 1024 | 36.965 | 11.518 | 3.2 |
| 3 | 100 | 1024 | 364.861 | 126.109 | 2.9 |

@Yard1 (Collaborator) commented Jan 10, 2024

Hey, I cannot replicate the benchmark results with respect to the current implementation. Here is what I am getting:

  • --num-query-heads 64 --num-kv-heads 8 --head-size 128 --seq-len 1024 --batch-size 1: 2.715 us

  • --num-query-heads 64 --num-kv-heads 8 --head-size 128 --seq-len 1024 --batch-size 1, modified to use non-blocking repeat_interleave (see vllm attention.py): 1.876 us

  • --num-query-heads 64 --num-kv-heads 8 --head-size 128 --seq-len 1024 --batch-size 1 --use-flash-attn: 1.722 us

  • --num-query-heads 64 --num-kv-heads 8 --head-size 128 --seq-len 1024 --batch-size 10: 20.311 us

  • --num-query-heads 64 --num-kv-heads 8 --head-size 128 --seq-len 1024 --batch-size 10, modified to use non-blocking repeat_interleave (see vllm attention.py): 12.439 us

  • --num-query-heads 64 --num-kv-heads 8 --head-size 128 --seq-len 1024 --batch-size 10 --use-flash-attn: 12.166 us

As you can see, the FlashAttention result is close to what you have reported, but the current implementation's result is much closer to FlashAttention, especially after the benchmark code has been modified to match what's actually in the vllm code.

Here's the modification:

    from xformers import ops as xops
    from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask

    # query: (num_tokens, num_query_heads, head_size)
    # key, value: (num_tokens, num_kv_heads, head_size)
    query_expanded = query
    key_expanded = key
    value_expanded = value
    num_queries_per_kv = num_query_heads // num_kv_heads
    if num_queries_per_kv > 1:
        # Handle MQA and GQA: group the query heads per KV head and expand
        # key/value as zero-stride views (no copy), matching vllm attention.py.
        query_expanded = query_expanded.view(query_expanded.shape[0], num_kv_heads,
                                             num_queries_per_kv, query_expanded.shape[-1])
        key_expanded = key[:, :, None, :].expand(key.shape[0], num_kv_heads,
                                                 num_queries_per_kv, key.shape[-1])
        value_expanded = value[:, :, None, :].expand(value.shape[0], num_kv_heads,
                                                     num_queries_per_kv, value.shape[-1])
    attn_bias = BlockDiagonalCausalMask.from_seqlens(seq_lens)
    output = xops.memory_efficient_attention_forward(
        query_expanded.unsqueeze(0),
        key_expanded.unsqueeze(0),
        value_expanded.unsqueeze(0),
        attn_bias=attn_bias,
        p=0.0,
        scale=scale,
    )

A100-40GB, xformers==0.0.23.post1, flash_attn==2.4.2.

@zhaoyang-star zhaoyang-star marked this pull request as draft January 11, 2024 00:58
@zhaoyang-star (Contributor, Author) commented Jan 11, 2024

@Yard1 Thanks for the info. You are right: I should use expand, the same as in attention.py, rather than torch.repeat_interleave.

Updated benchmark results are below. Xformers is close to FA.

| Test id | Batch size | Prompt length | xformers (same expand as attention.py) (us) | FA (us) | Speedup (xformers / FA) |
|---|---|---|---|---|---|
| 1 | 1 | 1024 | 2.010 | 1.625 | 1.24 |
| 2 | 10 | 1024 | 13.188 | 11.518 | 1.14 |
| 3 | 100 | 1024 | 140.809 | 126.109 | 1.12 |

xformers==0.0.22, flash-attn==2.4.2.

cc @casper-hansen @beginlner

@casper-hansen (Contributor)

For MQA/GQA, it should mainly see a speedup during decoding, although this looks good for a start. Can we measure throughput difference?

@zhaoyang-star (Contributor, Author)

> For MQA/GQA, it should mainly see a speedup during decoding, although this looks good for a start. Can we measure throughput difference?

I used StarCoder, which is an MQA model, and the throughput is close to the original version.

@Lvjinhong

> For MQA/GQA, it should mainly see a speedup during decoding, although this looks good for a start. Can we measure throughput difference?

> I used StarCoder, which is an MQA model, and the throughput is close to the original version.

Hi, regarding the Llama architecture, does this significantly improve throughput?

@sh1ng (Contributor) commented Jan 11, 2024

From https://pytorch.org/docs/stable/generated/torch.Tensor.expand.html:

> Expanding a tensor does not allocate new memory, but only creates a new view on the existing tensor where a dimension of size one is expanded to a larger size by setting the stride to 0. Any dimension of size 1 can be expanded to an arbitrary value without allocating new memory.

I guess that explains the difference.
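A quick way to see this (a small illustrative check, not part of the PR):

```python
import torch

# (tokens, num_kv_heads, head_size), as in the prefill path above
key = torch.randn(4096, 8, 128)

# expand: a zero-stride view over the new head dimension, no extra memory
expanded = key[:, :, None, :].expand(4096, 8, 8, 128)
print(expanded.stride())                      # (1024, 128, 0, 1): stride 0 on the expanded dim
print(expanded.data_ptr() == key.data_ptr())  # True: same underlying storage

# repeat_interleave: materializes a full copy, 8x the memory here
repeated = key.repeat_interleave(8, dim=1)
print(repeated.shape)                         # torch.Size([4096, 64, 128])
print(repeated.data_ptr() == key.data_ptr())  # False: a new allocation
```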

@zhaoyang-star (Contributor, Author)

> For MQA/GQA, it should mainly see a speedup during decoding, although this looks good for a start. Can we measure throughput difference?

> I used StarCoder, which is an MQA model, and the throughput is close to the original version.

> Hi, regarding the Llama architecture, does this significantly improve throughput?

I guess the result will be similar to StarCoder, since StarCoder uses MQA and Llama-2 34B uses GQA. Besides, Llama-2 7B and 13B use MHA, so they will not gain any speedup.

@Lvjinhong

When I tested this PR with Llama-2 70B, the throughput did not improve. I used the AsyncLLMEngine server with 4x A800 80GB PCIe GPUs.

@zhaoyang-star (Contributor, Author) commented Jan 13, 2024

> When I tested this PR with Llama-2 70B, the throughput did not improve. I used the AsyncLLMEngine server with 4x A800 80GB PCIe GPUs.

Because the attention calculation only achieves a 1.1x to 1.2x speedup, the end-to-end speedup is hard to measure.

@zhaoyang-star zhaoyang-star marked this pull request as ready for review January 15, 2024 00:56
@casper-hansen (Contributor)

> When I tested this PR with Llama-2 70B, the throughput did not improve. I used the AsyncLLMEngine server with 4x A800 80GB PCIe GPUs.

> Because the attention calculation only achieves a 1.1x to 1.2x speedup, the end-to-end speedup is hard to measure.

The largest speedup will probably be seen during decoding.

@sighingnow (Contributor)

See also #3010, which introduces flash_attn_with_kvcache (available for paged KV cache since flash-attn >= 2.5.0).
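For context, a minimal sketch of a decode-step call to `flash_attn_with_kvcache` with a paged KV cache; shapes and values are illustrative, not taken from #3010, and it assumes flash-attn >= 2.5.0 on a CUDA GPU:

```python
import torch
from flash_attn import flash_attn_with_kvcache

batch, num_q_heads, num_kv_heads, head_size = 2, 64, 8, 128  # GQA config as in this PR
num_blocks, block_size = 64, 256                             # paged KV-cache layout

# One new query token per sequence (decode step).
q = torch.randn(batch, 1, num_q_heads, head_size, dtype=torch.float16, device="cuda")

# Paged KV cache: (num_blocks, block_size, num_kv_heads, head_size).
k_cache = torch.randn(num_blocks, block_size, num_kv_heads, head_size,
                      dtype=torch.float16, device="cuda")
v_cache = torch.randn_like(k_cache)

# Per-sequence mapping from logical to physical blocks, plus current cache lengths.
block_table = torch.randint(0, num_blocks, (batch, 8), dtype=torch.int32, device="cuda")
cache_seqlens = torch.tensor([100, 37], dtype=torch.int32, device="cuda")

out = flash_attn_with_kvcache(
    q, k_cache, v_cache,
    cache_seqlens=cache_seqlens,
    block_table=block_table,
    causal=True,
)
print(out.shape)  # (batch, 1, num_q_heads, head_size)
```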

@WoosukKwon (Collaborator)

Closing as it's already implemented. Thanks for submitting the PR. Learned a lot from it!

@WoosukKwon WoosukKwon closed this Aug 1, 2024