
small improvement in flash attention #1732

Merged: 1 commit, Jun 26, 2024

Conversation

@minhthuc2502 (Collaborator) commented on Jun 25, 2024

Some benchmarks with FA2, tested on an NVIDIA H100 GPU with a LLaMA 2 7B model quantized to 8-bit:

| Method | Speed (tok/s) | Batch size |
|---|---|---|
| Standard MHA | 85.8 | 1 |
| Standard MHA | 133.6 | 2 |
| Previous FA2 | 88.3 | 1 |
| Previous FA2 | 136.5 | 2 |
| Current FA2 | 90.1 | 1 |
| Current FA2 | 139.9 | 2 |

With a sequence length of ~1100 tokens:

| Method | Speed (tok/s) | Batch size |
|---|---|---|
| Standard MHA | 71.2 | 1 |
| Previous FA2 | 81.8 | 1 |
| Current FA2 | 85 | 1 |

With a sequence length of ~2200 tokens:

| Method | Speed (tok/s) | Batch size |
|---|---|---|
| Standard MHA | 65.6 | 1 |
| Current FA2 | 80.9 | 1 |
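
For readers who want to check the effect on their own hardware, below is a minimal sketch of a throughput measurement using CTranslate2's Python API. This is not the script used for the numbers above: the `flash_attention` constructor flag follows the CTranslate2 4.x API, and the converted model directory and tokenizer path are hypothetical placeholders.

```python
# Minimal throughput sketch (assumptions: CTranslate2 4.x API, where the
# Generator constructor accepts a `flash_attention` flag; model/tokenizer
# paths below are hypothetical).
import time

import ctranslate2
import sentencepiece as spm

MODEL_DIR = "llama-2-7b-ct2-int8"  # hypothetical output of ct2-transformers-converter --quantization int8
sp = spm.SentencePieceProcessor(model_file="llama-2-7b/tokenizer.model")  # hypothetical path


def tokens_per_second(flash_attention: bool, batch_size: int = 1) -> float:
    generator = ctranslate2.Generator(
        MODEL_DIR,
        device="cuda",
        compute_type="int8",
        flash_attention=flash_attention,  # toggles the FA2 path benchmarked above
    )
    prompt = ["<s>"] + sp.encode("Write a short story about a robot.", out_type=str)
    start = time.time()
    results = generator.generate_batch(
        [prompt] * batch_size,
        max_length=512,
        include_prompt_in_result=False,  # count only newly generated tokens
    )
    elapsed = time.time() - start
    new_tokens = sum(len(r.sequences_ids[0]) for r in results)
    return new_tokens / elapsed


for fa in (False, True):
    print(f"flash_attention={fa}: {tokens_per_second(fa):.1f} tok/s")
```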

@minhthuc2502 merged commit 72a461a into OpenNMT:master on Jun 26, 2024. 17 checks passed.