
small improvement in flash attention #1732

Merged: 1 commit, Jun 26, 2024

Conversation

@minhthuc2502 (Collaborator) commented on Jun 25, 2024

Some benchmarks with FA2, tested on an NVIDIA H100 GPU with a LLaMA 2 7B model quantized to 8-bit:

| Method | Speed (tok/s) | Batch size |
|---|---|---|
| Standard MHA | 85.8 | 1 |
| Standard MHA | 133.6 | 2 |
| Previous FA2 | 88.3 | 1 |
| Previous FA2 | 136.5 | 2 |
| Current FA2 | 90.1 | 1 |
| Current FA2 | 139.9 | 2 |

With a sequence length of ~1100 tokens:

| Method | Speed (tok/s) | Batch size |
|---|---|---|
| Standard MHA | 71.2 | 1 |
| Previous FA2 | 81.8 | 1 |
| Current FA2 | 85 | 1 |

With a sequence length of ~2200 tokens:

| Method | Speed (tok/s) | Batch size |
|---|---|---|
| Standard MHA | 65.6 | 1 |
| Current FA2 | 80.9 | 1 |
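
For readers who want to check the effect on their own hardware, below is a minimal sketch of a throughput measurement using CTranslate2's Python API. This is not the script used for the numbers above: the `flash_attention` constructor flag follows the CTranslate2 4.x API, and the converted model directory and tokenizer path are hypothetical placeholders.

```python
# Minimal throughput sketch (assumptions: CTranslate2 4.x API, where the
# Generator constructor accepts a `flash_attention` flag; model/tokenizer
# paths below are hypothetical).
import time

import ctranslate2
import sentencepiece as spm

MODEL_DIR = "llama-2-7b-ct2-int8"  # hypothetical output of ct2-transformers-converter --quantization int8
sp = spm.SentencePieceProcessor(model_file="llama-2-7b/tokenizer.model")  # hypothetical path


def tokens_per_second(flash_attention: bool, batch_size: int = 1) -> float:
    generator = ctranslate2.Generator(
        MODEL_DIR,
        device="cuda",
        compute_type="int8",
        flash_attention=flash_attention,  # toggles the FA2 path benchmarked above
    )
    prompt = ["<s>"] + sp.encode("Write a short story about a robot.", out_type=str)
    start = time.time()
    results = generator.generate_batch(
        [prompt] * batch_size,
        max_length=512,
        include_prompt_in_result=False,  # count only newly generated tokens
    )
    elapsed = time.time() - start
    new_tokens = sum(len(r.sequences_ids[0]) for r in results)
    return new_tokens / elapsed


for fa in (False, True):
    print(f"flash_attention={fa}: {tokens_per_second(fa):.1f} tok/s")
```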

@minhthuc2502 merged commit 72a461a into OpenNMT:master on Jun 26, 2024. 17 checks passed.