Refactor attention kernels #53
Conversation
@@ -0,0 +1,5 @@
#pragma once
Let's use the define guard instead of #pragma once, per Google's C++ style guide :)
Either option has pros and cons. I think it's safe to use #pragma once, because it is commonly used in DL projects such as PyTorch and FasterTransformer.
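For comparison, the define-guard alternative suggested above would look roughly like this (the guard macro name is a hypothetical placeholder, not taken from the PR):

```cpp
// Define-guard alternative to `#pragma once`, per the Google C++ style guide.
// CSRC_ATTENTION_UTILS_H_ is a made-up guard name used only for illustration.
#ifndef CSRC_ATTENTION_UTILS_H_
#define CSRC_ATTENTION_UTILS_H_

// ... header contents ...

#endif  // CSRC_ATTENTION_UTILS_H_
```

Both forms prevent double inclusion; #pragma once is shorter and widely supported, while the define guard is standard C++ and is what the style guide mandates.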
Performance (batch_size=8, context_len=512, num_heads=40, head_size=128):
There's a slight improvement in the kernel performance due to the use of fp16 in logits * V.
This PR refactors the attention kernels, making the helper functions more modular and pruning unused code. This will make it easier to add support for a new data type such as bfloat16.

In addition, this PR reduces the computation overhead of the attention kernel by using reduced precision (i.e., fp16) for the logits * V computation instead of full precision. This is consistent with FasterTransformer's implementation.
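As a rough illustration of the fp16 logits * V idea, here is a conceptual sketch, not the actual vLLM kernel: the function name, data layout, and launch configuration are assumptions, and half arithmetic like this requires compute capability 5.3+.

```cuda
// Conceptual sketch only (assumed names/layout, not the vLLM kernel):
// accumulate logits * V in fp16 instead of fp32.
#include <cuda_fp16.h>

// One thread per element of the output head dimension.
// logits: [context_len] attention weights for a single head/query.
// v:      [context_len, head_size] value vectors.
// out:    [head_size] result of sum_t logits[t] * v[t, :].
__global__ void logits_times_v_fp16(const half* __restrict__ logits,
                                    const half* __restrict__ v,
                                    half* __restrict__ out,
                                    int context_len,
                                    int head_size) {
  int d = blockIdx.x * blockDim.x + threadIdx.x;
  if (d >= head_size) return;

  // fp16 accumulator: cheaper arithmetic, but less precise than fp32.
  half acc = __float2half(0.0f);
  for (int t = 0; t < context_len; ++t) {
    // acc += logits[t] * v[t][d], fused in half precision.
    acc = __hfma(logits[t], v[t * head_size + d], acc);
  }
  out[d] = acc;
}
```

Accumulating in fp16 like this trades some numerical accuracy for lower arithmetic cost, which is where the slight kernel speedup reported above would come from.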