Refactor attention kernels #53


Merged · 17 commits · May 3, 2023

Conversation

@WoosukKwon (Collaborator) commented May 2, 2023

This PR refactors the attention kernels, making the helper functions more modular and pruning unused code. This will make it easier to add support for new data types such as bfloat16.
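As a rough illustration of what such modular helpers enable, here is a minimal dtype-trait sketch; the trait name `FloatConv` and its members are assumptions for illustration, not the actual vLLM code:

```cuda
// Illustrative only: a trait that centralizes per-dtype conversions so the
// kernel body can stay dtype-agnostic. Names here are assumptions, not vLLM's.
#include <cuda_bf16.h>
#include <cuda_fp16.h>

template <typename T>
struct FloatConv;  // maps a storage type to its float conversions

template <>
struct FloatConv<__half> {
  static __device__ float to_float(__half x) { return __half2float(x); }
  static __device__ __half from_float(float x) { return __float2half(x); }
};

// Adding a new data type (e.g., bfloat16) then only requires one more
// specialization instead of touching every helper in the kernel:
template <>
struct FloatConv<__nv_bfloat16> {
  static __device__ float to_float(__nv_bfloat16 x) { return __bfloat162float(x); }
  static __device__ __nv_bfloat16 from_float(float x) { return __float2bfloat16(x); }
};
```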

In addition, this PR reduces the computation overhead of the attention kernel by using reduced precision (i.e., fp16) for the logits * V computation instead of full precision. This is consistent with FasterTransformer's implementation.
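A minimal sketch of the difference between the two accumulation strategies (illustrative only, assuming a simple per-thread loop; the real kernel distributes and vectorizes this work across threads and is not reproduced here):

```cuda
// Illustrative sketch of full- vs. reduced-precision logits * V accumulation.
// Function and variable names are placeholders, not the actual vLLM kernel.
#include <cuda_fp16.h>

__device__ float logits_dot_v_fp32(const float* logits, const float* v, int n) {
  // Full-precision path: multiply and accumulate entirely in fp32.
  float acc = 0.0f;
  for (int i = 0; i < n; ++i) {
    acc += logits[i] * v[i];
  }
  return acc;
}

__device__ float logits_dot_v_fp16(const float* logits, const __half* v, int n) {
  // Reduced-precision path: cast the logits to fp16 and do the multiply-add
  // in half precision, upcasting only the final result.
  __half acc = __float2half(0.0f);
  for (int i = 0; i < n; ++i) {
    acc = __hfma(__float2half(logits[i]), v[i], acc);
  }
  return __half2float(acc);
}
```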

@WoosukKwon requested a review from zhuohan123 on May 2, 2023 07:38
@@ -0,0 +1,5 @@
#pragma once
Member commented:
Let's use a define guard instead of #pragma once, per Google's C++ style guide :)

WoosukKwon (Collaborator, Author) replied:

Either option has pros and cons. I think it's safe to use #pragma once, because it is commonly used in DL projects such as PyTorch and FasterTransformer.
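For reference, the two options under discussion look roughly like this (a sketch with illustrative file and macro names, not the actual vLLM header):

```cuda
// Option 1: define (include) guard, as the Google C++ style guide recommends.
// The macro name is illustrative; Google derives it from the file's path.
#ifndef CSRC_ATTENTION_UTILS_CUH_
#define CSRC_ATTENTION_UTILS_CUH_

// ... declarations ...

#endif  // CSRC_ATTENTION_UTILS_CUH_
```

```cuda
// Option 2: #pragma once, non-standard but supported by all major compilers
// and widely used in DL projects such as PyTorch and FasterTransformer.
#pragma once

// ... declarations ...
```

Both prevent multiple inclusion; the trade-off is portability and explicitness (guards) versus brevity and no risk of macro-name collisions (#pragma once).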

@WoosukKwon requested a review from zhuohan123 on May 3, 2023 06:27

@WoosukKwon (Collaborator, Author) commented May 3, 2023

Performance (batch_size=8, context_len=512, num_heads=40, head_size=128):

Before: 83.4 us
After: 82.5 us

There's a slight improvement in kernel performance due to the use of fp16 in the logits * values computation.
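For context, kernel latencies at this scale are usually averaged over many launches with CUDA events; a generic timing sketch, assuming a placeholder `attention_kernel` launch (this is not the actual benchmark script behind the numbers above):

```cuda
// Generic CUDA-event timing sketch; the kernel launch is a placeholder.
#include <cuda_runtime.h>

float time_kernel_avg_us(int iters) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Warm-up launches would go here, e.g.:
  // attention_kernel<<<grid, block>>>(...);  // placeholder
  cudaDeviceSynchronize();

  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) {
    // attention_kernel<<<grid, block>>>(...);  // placeholder
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);  // elapsed time over all iterations, in ms
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms * 1000.0f / iters;  // average microseconds per launch
}
```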

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request Jun 18, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this pull request Jul 22, 2024
@alixiaodi mentioned this pull request Aug 2, 2024
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request Mar 27, 2025