Refactor attention kernels #53
Conversation
@@ -0,0 +1,5 @@
#pragma once
Let's use the define guard instead of #pragma once, per Google's C++ style guide :)
Either option has pros and cons. I think it's safe to use #pragma once, because it is commonly used in DL projects such as PyTorch and FasterTransformer.
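For comparison, the define-guard alternative suggested above would look roughly like this (the guard macro name is a hypothetical placeholder, not taken from the PR):

```cpp
// Define-guard alternative to `#pragma once`, per the Google C++ style guide.
// CSRC_ATTENTION_UTILS_H_ is a made-up guard name used only for illustration.
#ifndef CSRC_ATTENTION_UTILS_H_
#define CSRC_ATTENTION_UTILS_H_

// ... header contents ...

#endif  // CSRC_ATTENTION_UTILS_H_
```

Both forms prevent double inclusion; #pragma once is shorter and widely supported, while the define guard is standard C++ and is what the style guide mandates.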
Performance (batch_size=8, context_len=512, num_heads=40, head_size=128):
There's a slight improvement in the kernel performance due to the use of fp16 in logits * V.
This PR refactors the attention kernels, making the helper functions more modular and pruning unused code. This will make it easier to add support for a new data type such as bfloat16.

In addition, this PR reduces the computation overhead of the attention kernel by using reduced precision (i.e., fp16) for the logits * V computation instead of full precision. This is consistent with FasterTransformer's implementation.
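As a rough illustration of the fp16 logits * V idea, here is a conceptual sketch, not the actual vLLM kernel: the function name, data layout, and launch configuration are assumptions, and half arithmetic like this requires compute capability 5.3+.

```cuda
// Conceptual sketch only (assumed names/layout, not the vLLM kernel):
// accumulate logits * V in fp16 instead of fp32.
#include <cuda_fp16.h>

// One thread per element of the output head dimension.
// logits: [context_len] attention weights for a single head/query.
// v:      [context_len, head_size] value vectors.
// out:    [head_size] result of sum_t logits[t] * v[t, :].
__global__ void logits_times_v_fp16(const half* __restrict__ logits,
                                    const half* __restrict__ v,
                                    half* __restrict__ out,
                                    int context_len,
                                    int head_size) {
  int d = blockIdx.x * blockDim.x + threadIdx.x;
  if (d >= head_size) return;

  // fp16 accumulator: cheaper arithmetic, but less precise than fp32.
  half acc = __float2half(0.0f);
  for (int t = 0; t < context_len; ++t) {
    // acc += logits[t] * v[t][d], fused in half precision.
    acc = __hfma(logits[t], v[t * head_size + d], acc);
  }
  out[d] = acc;
}
```

Accumulating in fp16 like this trades some numerical accuracy for lower arithmetic cost, which is where the slight kernel speedup reported above would come from.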