
Commit d5bf9bc

sanyalington, charlifu, and gshtras authored
Add BF16 support to custom PA (#133)
* tightened atol for custom PA; enable supported head size, block sizes in testing
* update num_blocks and num_iters in benchmark PA to realistic settings
* move to generic b16 type
* bf16 first port
* enabled all bf16 tests, set atol for bf16
* enable custom PA for bf16 as well as block size 32 and head size 64
* fix cast to zero in custom PA reduce
* py linter fixes
* clang format fixes
* div round up clang-format

---------

Co-authored-by: Charlie Fu <Charlie.Fu@amd.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
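The "div round up" item above refers to ceiling integer division, a common idiom in paged-attention kernels for computing how many fixed-size blocks are needed to cover a sequence. A minimal sketch of that idiom (the function name and usage are illustrative, not taken from the patch):

```python
def div_round_up(a: int, b: int) -> int:
    # Integer division that rounds up instead of down:
    # e.g. 10 elements in blocks of 4 need 3 blocks, not 2.
    return (a + b - 1) // b

# Hypothetical usage: number of KV-cache blocks for a sequence
seq_len = 100
block_size = 32
num_blocks_needed = div_round_up(seq_len, block_size)
print(num_blocks_needed)
```

This avoids floating-point `math.ceil(a / b)` and stays exact for large integers.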
1 parent 636ff01 commit d5bf9bc

File tree

4 files changed: +271 −157 lines changed


benchmarks/kernels/benchmark_paged_attention.py

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@
 from vllm._custom_C import paged_attention_custom
 from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, create_kv_caches_with_random

-NUM_BLOCKS = 1024
+NUM_BLOCKS = 1024 * 1024
 PARTITION_SIZE = 256
@@ -176,7 +176,7 @@ def run_cuda_benchmark(num_iters: int, profile: bool = False) -> float:
     if do_profile:
         latency = run_benchmark(num_iters=1, profile=True)
     else:
-        latency = run_benchmark(num_iters=100, profile=False)
+        latency = run_benchmark(num_iters=1000, profile=False)
     print(f"Kernel running time: {latency * 1000000:.3f} us")
0 commit comments