Conversation

@MasterJH5574
Collaborator

This PR adds q/k position information to the batch prefill/decode kernels. More specifically, the kernels now accept two additional arrays:

  • `q_rope_position` with shape `(total_q_len,)`, giving the in-sequence position of each token in the input q.
  • `k_rope_pos_offset` with shape `(num_sequence,)`, giving the start position of each sequence in k.

These two arrays enable on-the-fly RoPE computation in multi-level cases.

Tests `test_batch_prefill` and `test_batch_decode` pass. Performance is not validated yet. Per discussion with Zihao, this change is unlikely to incur a significant perf regression.
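To illustrate how the two arrays are meant to be consumed, here is a minimal NumPy sketch of on-the-fly RoPE with explicit positions. The helper name `apply_rope` and the sample values are hypothetical, not the kernel's actual code; the point is only that each query token gets its angle from `q_rope_position`, while key `j` of sequence `s` would use position `k_rope_pos_offset[s] + j`.

```python
import numpy as np

def apply_rope(x, positions, rope_theta=1e4):
    # Hypothetical illustration, not the FlashInfer kernel code.
    # x: (n, head_dim) with head_dim even; positions: (n,) absolute positions.
    # Each pair (x[2i], x[2i+1]) is rotated by positions / theta^(2i/d).
    d = x.shape[-1]
    inv_freq = 1.0 / (rope_theta ** (np.arange(0, d, 2) / d))  # (d/2,)
    ang = positions[:, None] * inv_freq[None, :]               # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Two ragged sequences packed into one batch: query lengths 3 and 2.
# q_rope_position gives the absolute position of every packed query token;
# k_rope_pos_offset gives where each sequence's keys start in the cache.
q = np.random.randn(5, 8)
q_rope_position = np.array([5, 6, 7, 0, 1])  # per-token query positions
k_rope_pos_offset = np.array([0, 0])         # per-sequence key start offsets
q_rot = apply_rope(q, q_rope_position)
```

Passing positions explicitly is what makes the multi-level case work: the kernel no longer has to assume queries start at the end of the cached keys.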

@yzh119
Collaborator

yzh119 commented Jan 21, 2024

I'll merge this into the mainline after #75 gets merged.

@MasterJH5574 force-pushed the qk-rope-info branch 2 times, most recently from 5b189f5 to 47686ef on January 29, 2024 18:49
@yzh119
Collaborator

yzh119 commented Jan 31, 2024

Sorry about the new conflicts, I'll take care of them tomorrow.

@yzh119 left a comment


LGTM, thank you @MasterJH5574 !

@yzh119 yzh119 merged commit a389ed4 into flashinfer-ai:main Feb 1, 2024
yzh119 added a commit that referenced this pull request Feb 16, 2024
This PR fixes #113, which was caused by #69 changing the
`BatchPrefillWithPagedKVCacheWrapperDispatched` signature without
`flashinfer_decl.h` being updated accordingly.

Also fixes some tiny format issues in #111.
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
Adds a common wrapper function to mma_ops.hpp for hgemm kernels that works for both CUDA and HIP, replacing `mma_sync_m16n16k16_row_col_f16f16f32`.
