@krishnaraj36 (Contributor) commented on Oct 22, 2024

Improvements:

- Added a transpose of K for better vectorization during the matmul (see the sketch after the table below).
- Improved the load schedule.
- A bit more than 2x improvement in most cases.

Llama-2 7B observation:

| kernel                  | baseline | optimized |
| ----------------------- | -------- | --------- |
| batch_prefill_ragged_kv | 15 ms    | 7.1 ms    |
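A minimal numpy sketch of why the K transpose helps, assuming a row-major `[kv_len, head_dim]` K layout; this is illustrative only, the real change lives in the TIR attention schedule:

```python
# Illustrative sketch only; names and shapes are assumptions, not the kernel's.
import numpy as np

kv_len, head_dim, vec = 64, 128, 4
K = np.random.rand(kv_len, head_dim).astype(np.float32)
q = np.random.rand(head_dim).astype(np.float32)

KT = np.ascontiguousarray(K.T)  # one-time transpose to [head_dim, kv_len]

# Score computation s[j] = sum_d q[d] * K[j, d]. With KT, fixing d and
# sweeping j reads contiguous memory, so a GPU lane can fetch `vec`
# neighboring kv positions with one vload4-style access; with K it would
# be a stride-head_dim gather.
s = np.zeros(kv_len, dtype=np.float32)
for d in range(head_dim):
    for j0 in range(0, kv_len, vec):
        s[j0:j0 + vec] += q[d] * KT[d, j0:j0 + vec]  # contiguous slice

assert np.allclose(s, K @ q)
```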

This PR fixes the issue raised in PR #17446. The correctness issue was caused by incorrect code generation during the unroll phase, so we removed the explicit unroll and observed little to no performance degradation.
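A minimal sketch of the schedule change, assuming a typical TIR schedule for the store block; the block name and split factor are illustrative, not the actual kernel's:

```python
from tvm import tir

def schedule_o_store(sch: tir.Schedule) -> None:
    blk = sch.get_block("O_store")            # assumed block name
    x = sch.get_loops(blk)[-1]
    xo, xi = sch.split(x, factors=[None, 8])  # assumed split factor
    # sch.unroll(xi)  # removed by this PR: the explicit unroll triggered
    #                 # the incorrect pointer-offset codegen shown below
```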

We generated the OpenCL kernels by extracting the built modules, setting num_qo_heads=28 in
https://github.qualcomm.com/gpgpu/apache-tvm/blob/85e15d494d5a42360859941cbc972c4f175c3b94/tests/python/relax/test_runtime_builtin_paged_attention_kv_cache_flashinfer.py#L36
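A rough sketch of how the generated OpenCL source can be dumped, not the exact commands used here; `mod` is assumed to be the IRModule extracted from the test above:

```python
import tvm

lib = tvm.build(mod, target=tvm.target.Target("opencl"))
# Device code lives in an imported module; print the generated OpenCL C.
print(lib.imported_modules[0].get_source())
```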
Original PR Codegen

```c
int cur_L_3 = ((((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) + 1) / 7) + (((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) + 1) % 7) >> 31)) + q_indptr[(b_idx_1 + q_indptr_elem_offset)]);
if (cur_L_3 < q_indptr[((b_idx_1 + q_indptr_elem_offset) + 1)]) {
    vstore4((convert_half4((O_local[3] / ((float4)(d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)]))))), 0, output + (((((cur_L_3 * 3584) + ((convert_int(get_group_id(1))) * 896)) + ((((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) + 1) % 7) + (7 & (((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) + 1) % 7) >> 31))) * 128)) + (((convert_int(get_local_id(0))) & 15) * 8)) + 4));
}
int cur_L_4 = ((((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) - 2147483637) / 7) - -306783377) + q_indptr[(b_idx_1 + q_indptr_elem_offset)]);
if (cur_L_4 < q_indptr[((b_idx_1 + q_indptr_elem_offset) + 1)]) {
    vstore4((convert_half4((O_local[4] / ((float4)(d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)]))))), 0, output + ((((cur_L_4 * 3584) + ((convert_int(get_group_id(1))) * 896)) + (((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) - 2147483637) % 7) * 128)) + (((convert_int(get_local_id(0))) & 15) * 8)));
}
```

In the O_store block we noticed that large, incorrect pointer offsets were generated during later stages of the unroll. This shows up indirectly as zero elements in the output and as numerical instability.

Fusing the unroll loops so that they unroll together does not resolve this.

Oddly enough, the initial test case doesn't seem to trigger the issue and works as intended.

```c
int cur_L_3 = ((((((convert_int(get_local_id(0))) >> 4) + ((LH_start + 1) >> 2)) >> 1) + q_indptr[(b_idx_1 + q_indptr_elem_offset)]) + (convert_int(get_local_id(1))));
if (cur_L_3 < q_indptr[((b_idx_1 + q_indptr_elem_offset) + 1)]) {
    vstore4((convert_half4((O_local[3] / ((float4)(d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)]))))), 0, output + (((((cur_L_3 * 4096) + ((convert_int(get_group_id(1))) * 1024)) + (((((((convert_int(get_local_id(0))) >> 4) * 4) + (LH_start & 7)) + 1) & 7) * 128)) + (((convert_int(get_local_id(0))) & 15) * 8)) + 4));
}
int cur_L_4 = ((((((convert_int(get_local_id(0))) >> 4) + ((LH_start + 2) >> 2)) >> 1) + q_indptr[(b_idx_1 + q_indptr_elem_offset)]) + (convert_int(get_local_id(1))));
if (cur_L_4 < q_indptr[((b_idx_1 + q_indptr_elem_offset) + 1)]) {
    vstore4((convert_half4((O_local[4] / ((float4)(d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)]))))), 0, output + ((((cur_L_4 * 4096) + ((convert_int(get_group_id(1))) * 1024)) + (((((((convert_int(get_local_id(0))) >> 4) * 4) + (LH_start & 7)) + 2) & 7) * 128)) + (((convert_int(get_local_id(0))) & 15) * 8)));
}
```

@krishnaraj36 (Contributor, Author) commented:

@MasterJH5574 @tqchen
We have fixed the issue raised in PR #17466.
Can you please take a look at this PR?

@MasterJH5574 (Contributor) left a review comment:

Thank you @krishnaraj36 so much for the fix!

@MasterJH5574 (Contributor) commented:

I have also observed the "large and incorrect" pointer offsets before, but I didn't have time to nail down the issue. As I recall, they are generated by a floordiv simplification in src/tir/transforms/lower_intrin.cc.
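For context, a small illustration (my reading, not TVM code) of the floordiv-to-truncdiv legalization that produces the patterns above: C's `/` truncates toward zero, so the lowering emits a correction term for possibly-negative numerators, which is exactly the `((x % 7) >> 31)` term in the first kernel. The suspicious constants in the bad kernel (2147483637 is INT32_MAX - 10; 306783377 is roughly INT32_MAX / 7) appear to come from a different rewrite that shifts the numerator by a large multiple of 7.

```python
# Illustrative only: Python's // is floor division; emulate C truncdiv and
# show the correction term that lower_intrin-style legalization emits.

def truncdiv(a: int, b: int) -> int:
    # C semantics: round toward zero.
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def truncmod(a: int, b: int) -> int:
    return a - truncdiv(a, b) * b

for x in (-8, -1, 0, 6, 13):
    # On int32 hardware, (truncmod(x, 7) >> 31) is -1 iff the remainder
    # is negative; model that with a comparison here.
    correction = -1 if truncmod(x, 7) < 0 else 0
    assert truncdiv(x, 7) + correction == x // 7  # matches true floordiv
```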

@krishnaraj36 (Contributor, Author) commented:

> Thank you @krishnaraj36 so much for the fix!

@MasterJH5574
There is only one change (removing `sch.unroll(xi)`) relative to the previous commit, which was reverted.

@srkreddy1238 merged commit e3e27f5 into apache:main on Oct 28, 2024.

ShiboXing pushed a commit to ShiboXing/tvm that referenced this pull request on Aug 10, 2025.