Skip to content

Optimize flash varlen paged KV cache addressing#3564

Open
yysheng26 wants to merge 1 commit into
flagos-ai:masterfrom
yysheng26:optimize/flash-varlen-k-page-stride
Open

Optimize flash varlen paged KV cache addressing#3564
yysheng26 wants to merge 1 commit into
flagos-ai:masterfrom
yysheng26:optimize/flash-varlen-k-page-stride

Conversation

@yysheng26
Copy link
Copy Markdown
Contributor

PR Category

Benchmark

Type of Change

Performance Optimization

Description

Optimize paged KV cache address calculation in flash_attn_varlen_func.

This PR adds a contiguous KV cache fastpath and lets k_page_stride participate in Triton kernel specialization. For contiguous paged KV cache, the kernel uses the simpler row-id based offset formula. For non-contiguous paged KV cache, specializing k_page_stride avoids the slow generic runtime-stride address path.

A non-contiguous KV cache benchmark is also added to cover the case where k.stride(0) != block_size * k.stride(-3).

Issue

Associated with PR #3410.

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change is responded to an issue.
  • Change is fully covered by a UT.

Performance

Benchmark command:

pytest benchmark/test_flash_attn_varlen_func.py -v -s

Long paged KV case, fp16:

  • Contiguous: 0.797 ms
  • Non-contiguous before: ~1.62 ms
  • Non-contiguous after: ~0.80 ms

The non-contiguous long case is now close to the contiguous fastpath performance.

).to(tl.int64)
else:
page_block_index = tl.load(page_table_ptr + virtual_page_index).to(tl.int64)
if IS_CONTIGUOUS_KVCACHE:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after we remove k_page_stride from do_not_specialize list, are we still need this if IS_CONTIGUOUS_KVCACHE ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants