[Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. #1940
Conversation
server/Makefile-flash-att-v2 (outdated)
@@ -1,11 +1,11 @@
flash_att_v2_commit_cuda := 23e8fa5a263d1c7122bc46a86ef32030ee7130f9
flash_att_v2_commit_cuda := v2.5.8
We can actually pip install now.
Looks like a pretty straightforward change, added some comments.
num_seqs, num_heads, head_size = query.shape
max_num_partitions = (max_s + _PARTITION_SIZE - 1) // _PARTITION_SIZE
input_lengths = cu_seqlen_k

# NOTE(woosuk): We use a simple heuristic to decide whether to use
NIT: this comment should move down to the paged attention version condition.
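For context on that NOTE: it refers to the kind of partition-count heuristic used to pick between the V1 and V2 paged attention kernels. The sketch below is illustrative only (the threshold and the use_v1_kernel name are assumptions in the spirit of vLLM's heuristic, not necessarily what this PR ships):

_PARTITION_SIZE = 512

def use_v1_kernel(max_s: int, num_seqs: int, num_heads: int) -> bool:
    # Ceil-divide the longest sequence into partitions of _PARTITION_SIZE tokens.
    max_num_partitions = (max_s + _PARTITION_SIZE - 1) // _PARTITION_SIZE
    # V1 is preferred when every sequence fits in one partition or when there is
    # already enough parallelism across (sequence, head) pairs; otherwise V2
    # splits long sequences into partitions and reduces across them.
    return max_num_partitions == 1 or num_seqs * num_heads > 512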
cu_seqlen_k = torch.cat(
    [
        torch.zeros(
            (1,), device=input_lengths.device, dtype=input_lengths.dtype
        ),
        input_lengths.cumsum(dim=-1),
    ]
).to(dtype=torch.int32)
Not sure if this is premature optimization, but saves two allocations:
cu_seqlen_k = torch.empty(input_lengths.size(-1) + 1, device=input_lengths.device, dtype=torch.int32)
cu_seqlen_k[0] = 0
torch.cumsum(input_lengths, -1, out=cu_seqlen_k[1:])
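A quick sanity check (illustrative only, with made-up lengths) that the preallocated version yields the same cumulative-lengths tensor as the torch.cat construction above:

import torch

input_lengths = torch.tensor([3, 5, 2], dtype=torch.int32)

# Original construction: prepend a zero, cumulative-sum, then cast.
ref = torch.cat(
    [
        torch.zeros((1,), device=input_lengths.device, dtype=input_lengths.dtype),
        input_lengths.cumsum(dim=-1),
    ]
).to(dtype=torch.int32)

# Suggested construction: write the cumulative sum into a preallocated buffer.
cu_seqlen_k = torch.empty(
    input_lengths.size(-1) + 1, device=input_lengths.device, dtype=torch.int32
)
cu_seqlen_k[0] = 0
torch.cumsum(input_lengths, -1, out=cu_seqlen_k[1:])

assert torch.equal(ref, cu_seqlen_k)  # both give [0, 3, 8, 10]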
@@ -368,6 +373,23 @@ def forward(
        cos, sin = self.layers[0].self_attn.rotary_emb.get_cos_sin(
            position_ids, max_s, hidden_states.dtype
        )
        if cu_seqlen_prefill is None and FLASH_DECODING:
Maybe we could do this in the paged_attention function? Then it has a non-ambiguous signature and we don't have to add this to all the models.
The problem is that the tensor creation adds too much overhead.
I did it that way initially, and the performance was worse than raw paged attention just because of that.
We could also maybe do it all the way up in flash_causal_lm. That was my next best idea, but I don't like obfuscating tensor content, since then the tensors might be either cu_seqlen_q and cu_seqlen_k, or None and input_lengths (we could dataclass stuff and do all sorts of shenanigans, but it still feels like obfuscation).
Given the totally optional nature of flash decoding for now, I'm OK if this lives in this particular modeling code while we test, and we either roll back or finish the work and put everything into causal_lm once there's only one format (the biggest drawback will be AMD and Intel, which do not support FA2 with paged attention AFAIK).
@@ -32,7 +40,8 @@ def paged_attention(
    kv_head_mapping: torch.Tensor,
    softmax_scale: float,
    block_tables: torch.Tensor,
    input_lengths: torch.Tensor,
    cu_seqlen_q: torch.Tensor,
Suggested change:
-    cu_seqlen_q: torch.Tensor,
+    cu_seqlen_q: Optional[torch.Tensor],
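For illustration, a minimal sketch of what accepting an Optional cu_seqlen_q could look like inside paged_attention. The argument list is abbreviated and hypothetical (not the PR's actual signature), and note that this is exactly the per-call tensor creation the author found too costly above:

from typing import Optional

import torch

def paged_attention(
    query: torch.Tensor,
    kv_head_mapping: torch.Tensor,
    softmax_scale: float,
    block_tables: torch.Tensor,
    cu_seqlen_q: Optional[torch.Tensor],
    cu_seqlen_k: torch.Tensor,
    max_s: int,
):
    if cu_seqlen_q is None:
        # Decode-only batch: exactly one query token per sequence, so the
        # cumulative query lengths are simply 0, 1, ..., batch_size.
        batch_size = cu_seqlen_k.shape[0] - 1
        cu_seqlen_q = torch.arange(
            batch_size + 1, device=query.device, dtype=torch.int32
        )
    # ... dispatch to the flash decoding / paged attention kernel here ...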
@@ -253,6 +253,7 @@ def forward(
        self.kv_head_mapping,
        self.softmax_scale,
        block_tables,
        None,
Breaks when flash decoding is enabled?
…ttention kernel. (#1940)

* Using flash decoding: conditional flashdecoding; fix max_q; working kvcache; working version with flash decoding; make it work for Mistral; fix after rebase; less intrusive; revert changes in modeling; speed up flashdecoding; hack to make other models work; fix the non-flash-decoding llama path; router logic knows about page size; missing 2 models; missing Cohere; fix Cohere flash decoding; revamped all this architecture; fix Cohere; fix Falcon; enable custom block size schedule; update router/src/infer.rs; not sending preallocated output.
* Making it work on non flash decoding.
* Fix Cohere.
* Fix non decoding paths.
* Rebased.
* No need for cache_manager anymore.
* Update?
* "ipex" -> "cpu"
* These do not belong.
* Factoring cu_seqlen_qk for better abstracting over every model.
* Fixing non flash tests/imports.
* Changing return everywhere.
* Update mistral past.
* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).
* Fixup mistral clamping (had issues with cuda graphs).
* No need to recreate anything actually.
What does this PR do?
This PR proposes a long-standing change, which is to move towards using FlashDecoding instead of PagedAttention.
FlashDecoding defines its signature as (query, cu_seqlen_q) + (kv, cu_seqlen_kv) + block_tables (to simplify).
This means we can merge prefill and decodes in a single attention pass, but most importantly we can have huge query lengths at query time. With the current paged kernels,
there is a hard assumption that Q length = 1. For Medusa speculation, we're currently faking it by duplicating "queries" in the query slots and adjusting input_lengths and slots.
The longer the query, the more wasteful this is (which is OK for small sizes).
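To make that signature concrete, here is a small standalone sketch using flash-attn's varlen entry point. It is illustrative only: the block_tables / paged-KV part is omitted, the lengths and shapes are made up, and it assumes the flash-attn (FA2) package plus a CUDA device.

import torch
import torch.nn.functional as F
from flash_attn import flash_attn_varlen_func  # assumes flash-attn (FA2) is installed

num_heads, head_size = 8, 64
q_lens = torch.tensor([4, 1, 2], dtype=torch.int32)    # e.g. speculative tokens per sequence
kv_lens = torch.tensor([128, 37, 9], dtype=torch.int32)

# Cumulative lengths [0, l0, l0+l1, ...], int32 on the GPU.
cu_seqlen_q = F.pad(q_lens.cumsum(0, dtype=torch.int32), (1, 0)).cuda()
cu_seqlen_kv = F.pad(kv_lens.cumsum(0, dtype=torch.int32), (1, 0)).cuda()

# All sequences packed along the first ("total tokens") dimension.
q = torch.randn(int(q_lens.sum()), num_heads, head_size, dtype=torch.float16, device="cuda")
k = torch.randn(int(kv_lens.sum()), num_heads, head_size, dtype=torch.float16, device="cuda")
v = torch.randn_like(k)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlen_q,
    cu_seqlens_k=cu_seqlen_kv,
    max_seqlen_q=int(q_lens.max()),
    max_seqlen_k=int(kv_lens.max()),
    causal=True,  # mask is aligned bottom-right, i.e. decode-with-cache semantics
)

The point is simply that the per-sequence query length is arbitrary here, which is what makes merged prefill + decode and Medusa-style speculation natural.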
With FlashDecoding the expected upsides are:
Current takeaways:
Why not FlashInfer (or others)?
The KV-cache API is quite different from FD's: https://docs.flashinfer.ai/api/python/prefill.html#batch-prefill-append-attention
It requires some scratch buffers (of unclear size) and keeping hold of them.
It requires very different bookkeeping, splitting the pages and last_page_indices separately (meaning a lot more changes to get it to work).
Since the layout is the same as FD's, it can be explored if the performance is there (the API would allow similar features).
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.