[Feature] Add initial support for sequence parallelism #1436
base: main
Conversation
Force-pushed from c263cb3 to 71c8afe
From the code I see, in the prefill stage after attention the output shape is [padded_total_num_tokens, q_head_num // SP_SIZE, head_dim], which then goes through RowSeqParallelLinear and therefore needs an all-reduce. The input of qkv_proj_linear is [padded_total_num_tokens, q_head_num, head_dim], which is not split by sp_size. I want to know why ring attention is not used here; ring attention seems better in both computation and communication.
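A minimal, single-process shape sketch of the flow described above (the names `hidden`, `attn_out`, and `w_o_shard`, and all sizes, are illustrative assumptions rather than the PR's actual API; the all-reduce is left as a comment since this runs on one rank):

```python
import torch

SP_SIZE = 2
q_head_num, head_dim = 32, 128
hidden_size = q_head_num * head_dim
padded_total_num_tokens = 4096

# Input to the QKV projection: the token dimension is NOT split by SP_SIZE.
hidden = torch.randn(padded_total_num_tokens, hidden_size)

# Attention output on one SP worker: only its shard of the q heads.
attn_out = torch.randn(padded_total_num_tokens, q_head_num // SP_SIZE, head_dim)

# Row-parallel output projection: each worker holds a slice of W_o, produces a
# full-hidden-size partial result, and the partials are summed across the SP
# group with an all-reduce.
w_o_shard = torch.randn((q_head_num // SP_SIZE) * head_dim, hidden_size)
partial_out = attn_out.reshape(padded_total_num_tokens, -1) @ w_o_shard
# torch.distributed.all_reduce(partial_out)  # sum over SP workers in a real multi-rank run
print(partial_out.shape)  # torch.Size([4096, 4096])
```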
For each SP worker, we have either (1) QKV of entire sequences:
    q tensor: [padded_total_num_tokens, q_head_num // SP_SIZE, head_dim]
    k tensor: [padded_total_num_tokens, k_head_num, head_dim]
    v tensor: [padded_total_num_tokens, v_head_num, head_dim]
Or (2) Q of entire sequences and KV of the current SP shard:
    q tensor: [padded_total_num_tokens, q_head_num // SP_SIZE, head_dim]
    k tensor: [padded_sp_shard_num_tokens, k_head_num, head_dim]
    v tensor: [padded_sp_shard_num_tokens, v_head_num, head_dim]

Case (1) saves cross-SP-worker communication, while case (2) saves the computation
to get K and V for the entire sequences but needs extra computation in SP attn.
"""
Case (2) seems to be able to split the workload and overlap even with a single query. Just curious, does anyone have opinions on TreeAttention (all-reducing the LSE instead of sending KV), which seems optimized for decoding?
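For reference, a minimal single-process sketch of the LSE-based merge that a TreeAttention-style scheme relies on: each SP worker computes attention over its own KV shard, producing a partial output and a per-query log-sum-exp, and the partials are combined using the global LSE instead of moving K/V around. Shapes and names below are illustrative assumptions.

```python
import torch

num_queries, num_heads, head_dim, num_shards = 16, 4, 64, 2

# Per-shard partial outputs and LSEs (would come from each SP worker's
# attention over its local KV shard).
partial_out = torch.randn(num_shards, num_queries, num_heads, head_dim)
partial_lse = torch.randn(num_shards, num_queries, num_heads)

# Merge: o = sum_i exp(lse_i - global_lse) * o_i. In a distributed setting this
# reduces to an all-reduce over (weighted output, lse) rather than sending KV.
global_lse = torch.logsumexp(partial_lse, dim=0)           # [Q, H]
weights = torch.exp(partial_lse - global_lse)              # [S, Q, H]
merged = (weights.unsqueeze(-1) * partial_out).sum(dim=0)  # [Q, H, D]
print(merged.shape)
```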
# TODO: in fact we can use all-to-all to gather the output and state here
# to collect only q head shards that are needed by the current SP worker.
# All-to-all will save communication and `merge_state` computation.
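A hedged sketch of the communication-volume argument in this TODO (names and sizes are assumptions): with all-gather, every worker ends up holding partial outputs for all q heads from every SP shard and must merge all of them, while with all-to-all each worker receives partials only for its own q-head shard.

```python
import torch

SP_SIZE = 4
num_tokens, q_head_num, head_dim = 1024, 32, 128
heads_per_worker = q_head_num // SP_SIZE

# Each worker holds partial attention outputs for all q heads, computed
# against its own KV shard.
local_partial = torch.randn(q_head_num, num_tokens, head_dim)

# all-gather: every worker ends up with all-heads partials from all shards,
# then runs the merge over every head.
allgather_elems = SP_SIZE * local_partial.numel()

# all-to-all: each worker receives one partial per SP shard, but only for the
# heads it owns, and only merges those.
alltoall_elems = SP_SIZE * heads_per_worker * num_tokens * head_dim

print(f"all-gather elems per worker: {allgather_elems}")
print(f"all-to-all elems per worker: {alltoall_elems} "
      f"({alltoall_elems / allgather_elems:.0%} of all-gather)")
```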
Later all-reduce in ColumnSeqParallelLinear? Thanks.
@ZYHowell @ivanium Moved from #1041