Add KVCache trans for long sequence && tuned comm for faster Addreduce #279

abenmao · 2024-03-28T11:37:00Z

Transpose the KV cache for long sequences; enabled with the environment variable "ENABLE_KV_TRANS" (off by default).
Tune the add-reduce between shm and ccl for faster communication.

workload	baseline-2s	opt-2s
llama-2-13b bs24 in512 bf16 (SPR+HBM)	350ms	90ms
llama-2-70b bs32 in4096 bf16 (SPR)	1300ms	640ms

pujiang2018 · 2024-03-29T02:06:59Z

src/common/kvcache_tensor.h

+
 /**
 * Tensor specially designed for KV Cache
 * Naturally, it could be represented in the shape of [seq_length][batch_size][head_num][head_size]


Please also modify the comments here?

Done. The new layout is disabled by default.
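For readers following this thread, here is a rough sketch of why a transposed layout can help with long sequences. The first function uses the shape quoted in the comment above, [seq_length][batch_size][head_num][head_size]. The second assumes a transposed shape of [batch_size][head_num][seq_length][head_size]; that shape, the `KVShape` struct, and both function names are illustrative assumptions and are not taken from the PR.

```cpp
#include <cstddef>

// Hypothetical shape descriptor, for illustration only.
struct KVShape {
    std::size_t seqLen;
    std::size_t batchSize;
    std::size_t headNum;
    std::size_t headSize;
};

// Element offset of the head vector at (seq, batch, head) in the layout
// quoted in the review: [seq_length][batch_size][head_num][head_size].
inline std::size_t offsetSeqMajor(const KVShape &s, std::size_t seq, std::size_t b, std::size_t h) {
    return ((seq * s.batchSize + b) * s.headNum + h) * s.headSize;
}

// Element offset in an ASSUMED transposed layout:
// [batch_size][head_num][seq_length][head_size].
inline std::size_t offsetHeadMajor(const KVShape &s, std::size_t seq, std::size_t b, std::size_t h) {
    return ((b * s.headNum + h) * s.seqLen + seq) * s.headSize;
}
```

Under these assumptions, stepping from one sequence position to the next for a fixed batch and head moves by batchSize * headNum * headSize elements in the first layout, but only by headSize elements in the second, so attention over one head reads a contiguous block.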

pujiang2018 · 2024-03-29T02:10:40Z

src/models/env_config.cpp

+    static int kvTrans = -1;
+    if (kvTrans == -1) {
+        kvTrans = (getenv("ENABLE_KV_TRANS") ? atoi(getenv("ENABLE_KV_TRANS")) : 0);
+        // if (kvTrans == 1)


Please remove it if it is not used.

Added some comments.
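As a side note on the pattern in this diff, a read-once environment flag can be sketched as below. Only the variable name "ENABLE_KV_TRANS" and its default of 0 come from the diff; the helper `envFlagEnabled`, the accessor `kvTransEnabled`, and the choice to treat only the value 1 as enabled are assumptions made for illustration, not the PR's actual code.

```cpp
#include <cstdlib>

// Hypothetical helper: returns true only when the variable is set to 1.
// An unset variable yields the supplied default.
inline bool envFlagEnabled(const char *name, bool defaultValue = false) {
    const char *value = std::getenv(name);
    if (value == nullptr) return defaultValue;
    return std::atoi(value) == 1;
}

// Hypothetical accessor: the environment is read on the first call and the
// result is cached, so later changes to the variable are not observed.
inline bool kvTransEnabled() {
    static const bool enabled = envFlagEnabled("ENABLE_KV_TRANS");
    return enabled;
}
```

A function-local static is initialized once in a thread-safe way since C++11, which avoids the -1 sentinel and calls getenv a single time per lookup.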

pujiang2018 · 2024-03-29T02:21:39Z

src/models/env_config.cpp

    return catMlp == 1;
 }

+bool tunedComm() {


It is not so easy to understand "Tuned communication". Could you add a comment?


abenmao force-pushed the perf/comm/kvcache branch from 4cfb6d9 to 99731f4 Compare March 28, 2024 12:42

abenmao changed the title ~~Add KVCache for long sequence && tuned comm for faster Addreduce~~ Add KVCache trans for long sequence && tuned comm for faster Addreduce Mar 28, 2024

pujiang2018 reviewed Mar 29, 2024


abenmao force-pushed the perf/comm/kvcache branch 2 times, most recently from 9dd2a0f to fc3e8ff Compare March 29, 2024 13:55

Add KVCache transpose for long sequence && tuned comm for faster Addreduce

5da8280

abenmao force-pushed the perf/comm/kvcache branch from fc3e8ff to 5da8280 Compare March 30, 2024 04:25

pujiang2018 approved these changes Apr 1, 2024


pujiang2018 merged commit bb83063 into intel:main Apr 1, 2024
