[V1] Optimize block table transfer from CPU to GPU #11401

WoosukKwon · 2024-12-22T01:09:13Z

Currently, transferring the block table from CPU to GPU can be expensive because we send the entire block table ([batch_size, max_model_len // block_size]) every step. This PR reduces the overhead by sending only the diffs from CPU to GPU, which are typically very small.

The solution in this PR relies on CUDA unified virtual addressing (UVA), so it may not work in some environments. In such cases, we fall back to the original implementation (copying the entire block table tensor).
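To illustrate the idea of sending only diffs, here is a minimal sketch in plain Python. It is not the PR's implementation: the class `BlockTableSketch`, its methods, and the use of nested lists as stand-ins for the CPU and GPU tensors are all assumptions made for illustration. Each row records at most one pending `(start, length)` append per step, and only those ranges are copied.

```python
# Hypothetical sketch: names and data layout are illustrative, not the PR's API.
class BlockTableSketch:
    def __init__(self, max_num_reqs, max_num_blocks_per_req):
        self.num_rows = max_num_reqs
        self.num_cols = max_num_blocks_per_req
        # Nested lists stand in for the CPU and GPU copies of the block table.
        self.cpu = [[0] * self.num_cols for _ in range(self.num_rows)]
        self.gpu = [[0] * self.num_cols for _ in range(self.num_rows)]
        # Per-row pending diff: [start column, number of appended blocks].
        self.diff = [[0, 0] for _ in range(self.num_rows)]

    def append_row(self, row_idx, start, block_ids):
        num_blocks = len(block_ids)
        assert start + num_blocks <= self.num_cols
        # Only one pending append per row is tracked between two apply_diff calls.
        assert self.diff[row_idx][1] == 0
        self.cpu[row_idx][start:start + num_blocks] = block_ids
        self.diff[row_idx] = [start, num_blocks]

    def apply_diff(self):
        """Copy only the changed ranges; return the number of entries copied."""
        copied = 0
        for row_idx, (start, length) in enumerate(self.diff):
            if length == 0:
                continue
            end = start + length
            self.gpu[row_idx][start:end] = self.cpu[row_idx][start:end]
            copied += length
            self.diff[row_idx] = [0, 0]
        return copied


table = BlockTableSketch(max_num_reqs=4, max_num_blocks_per_req=8)
table.append_row(0, 0, [5, 6, 7])   # a new request receives three blocks
first = table.apply_diff()
table.append_row(0, 3, [9])         # a later step appends one more block
table.append_row(2, 0, [1, 2])      # another request starts
second = table.apply_diff()
```

In this toy run each step copies 3 entries, whereas copying the whole table would move 32 entries per step.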

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

github-actions · 2024-12-22T01:09:25Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀


youkaichao · 2024-12-23T05:24:03Z

csrc/prepare_inputs/copy_subranges.cu

+  int* d_matrix_tgt = matrix_tgt.data_ptr<int>();
+
+  // One thread block per row.
+  int blocks = n;


It seems that launching one thread block per row can easily oversubscribe the GPU SMs.

youkaichao · 2024-12-23T05:25:21Z

csrc/prepare_inputs/copy_subranges.cu

+  int length = matrix_diff[row_id * 2 + 1];
+  int end = start + length;
+  int thread_idx = threadIdx.x;
+  for (int i = start + thread_idx; i < end; i += blockDim.x) {


Most threads in the block would be idle. For example, during decoding, only one entry in the block table changes, or even none.
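The concern can be made concrete by emulating the strided loop from the kernel snippet on the host. This is only a sketch in Python: the function name `simulate_row_copy` and the block size of 128 threads are assumptions, not values taken from the PR.

```python
# Hypothetical host-side emulation of the per-row strided loop:
#   for (int i = start + thread_idx; i < end; i += blockDim.x)
def simulate_row_copy(start, length, block_dim):
    """Return {thread_idx: [indices copied]} for threads that do any work."""
    end = start + length
    work = {}
    for thread_idx in range(block_dim):
        indices = list(range(start + thread_idx, end, block_dim))
        if indices:
            work[thread_idx] = indices
    return work


BLOCK_DIM = 128  # assumed block size, for illustration only

decode_step = simulate_row_copy(start=10, length=1, block_dim=BLOCK_DIM)
no_change = simulate_row_copy(start=10, length=0, block_dim=BLOCK_DIM)
long_prefill = simulate_row_copy(start=0, length=300, block_dim=BLOCK_DIM)

idle_in_decode = BLOCK_DIM - len(decode_step)
```

With a single changed entry, only thread 0 performs a copy and the remaining threads of that block fall through the loop without doing anything.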

youkaichao · 2024-12-23T05:49:20Z

vllm/v1/worker/gpu_block_table.py

+            self.block_table_diff_np[row_idx, 0] = start
+            # Move-and-append is not allowed.
+            assert self.block_table_diff_np[row_idx, 1] == 0
+            self.block_table_diff_np[row_idx, 1] = num_blocks


For the non-UVA case, we still need to keep track of the max-block-table-length, so that apply_diff only needs to copy max-block-table-length columns.

WoosukKwon · 2024-12-23 (edited)

Good point. The problem is that the memcpy API requires the data to be in a contiguous memory space: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79

So when the block table tensor has the shape [batch_size, max_model_len] and we slice over the second dimension, we have to call the memcpy API batch_size times instead of once.
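The contiguity argument can be checked with a little arithmetic on flat offsets. The following Python sketch assumes a row-major layout and uses a hypothetical helper, `contiguous_runs`, that lists the contiguous `(offset, length)` ranges covered when only the first `k` columns of every row are copied; each range would need its own plain contiguous copy.

```python
# Hypothetical helper: which contiguous flat ranges does table[:, :k] cover
# in a row-major [batch_size, num_cols] tensor?
def contiguous_runs(batch_size, num_cols, k):
    if not 0 <= k <= num_cols:
        raise ValueError("k must be between 0 and num_cols")
    runs = []
    for row in range(batch_size):
        if k == 0:
            continue
        offset = row * num_cols  # flat offset of this row's first element
        if runs and runs[-1][0] + runs[-1][1] == offset:
            # This row starts exactly where the previous run ends: merge them.
            prev_offset, prev_len = runs[-1]
            runs[-1] = (prev_offset, prev_len + k)
        else:
            runs.append((offset, k))
    return runs


full_copy = contiguous_runs(batch_size=4, num_cols=8, k=8)
sliced_copy = contiguous_runs(batch_size=4, num_cols=8, k=3)
```

Copying all columns collapses into a single contiguous range, while slicing to 3 of 8 columns leaves one separate range per row, which matches the batch_size copies described above.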


tlrmchlsmth · 2024-12-29T18:43:14Z

csrc/prepare_inputs/copy_subranges.cu

+  int end = start + length;
+  int thread_idx = threadIdx.x;
+  for (int i = start + thread_idx; i < end; i += blockDim.x) {
+    int idx = row_offset + i;


Should row_offset and idx be int64_t? I.e. could they overflow an int32?
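A quick way to reason about this question is to compute the largest flat index the kernel could touch and compare it with the int32 range. The sketch below is in Python and assumes that `row_offset` is computed as `row_id * num_cols`, which the quoted snippet does not show; the helper names and the example shapes are hypothetical.

```python
INT32_MAX = 2**31 - 1


def max_flat_index(num_rows, num_cols):
    # Largest idx = row_offset + i, assuming row_offset = row_id * num_cols.
    return (num_rows - 1) * num_cols + (num_cols - 1)


def fits_int32(num_rows, num_cols):
    return max_flat_index(num_rows, num_cols) <= INT32_MAX


def wrap_int32(value):
    # Two's-complement wraparound, shown only for illustration: signed
    # overflow is undefined behavior in C++, so this result is not guaranteed.
    return (value + 2**31) % 2**32 - 2**31


small_table_ok = fits_int32(num_rows=1024, num_cols=8192)
boundary_ok = fits_int32(num_rows=65536, num_cols=32768)
too_large_ok = fits_int32(num_rows=65537, num_cols=32768)
wrapped = wrap_int32(max_flat_index(65537, 32768))
```

Under this assumption, a 32-bit index is safe as long as num_rows * num_cols does not exceed 2**31; beyond that, a 64-bit type such as int64_t avoids the problem.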


mergify · 2025-01-15T11:23:12Z

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @WoosukKwon.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

wip

1aaced5

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

mergify bot added the ci/build label Dec 22, 2024

WoosukKwon added 3 commits December 21, 2024 17:11

yapf

8a4180c

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Minor

03b1e6f

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Minor

0a669ee

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

youkaichao reviewed Dec 23, 2024


WoosukKwon added 9 commits December 22, 2024 22:16

Use default

ee965c9

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Merge branch 'main' into v1-blocktable-opt

0420fb2

comments

3fdbd8e

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Merge branch 'main' into v1-blocktable-opt

b938606

Merge branch 'main' into v1-blocktable-opt

ff5b103

Minor

bef6816

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Add test for uva

5292219

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

minor

ca4f9e6

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Add kernel test

27e8eb2

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

WoosukKwon marked this pull request as ready for review December 26, 2024 20:01

WoosukKwon requested review from alexm-redhat, comaniac, njhill, robertgshaw2-redhat, tlrmchlsmth and ywang96 as code owners December 26, 2024 20:01

WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) Dec 26, 2024

WoosukKwon added 3 commits December 26, 2024 18:52

Merge branch 'main' into v1-blocktable-opt

34d6cc2

Minor

6ba31aa

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

ruff

ebfbe12

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

tlrmchlsmth reviewed Dec 29, 2024


WoosukKwon marked this pull request as draft December 31, 2024 05:37

WoosukKwon added 4 commits January 1, 2025 03:10

Merge branch 'main' into v1-blocktable-opt

a6e5d7b

Minor

1260e43

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Minor

ba64a02

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Fix

1ca4298

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

WoosukKwon mentioned this pull request Jan 2, 2025

[V1] Add BlockTable class #11693

Merged

WoosukKwon added 2 commits January 15, 2025 03:07

fix

f840b53

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

test

7097f31

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

mergify bot added the needs-rebase label Jan 15, 2025

WoosukKwon mentioned this pull request Jan 15, 2025

[V1] Optimize block table copy from CPU to GPU (take 2) #12078

Closed

mergify bot added the v1 label Feb 5, 2025

hmellor closed this Aug 11, 2025

WoosukKwon deleted the v1-blocktable-opt branch November 23, 2025 04:15
