
Fix auto prefix bug #3239

Merged: 4 commits merged into vllm-project:main on Mar 8, 2024

Conversation

ElizaWszola (Contributor):

Resolves #3193

Fixes a bug that occurs when the entire prompt has already been computed. If the entire prompt is marked as computed, the model runner attempts to create zero-sized tensors for the non-computed prompt tokens, which throws an exception in torch.arange.

This fix is somewhat of a band-aid because it just makes sure that the last block is never considered computed. This will result in a small amount of unnecessary computation when running the same prompt multiple times.

A more comprehensive fix would update the model runner to behave correctly when it detects that all blocks coming in have already been computed. This can be done in a subsequent change.
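For illustration only, here is a minimal sketch of the band-aid on the block-manager side, assuming a hypothetical `get_computed_blocks` helper and a `block_is_cached` predicate (neither name is taken from the actual vLLM code): the reusable prefix of cached blocks is collected, but the final block is always skipped.

```python
def get_computed_blocks(block_table, block_is_cached):
    """Return the reusable prefix of cached blocks, excluding the last block.

    Hypothetical sketch of the band-aid described above; the names here do
    not correspond to the actual vLLM implementation.
    """
    computed = []
    for block in block_table[:-1]:  # the final block is never treated as computed
        if not block_is_cached(block):
            break  # computed blocks must form a contiguous prefix
        computed.append(block)
    return computed


# Even a fully cached two-block prompt keeps its last block "uncomputed":
print(get_computed_blocks(["block0", "block1"], lambda b: True))  # ['block0']
```

Because the last block is always skipped, the model runner receives at least one block of prompt tokens to compute, so the zero-sized tensor path described above is never reached.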

@robertgshaw2-neuralmagic (Collaborator):

@ElizaWszola, note that yapf is failing.

Comment on lines 445 to 447
# TODO We exclude the last block to avoid the case where the entire
# prompt is cached. This would currently cause erroneous behavior in
# worker.
Collaborator:

This seems more like a NOTE rather than a TODO. Is there future work to consider here, such as allowing workers to have zero work?

ElizaWszola (Contributor, Author):

Yes, ideally the workers should be rewritten to handle zero workloads in the future. The current implementation has the small drawback that the last block of a sequence is never read from the cache, even when it is there.

Member:

Let's change this to a NOTE. Even if the whole prefix has been computed before, we still need to recompute the last token to sample the first output token. This is unavoidable.
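For intuition, a small arithmetic sketch (the block size and prompt length below are illustrative assumptions, not values from the PR): excluding the last block guarantees a non-empty range of prompt tokens for the model runner, and that range always contains the final prompt token needed to sample the first output token.

```python
BLOCK_SIZE = 16       # illustrative assumption, not taken from the PR
prompt_len = 32       # a prompt whose blocks are all already in the cache

num_blocks = (prompt_len + BLOCK_SIZE - 1) // BLOCK_SIZE  # 2 full blocks
computed_blocks = max(num_blocks - 1, 0)                  # last block excluded -> 1
computed_len = computed_blocks * BLOCK_SIZE               # 16 tokens reused from cache
tokens_to_compute = prompt_len - computed_len             # 16 tokens, never zero

assert tokens_to_compute > 0  # the zero-sized tensor case can no longer occur
print(computed_len, tokens_to_compute)                    # 16 16
```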

@zhuohan123 (Member) left a comment:

The fix LGTM! Is there any performance impact with this fix?

@ElizaWszola (Contributor, Author):

@zhuohan123 I ran benchmark_throughput.py with and without auto prefix caching enabled, and it looks like the code is slightly faster after the fix:

| Configuration | Throughput (requests/s) | Throughput (tokens/s) |
|---|---|---|
| old version, with auto prefix caching | 8.61 | 4166.47 |
| new version, with auto prefix caching | 8.92 | 4317.72 |
| old version, without auto prefix caching | 8.98 | 4347.19 |
| new version, without auto prefix caching | 9.10 | 4401.76 |
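For context, here is a minimal offline sketch of the kind of setup being benchmarked. The model name is an arbitrary placeholder, and the `enable_prefix_caching` flag is assumed to be the switch that turns on automatic prefix caching in the version under test.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: the model name is a placeholder and enable_prefix_caching
# is assumed to enable automatic prefix caching in this vLLM version.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=32)

# Sending the same prompt twice exercises the fully-cached-prompt path that
# this PR fixes: the second request reuses all cached blocks but the last one.
prompt = "The quick brown fox jumps over the lazy dog. " * 8
outputs = llm.generate([prompt, prompt], params)
for out in outputs:
    print(out.outputs[0].text)
```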

zhuohan123 merged commit b35cc93 into vllm-project:main on Mar 8, 2024
23 checks passed
dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request Mar 26, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Successfully merging this pull request may close these issues: Automatic Prefix Caching Bug (#3193).

4 participants