
Fix auto prefix bug #3239

Merged: 4 commits merged into vllm-project:main on Mar 8, 2024

Conversation

ElizaWszola (Contributor):

Resolves #3193

Fixes a bug that occurs when the entire prompt has already been computed. If the entire prompt is marked as computed, the model runner attempts to create zero-sized tensors for the non-computed prompt tokens, which throws an exception in torch.arange.

This fix is somewhat of a band-aid because it just makes sure that the last block is never considered computed. This will result in a small amount of unnecessary computation when running the same prompt multiple times.

A more comprehensive fix would update the model runner to behave correctly when it detects that all blocks coming in have already been computed. This can be done in a subsequent change.
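For illustration only, here is a minimal sketch of the band-aid on the block-manager side, assuming a hypothetical `get_computed_blocks` helper and a `block_is_cached` predicate (neither name is taken from the actual vLLM code): the reusable prefix of cached blocks is collected, but the final block is always skipped.

```python
def get_computed_blocks(block_table, block_is_cached):
    """Return the reusable prefix of cached blocks, excluding the last block.

    Hypothetical sketch of the band-aid described above; the names here do
    not correspond to the actual vLLM implementation.
    """
    computed = []
    for block in block_table[:-1]:  # the final block is never treated as computed
        if not block_is_cached(block):
            break  # computed blocks must form a contiguous prefix
        computed.append(block)
    return computed


# Even a fully cached two-block prompt keeps its last block "uncomputed":
print(get_computed_blocks(["block0", "block1"], lambda b: True))  # ['block0']
```

Because the last block is always skipped, the model runner receives at least one block of prompt tokens to compute, so the zero-sized tensor path described above is never reached.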

@robertgshaw2-neuralmagic (Collaborator):

@ElizaWszola, note that yapf is failing.

Comment on lines 445 to 447
# TODO We exclude the last block to avoid the case where the entire
# prompt is cached. This would currently cause erroneous behavior in
# worker.
Collaborator:

This seems more like a NOTE rather than a TODO. Is there future work to consider here, such as allowing workers to have zero work?

ElizaWszola (Contributor, Author):

Yes, ideally the workers should be rewritten to handle zero workloads in the future. The current implementation has the small drawback that the last block of a sequence is never read from the cache, even when it is there.

Member:

Let's change this to a NOTE. Even if the whole prefix has been computed before, we still need to recompute the last token to sample the first output token. This is unavoidable.
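For intuition, a small arithmetic sketch (the block size and prompt length below are illustrative assumptions, not values from the PR): excluding the last block guarantees a non-empty range of prompt tokens for the model runner, and that range always contains the final prompt token needed to sample the first output token.

```python
BLOCK_SIZE = 16       # illustrative assumption, not taken from the PR
prompt_len = 32       # a prompt whose blocks are all already in the cache

num_blocks = (prompt_len + BLOCK_SIZE - 1) // BLOCK_SIZE  # 2 full blocks
computed_blocks = max(num_blocks - 1, 0)                  # last block excluded -> 1
computed_len = computed_blocks * BLOCK_SIZE               # 16 tokens reused from cache
tokens_to_compute = prompt_len - computed_len             # 16 tokens, never zero

assert tokens_to_compute > 0  # the zero-sized tensor case can no longer occur
print(computed_len, tokens_to_compute)                    # 16 16
```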

@zhuohan123 (Member) left a comment:

The fix LGTM! Is there any performance impact with this fix?

@ElizaWszola (Contributor, Author):

@zhuohan123 I ran benchmark_throughput.py with and without auto prefix caching enabled, and it looks like the code is slightly faster after the fix:

| Configuration | Throughput (requests/s) | Throughput (tokens/s) |
|---|---|---|
| old version, with auto prefix caching | 8.61 | 4166.47 |
| new version, with auto prefix caching | 8.92 | 4317.72 |
| old version, without auto prefix caching | 8.98 | 4347.19 |
| new version, without auto prefix caching | 9.10 | 4401.76 |
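For context, here is a minimal offline sketch of the kind of setup being benchmarked. The model name is an arbitrary placeholder, and the `enable_prefix_caching` flag is assumed to be the switch that turns on automatic prefix caching in the version under test.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: the model name is a placeholder and enable_prefix_caching
# is assumed to enable automatic prefix caching in this vLLM version.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=32)

# Sending the same prompt twice exercises the fully-cached-prompt path that
# this PR fixes: the second request reuses all cached blocks but the last one.
prompt = "The quick brown fox jumps over the lazy dog. " * 8
outputs = llm.generate([prompt, prompt], params)
for out in outputs:
    print(out.outputs[0].text)
```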

zhuohan123 merged commit b35cc93 into vllm-project:main on Mar 8, 2024
23 checks passed
dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request Mar 26, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Successfully merging this pull request may close these issues: Automatic Prefix Caching Bug (#3193).

4 participants