Fix auto prefix bug #3239
Conversation
@ElizaWszola, note that yapf is failing.
vllm/core/block_manager.py
# TODO We exclude the last block to avoid the case where the entire
# prompt is cached. This would currently cause erroneous behavior in
# worker.
This seems more like a NOTE than a TODO. Is there future work to consider here, such as allowing workers to have zero work?
Yes, ideally the workers should be rewritten to handle zero workloads in the future. The current implementation has a small drawback: it cannot read the last block of a sequence from the cache even when it is there.
Let's change this to a NOTE. Even if the whole prefix has been computed before, we still need to recompute the last token in order to sample the first output token. This is unavoidable.
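To make the reasoning concrete, here is a minimal sketch of that constraint. The names (`BLOCK_SIZE`, `num_tokens_to_compute`) are illustrative, not vLLM's actual API: however much of the prefix is cached, at least the last prompt token must still go through a forward pass so the first output token can be sampled.

```python
# Hypothetical sketch: cap the cached region so it never covers the whole prompt.
BLOCK_SIZE = 16  # illustrative block size, not tied to any particular config


def num_tokens_to_compute(prompt_len: int, num_cached_blocks: int) -> int:
    # Even with a fully cached prefix, leave at least one token to compute.
    cached_tokens = min(num_cached_blocks * BLOCK_SIZE, prompt_len - 1)
    return prompt_len - cached_tokens


# Example: a 32-token prompt whose two blocks are both cached still needs
# one token of computation to produce the first output token.
assert num_tokens_to_compute(32, 2) == 1
```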
The fix LGTM! Is there any performance impact with this fix?
@zhuohan123 I ran the old version with auto prefix caching, the new version with auto prefix caching, the old version without auto prefix caching, and the new version without auto prefix caching.
Resolves #3193
Fixes a bug that occurs when the entire prompt has already been computed. If the entire prompt is marked as computed, the model runner will attempt to create zero-sized tensors for the non-computed prompt tokens, which throws an exception in torch.arange.
This fix is somewhat of a band-aid: it simply ensures that the last block is never marked as computed (see the sketch below). This results in a small amount of unnecessary computation when running the same prompt multiple times.
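A minimal sketch of the band-aid approach, using a hypothetical `Block` class rather than the actual vLLM block manager code: all blocks of a sequence are marked as computed except the last one, so the model runner always sees at least one uncomputed block.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical stand-in for a physical KV-cache block.
    block_number: int
    computed: bool = False


def mark_blocks_as_computed(block_table: list[Block]) -> None:
    # Exclude the last block so the entire prompt is never treated as cached.
    for block in block_table[:-1]:
        block.computed = True


# Example: a 3-block prompt ends up as [computed, computed, not computed].
table = [Block(i) for i in range(3)]
mark_blocks_as_computed(table)
assert [b.computed for b in table] == [True, True, False]
```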
A more comprehensive fix would update the model runner to behave correctly when it detects that all blocks coming in have already been computed. This can be done in a subsequent change.