[V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths #13095
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
…prefix-prefill-refactor
Signed-off-by: Sage Moore <sage@neuralmagic.com>
```diff
- cur_batch_ctx_len = tl.load(B_Ctxlen + cur_batch)
  cur_batch_seq_len = tl.load(B_Seqlen + cur_batch)
  cur_batch_in_all_start_index = tl.load(B_Start_Loc + cur_batch)
- cur_batch_query_len = cur_batch_seq_len - cur_batch_ctx_len
+ cur_batch_in_all_stop_index = tl.load(B_Start_Loc + cur_batch + 1)
+ cur_batch_query_len = (cur_batch_in_all_stop_index -
+                        cur_batch_in_all_start_index)
+ cur_batch_ctx_len = cur_batch_seq_len - cur_batch_query_len
```
Do you think it would make sense to factor some of this out, since there's some repetition? I'm suggesting this, but I can't immediately tell whether it would actually improve things.
This might look better in the follow-up PR: #13305
LGTM, thanks!
… longer has to pass in the context lengths (vllm-project#13095)
… longer has to pass in the context lengths (vllm-project#13095) Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
This patch changes the prefix_prefill kernel so that it calculates the context length from the query length and the sequence length, both of which are already passed in. This makes the kernel more usable in V1, where we don't keep track of the context lengths tensor in the attention metadata.
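To illustrate the arithmetic behind the refactor, here is a hypothetical plain-Python sketch (not the actual Triton kernel): the per-batch query length falls out of two consecutive entries of the cumulative start-location tensor, and the context length is then the sequence length minus the query length. The function name and list-based inputs are illustrative only.

```python
def derive_lengths(b_start_loc, b_seqlen):
    """For each batch, recover query_len from consecutive start
    locations and ctx_len as seq_len - query_len, mirroring the
    refactored kernel's computation (illustrative sketch)."""
    query_lens, ctx_lens = [], []
    for cur_batch, seq_len in enumerate(b_seqlen):
        start = b_start_loc[cur_batch]       # where this batch's query begins
        stop = b_start_loc[cur_batch + 1]    # where the next batch's begins
        query_len = stop - start             # tokens in this batch's query
        ctx_len = seq_len - query_len        # previously cached context
        query_lens.append(query_len)
        ctx_lens.append(ctx_len)
    return query_lens, ctx_lens

# Two sequences with queries of length 3 and 5 and total sequence
# lengths 7 and 9 yield context lengths 4 and 4.
print(derive_lengths([0, 3, 8], [7, 9]))  # ([3, 5], [4, 4])
```

This is why B_Ctxlen no longer needs to be passed in: everything it carried is already implied by B_Start_Loc and B_Seqlen.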