
Conversation

@orozery
Contributor

@orozery orozery commented Jun 17, 2025

This PR moves the request block hashes from the KVCacheManager to the Request object itself.
In particular, this will allow connectors to access the request block hashes.

Part of the work described in RFC #19854
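
To illustrate the idea, here is a minimal, self-contained sketch (the class and field names below, and the simplified connector signature, are illustrative assumptions rather than the actual diff): the request carries its own `block_hashes` list, which a KV connector can read directly instead of reaching into the KVCacheManager's state.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # illustrative block size


@dataclass
class Request:
    """Simplified stand-in for vllm.v1.request.Request."""
    request_id: str
    prompt_token_ids: list[int]
    # After this change the hashes live on the request itself rather than
    # in state owned by the KVCacheManager and keyed by request id.
    block_hashes: list[int] = field(default_factory=list)


class OffloadingConnector:
    """Toy connector that checks an external KV store for full-block hits."""

    def __init__(self, external_store: set[int]):
        self.external_store = external_store

    def get_num_new_matched_tokens(self, request: Request,
                                   num_computed_tokens: int) -> int:
        # The connector can walk request.block_hashes directly, stopping at
        # the first block that is not present in the external store.
        hits = 0
        for block_hash in request.block_hashes:
            if block_hash not in self.external_store:
                break
            hits += 1
        return max(0, hits * BLOCK_SIZE - num_computed_tokens)
```

The point of the sketch is only the access pattern: the connector no longer needs to go through the KVCacheManager to see the hashes of a request.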

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Jun 17, 2025
@mergify

mergify bot commented Jun 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 17, 2025
@orozery orozery force-pushed the request-block-hashes branch from 457540f to a34e6ac Compare June 17, 2025 06:36
@mergify mergify bot removed the needs-rebase label Jun 17, 2025
@mergify

mergify bot commented Jun 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 19, 2025
@orozery orozery force-pushed the request-block-hashes branch from a34e6ac to 15001b4 Compare June 19, 2025 09:19
@mergify mergify bot removed the needs-rebase label Jun 19, 2025
@orozery orozery mentioned this pull request Jun 19, 2025
1 task
@mergify

mergify bot commented Jun 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 25, 2025
@orozery orozery force-pushed the request-block-hashes branch from 15001b4 to 914876e Compare June 26, 2025 07:21
@mergify mergify bot removed the needs-rebase label Jun 26, 2025
Member

@njhill njhill left a comment


This looks reasonable to me, but it would be good to get approval from @heheda12345 and @WoosukKwon.

@njhill njhill requested a review from heheda12345 July 2, 2025 18:15
Collaborator

@heheda12345 heheda12345 left a comment


My major concern is that the block_hash list is updated here

block_hashes.append(block_hash)

It is quite far from the Request class. It makes the number of elements in the list appear magic to connectors, and it hurts the modularity of the code.
Therefore, if you want to put it into the Request class, can you move this logic from "update when block hash is needed" to "update when token ids of the request are updated"?
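
To make the suggestion concrete, one possible shape of the "update when token ids of the request are updated" variant is sketched below (the field and method names are assumptions for illustration, not the existing vLLM code):

```python
class Request:
    """Sketch of a Request that maintains its own block hashes."""

    def __init__(self, prompt_token_ids: list[int], block_size: int = 16):
        self.block_size = block_size
        self.all_token_ids = list(prompt_token_ids)
        self.block_hashes: list[int] = []

    def append_output_token_ids(self, token_ids: list[int]) -> None:
        self.all_token_ids.extend(token_ids)
        # Keep block_hashes in sync whenever the token ids change, so the
        # length of the list is never "magic" from a connector's viewpoint.
        self._update_block_hashes()

    def _update_block_hashes(self) -> None:
        num_full_blocks = len(self.all_token_ids) // self.block_size
        for i in range(len(self.block_hashes), num_full_blocks):
            block = tuple(self.all_token_ids[i * self.block_size:
                                             (i + 1) * self.block_size])
            parent = self.block_hashes[-1] if self.block_hashes else None
            self.block_hashes.append(hash((parent, block)))
```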

@orozery
Contributor Author

orozery commented Jul 6, 2025

My major concern is that the block_hash list is updated here

block_hashes.append(block_hash)

It is quite far from the Request class. It makes the number of elements in the list appear magic to connectors, and it hurts the modularity of the code.
Therefore, if you want to put it into the Request class, can you move this logic from "update when block hash is needed" to "update when token ids of the request are updated"?

Today, block_hashes are computed when the scheduler calls _update_waiting_for_remote_kv.
Shortly after this call, calls are made to kv_cache_manager.get_computed_blocks and connector.get_num_new_matched_tokens.
The connector must get the block hashes at this point.

I'm trying to interpret your suggestion "update when token ids of the request are updated".
My best guess is that you mean request.append_output_token_ids. This is too late for the connector.
I suspect I am not understanding you correctly.
Can you please be more concrete and reference actual code locations?

@orozery orozery force-pushed the request-block-hashes branch from 914876e to 234f334 Compare July 6, 2025 07:56
@heheda12345
Collaborator

Yes, I mean request.append_output_token_ids.
Based on the code here, the new tokens are updated in update_from_output in step k, and then used by the connector in scheduler.schedule() of step k+1. Why is it too late?

vllm/vllm/v1/engine/core.py

Lines 223 to 240 in 72d14d0

def step(self) -> tuple[dict[int, EngineCoreOutputs], bool]:
    """Schedule, execute, and make output.

    Returns tuple of outputs and a flag indicating whether the model
    was executed.
    """

    # Check for any requests remaining in the scheduler - unfinished,
    # or finished and not yet removed from the batch.
    if not self.scheduler.has_requests():
        return {}, False
    scheduler_output = self.scheduler.schedule()
    model_output = self.execute_model(scheduler_output)
    engine_core_outputs = self.scheduler.update_from_output(
        scheduler_output, model_output)  # type: ignore

    return (engine_core_outputs,
            scheduler_output.total_num_scheduled_tokens > 0)

@orozery
Contributor Author

orozery commented Jul 8, 2025

Yes, I mean request.append_output_token_ids. Based on the code here, the new tokens are updated in update_from_output in step k, and then used by the connector in scheduler.schedule() of step k+1. Why is it too late?


@heheda12345 Consider the following scenario:

  1. The engine gets a fresh new prompt with 1000 tokens (the exact number does not matter).
  2. The scheduler pops the prompt from its self.waiting list.
  3. The scheduler checks whether this prompt already has computed tokens in the GPU prefix cache by calling self.kv_cache_manager.get_computed_blocks. Assume no hits are found, so num_computed_tokens = 0.
  4. Next, it queries the connector for any externally hit tokens by calling self.connector.get_num_new_matched_tokens(request, ...). At this point, the connector needs the block hashes in order to check for hits.
  5. With your approach of delaying the setting of request.block_hashes, the connector will yield num_external_computed_tokens = 0, even though it may have all the tokens available!
  6. The request tokens will be sent to the workers, and the workers will recompute the KV values for these 1000 tokens.
  7. Only after the workers recompute the tokens will the scheduler call update_from_output and set the block_hashes.

To sum up, because of the delay in setting request.block_hashes, we lose the ability to utilize offloaded KV values of tokens from an external source (see the toy sketch below).
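
For concreteness, here is a tiny self-contained toy that mirrors steps 4-5 (all names are made up for the illustration; they are not the real scheduler or connector classes): with delayed hashing, a brand-new request reaches the connector with an empty block_hashes list, so it cannot report any external hits.

```python
BLOCK_SIZE = 16  # illustrative block size


class FreshRequest:
    """A request as it looks at its very first scheduling step."""

    def __init__(self, prompt_token_ids: list[int]):
        self.prompt_token_ids = prompt_token_ids
        # With the delayed approach, nothing has populated this yet.
        self.block_hashes: list[int] = []


class ToyConnector:
    """Pretends every hashed block is available externally."""

    def get_num_new_matched_tokens(self, request: FreshRequest,
                                   num_computed_tokens: int) -> int:
        # Only blocks that already have a hash can be looked up externally.
        return BLOCK_SIZE * len(request.block_hashes) - num_computed_tokens


request = FreshRequest(list(range(1000)))
connector = ToyConnector()
# Step 4 of the scenario runs before any model step, i.e. before
# update_from_output could have filled in request.block_hashes:
assert connector.get_num_new_matched_tokens(request, 0) == 0
```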

@heheda12345
Collaborator

But I think the hash of the prompt can be initialized when the request is created (see the sketch below).
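
Concretely, and reusing the assumed names from the sketch a few comments up (so this is a fragment extending that sketch, not standalone code), initializing the prompt's block hashes at request-creation time would just mean calling the same hash update from __init__:

```python
    def __init__(self, prompt_token_ids: list[int], block_size: int = 16):
        self.block_size = block_size
        self.all_token_ids = list(prompt_token_ids)
        self.block_hashes: list[int] = []
        # Hash all full prompt blocks up front, so the hashes already exist
        # when the scheduler first calls connector.get_num_new_matched_tokens,
        # before any model step has run.
        self._update_block_hashes()
```

That would cover the fresh-prompt case from the scenario above while still keeping the hash updates tied to changes in the request's token ids.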

@njhill
Member

njhill commented Aug 11, 2025

@orozery could you rebase on latest main? Should hopefully help with at least some of the CI failures.

auto-merge was automatically disabled August 11, 2025 19:16

Head branch was pushed to by a user without write access

@orozery orozery force-pushed the request-block-hashes branch from 0241ee6 to 564a359 Compare August 11, 2025 19:16
@orozery
Contributor Author

orozery commented Aug 11, 2025

@orozery could you rebase on latest main? Should hopefully help with at least some of the CI failures.

done

@heheda12345 heheda12345 enabled auto-merge (squash) August 12, 2025 17:32
auto-merge was automatically disabled August 14, 2025 12:35

Head branch was pushed to by a user without write access

@orozery orozery force-pushed the request-block-hashes branch 5 times, most recently from b3e1288 to e440daf Compare August 14, 2025 18:40
@heheda12345
Collaborator

Can you rebase with main again? It should fix the test_eagle_correctness failure.

This commit moves the request block hashes from the KVCacheManager
to the Request object itself.
In particular, this will allow connectors to access the request block hashes.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery orozery force-pushed the request-block-hashes branch from e440daf to 3d0762a Compare August 15, 2025 03:40
@heheda12345 heheda12345 merged commit c280066 into vllm-project:main Aug 15, 2025
40 checks passed
666even666 pushed a commit to 666even666/vllm that referenced this pull request Aug 18, 2025
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
divakar-amd pushed a commit to divakar-amd/vllm_upstream that referenced this pull request Aug 20, 2025
djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Aug 21, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1
