Minor fix in prefill cache example (vllm-project#2494)
JasonZhu1313 authored Jan 18, 2024
1 parent 8a25d3a commit 5d80a91
Showing 1 changed file with 10 additions and 2 deletions.
examples/offline_inference_with_prefix.py
@@ -40,8 +40,16 @@
 # -1 since the last token can change when concatenating prompts.
 prefix_pos = len(llm.llm_engine.tokenizer.encode(prefix)) - 1

-# Generate with prefix
-outputs = llm.generate(generating_prompts, sampling_params,
+# The llm.generate call will batch all prompts and send the batch at once if resources allow.
+# The prefix will only be cached after the first batch is processed, so we need to call generate once
+# to calculate the prefix and cache it.
+outputs = llm.generate(generating_prompts[0],
+                       sampling_params,
+                       prefix_pos=[prefix_pos])
+
+# Subsequent batches can leverage the cached prefix
+outputs = llm.generate(generating_prompts,
+                       sampling_params,
                        prefix_pos=[prefix_pos] * len(generating_prompts))

 # Print the outputs. You should see the same outputs as before
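
For context, here is a minimal, self-contained sketch of the call pattern the patched example relies on. It assumes the prefix_pos-based prefix caching API that LLM.generate exposed in vLLM at the time of this commit; the model name, prefix text, and prompts below are illustrative placeholders, not the exact contents of the example file.

from vllm import LLM, SamplingParams

# Illustrative shared prefix and prompts; the real example file defines its own.
prefix = ("You are an expert school principal. "
          "Answer the question below in one short paragraph.\n")
prompts = ["What makes a good teacher?",
           "How should a school handle snow days?"]
generating_prompts = [prefix + prompt for prompt in prompts]

sampling_params = SamplingParams(temperature=0.0)
llm = LLM(model="facebook/opt-125m")  # illustrative model choice

# -1 since the last token can change when concatenating prompts.
prefix_pos = len(llm.llm_engine.tokenizer.encode(prefix)) - 1

# First call: computes the prefix once and populates the prefix cache.
llm.generate(generating_prompts[0], sampling_params,
             prefix_pos=[prefix_pos])

# Subsequent call: the whole batch reuses the cached prefix.
outputs = llm.generate(generating_prompts, sampling_params,
                       prefix_pos=[prefix_pos] * len(generating_prompts))

for output in outputs:
    print(f"Generated text: {output.outputs[0].text!r}")

The extra warm-up call matters because, as the added comments explain, the prefix is only written to the cache after the first batch containing it has been processed; once cached, the full batch avoids recomputing the shared prefix.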