Description
Motivation.
Porting logprobs support to V1 was key for completeness, and automatic prefix caching (APC) is an important performance optimization. #9880 adds sample and prompt logprobs support; however, prompt logprobs currently require the server to be instantiated with --no-enable-prefix-caching; otherwise, a request with prompt_logprobs=true will fail with the message "Prefix caching with prompt logprobs not yet supported on VLLM V1."
The challenge of using prompt logprobs alongside APC is how to recover the topk prompt logprobs on an APC cache hit. The existing APC implementation does not cache prompt logprobs; upon a cache hit, cached blocks are treated as "computed" and no prompt logprobs are available for them.
Proposed Choices for Implementation
- Use APC cached KVs to recompute prompt logprobs if a request with prompt_logprobs=true triggers an APC cache hit. This requires model code and model_executor code to support re-running prefill using cached KVs.
- Cache prompt logprobs in the APC. The problem with this solution is that a request which triggers an APC cache hit may require a greater number of topk prompt logprobs than the request which filled the cache, in which case recomputation would be necessary anyway.
- Bypass APC for requests with prompt_logprobs=true. Requests with prompt_logprobs=true cannot exploit the APC cache. This is the simplest solution but incurs a performance penalty.
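To make the trade-off between the last two options concrete, here is a minimal toy sketch in Python. It is not vLLM code: `Request`, `CacheHit`, `tokens_to_prefill`, and the policy names are all hypothetical, and it assumes for simplicity that an insufficient cached topk forces a full prefill of the prompt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    # Hypothetical request shape for illustration; not vLLM's actual class.
    num_prompt_tokens: int
    prompt_logprobs: Optional[int] = None  # requested topk, None if not requested


@dataclass
class CacheHit:
    # Hypothetical result of a prefix-cache lookup.
    num_cached_tokens: int
    cached_topk: Optional[int] = None  # topk stored with the cached blocks, if any


def tokens_to_prefill(req: Request, hit: CacheHit, policy: str) -> int:
    """Return how many prompt tokens must go through prefill under a policy."""
    uncached = req.num_prompt_tokens - hit.num_cached_tokens
    if req.prompt_logprobs is None:
        # No prompt logprobs requested: the cache hit is fully usable.
        return uncached
    if policy == "bypass":
        # Option 3: ignore the cache whenever prompt logprobs are requested.
        return req.num_prompt_tokens
    if policy == "cache_logprobs":
        # Option 2: reuse cached logprobs only if enough of them were stored.
        enough = (
            hit.cached_topk is not None
            and hit.cached_topk >= req.prompt_logprobs
        )
        # Simplifying assumption: too few cached logprobs means full recompute.
        return uncached if enough else req.num_prompt_tokens
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    hit = CacheHit(num_cached_tokens=64, cached_topk=5)
    for k in (None, 5, 10):
        req = Request(num_prompt_tokens=100, prompt_logprobs=k)
        for policy in ("bypass", "cache_logprobs"):
            print(k, policy, tokens_to_prefill(req, hit, policy))
```

In this toy model, caching logprobs only helps when the request that filled the cache asked for at least as many topk entries as the current one; otherwise it degrades to the same cost as bypassing APC.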
Feedback Period.
One week from 2/17
CC List.
@robertgshaw2-redhat @WoosukKwon @njhill
Any Other Things.
No response