🚀 The feature, motivation and pitch

Writer has introduced the "Writing in the Margins" (WiM) algorithm, which boosts results for long-context-window retrieval. The task is composed of a "context" and a "query", with the query placed at the end.

The basic idea is to generate additional text while doing chunked prefill. The extra decoding step does not contribute to the KV-cache prefilling; the generated text is later concatenated and added to the final chunk.

There exists a pure HuggingFace transformers implementation: https://github.com/writer/writing-in-the-margins
This is a high-level overview of the inference pattern:

And this is a more detailed explanation of how to do it efficiently by batching generation and prefill requests.
The algorithm itself:
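A rough sketch of the pattern described above (not the reference implementation; `prefill`, `snapshot_kv`, `restore_kv` and `decode` are hypothetical names used only for illustration, not an existing vLLM API):

```python
def wim_inference(llm, chunks: list[str], query: str, margin_prompt: str) -> str:
    """Prefill the context chunk by chunk; after each chunk, decode a short
    "margin" that is kept as text but discarded from the KV cache."""
    margins = []
    for chunk in chunks[:-1]:
        llm.prefill(chunk)                         # chunked prefill extends the KV cache
        saved = llm.snapshot_kv()                  # remember the prefill-only cache state
        margins.append(llm.decode(margin_prompt))  # extra decoding step: write the margin
        llm.restore_kv(saved)                      # margin tokens do not enter the prefill cache
    # The margins are concatenated and added to the final chunk, followed by the query.
    llm.prefill(chunks[-1] + "\n" + "\n".join(margins) + "\n" + query)
    return llm.decode()                            # generate the final answer
```

The point of the request is that the whole sequence above runs inside a single vLLM request, so the partially prefilled KV cache is reused between the prefill and decode phases instead of being recomputed.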
The expected solution can be a feature added to vLLM or a vLLM fork; we are happy to maintain it.

The WiM solution assumes extra input preprocessing steps (nltk sentence splitting) and a variable chunk size for chunked prefill, but those details can be left out of the solution.
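For the preprocessing step, a minimal sketch of what the nltk-based splitting could look like (the greedy sentence packing and the character budget below are assumptions for illustration, not part of the issue):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)       # Punkt sentence tokenizer data
nltk.download("punkt_tab", quiet=True)   # needed by newer nltk releases

def split_into_chunks(context: str, target_chars: int = 4000) -> list[str]:
    """Greedily pack whole sentences into variable-size chunks of roughly target_chars."""
    chunks, current = [], ""
    for sentence in sent_tokenize(context):
        if current and len(current) + len(sentence) > target_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks
```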
We offer a $5,000 bounty for the main contributor (the bounty can be shared if more than one developer is involved).

paper: ArXiv

Alternatives

No response

Additional context

Github: https://github.com/writer/writing-in-the-margins
Indeed, Automatic Prefix Caching could be used to simulate the WiM algorithm. However, there are a few issues with this:
Different tokenization:
The real issue is the different tokenization you get by sending the text "A gentle breeze stirred" and then "the leaves as children", compared to sending the text "A gentle breeze stirred the leaves as children" in one piece (see the snippet after this comment). To really apply WiM by exploiting Prefix Caching, you would need to send multiple requests trimmed at the exact points where you want the tokenizer to break the text. But that means sending multiple requests to vLLM (which is what you do when you "simulate it"), and that takes much longer to process than prefill-generate-prefill-generate done within the same request.
High workload KV cache eviction:
In a high-workload scenario the KV cache would be evicted, and on a cache miss it would have to be recreated. Is that a problem? It may be: the whole point of WiM is to reuse the partially prefilled KV cache (in other words, the KV prefixes).
Storing customer data:
Automatic prefix caching means user data is stored in RAM. We would rather turn that feature off due to compliance issues.
To summarise: good point about Automatic Prefix Caching, it can be used for prototyping, but we still need something different for the production use case.
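To make the tokenization point above concrete, here is a small check one can run; GPT-2's tokenizer is used only as an example of a BPE tokenizer, and the exact token IDs differ per model:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # any BPE-style tokenizer shows the effect

part_a = "A gentle breeze stirred"
part_b = "the leaves as children"
joined = "A gentle breeze stirred the leaves as children"

split_ids = tok(part_a, add_special_tokens=False)["input_ids"] \
          + tok(part_b, add_special_tokens=False)["input_ids"]
joined_ids = tok(joined, add_special_tokens=False)["input_ids"]

# The token at the boundary typically differs ("the" at the start of a new
# request vs. " the" inside the joined text), so a cached prefix computed from
# the first request does not line up with the tokens of the full text.
print(split_ids == joined_ids)   # usually False
print(split_ids, joined_ids, sep="\n")
```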