
[Feature]: Integrate Writing in the Margins inference pattern ($5,000 Bounty) #9807

melisa-writer opened this issue Oct 29, 2024 · 3 comments

melisa-writer commented Oct 29, 2024

🚀 The feature, motivation and pitch

Writer has introduced the "Writing in the Margins" (WiM) algorithm, which boosts results for long-context-window retrieval. The task is composed of a "context" and a "query", with the query placed at the end.

The basic idea is to generate additional text (the "margins") while doing chunked prefill. The extra decoding step does not contribute to the KV-cache prefilling. The generated margins are later concatenated and added to the final chunk.

There exists a pure HuggingFace transformers implementation: https://github.com/writer/writing-in-the-margins
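
To make the pattern concrete, here is a schematic sketch of the loop described above. `prefill_chunk`, `decode_margin`, and `decode_answer` are hypothetical helpers used only to illustrate the control flow, not part of any existing API; the real details are in the linked repository.

```python
# Schematic sketch only: prefill_chunk, decode_margin, and decode_answer are
# hypothetical helpers illustrating the control flow described above.
def writing_in_the_margins(chunks, query):
    kv_cache = None
    margins = []
    for chunk in chunks:
        # Chunked prefill: extend the KV cache with this chunk's tokens.
        kv_cache = prefill_chunk(kv_cache, chunk)
        # Extra decoding step: generate a margin conditioned on the cache and
        # the query; the margin tokens do NOT contribute to KV-cache prefilling.
        margins.append(decode_margin(kv_cache, query))
    # The margins are concatenated and added to the final chunk together with
    # the query, then the answer is decoded normally.
    kv_cache = prefill_chunk(kv_cache, "\n".join(margins) + "\n" + query)
    return decode_answer(kv_cache)
```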

This is a high-level overview of the inference pattern:
[screenshot: high-level overview of the WiM inference pattern]

And this is a more detailed explanation of how to do it efficiently with batched generation and prefill requests:
[screenshot: batched margin generation and prefill requests]

The algorithm itself:
[screenshot: the WiM algorithm]

The expected solution can be a feature added to vLLM or a vLLM fork; we are happy to maintain it.
The WiM solution assumes extra input preprocessing steps (nltk sentence splitting, sketched below) and a variable chunk size for chunked prefill, but those details can be left out of the solution.
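
For reference, a minimal sketch of that preprocessing step, assuming plain nltk sentence splitting packed into character-budgeted chunks (the chunk budget is an illustrative choice, not from the paper):

```python
# Sketch of the nltk-based splitting mentioned above; the 4000-character chunk
# budget is an illustrative assumption.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # needed by newer nltk releases

def split_into_chunks(text: str, max_chars: int = 4000) -> list[str]:
    chunks, current = [], ""
    for sentence in nltk.sent_tokenize(text):
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```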

We offer a $5,000 bounty for the main contributor (the bounty can be shared if more than one developer is involved).

paper: ArXiv
press coverage:

Alternatives

No response

Additional context

GitHub: https://github.com/writer/writing-in-the-margins

noooop (Contributor) commented Oct 31, 2024

That's a brilliant idea.

I think this algorithm can be implemented efficiently through the prefix cache, and vLLM supports prefix caching, so you can implement it without modifying any vLLM code.

Specifically:

  1. Start vLLM with prefix caching enabled.
  2. Submit margin-generation requests in order.
  3. Immediately submit the next margin-generation request as soon as the previous request returns its first token.
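
A minimal sketch of that recipe, assuming vLLM's offline `LLM` API with `enable_prefix_caching=True`; the model name, chunking, and prompts are illustrative, and the requests are shown sequentially rather than pipelined as in step 3:

```python
# Sketch of the recipe above using vLLM's automatic prefix caching.
# Model name, chunks, and prompts are illustrative assumptions, and the
# requests are submitted sequentially here (step 3's pipelining is omitted).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=128, temperature=0.0)

chunks = ["<chunk 1 of the long context>", "<chunk 2>", "<chunk 3>"]
query = "\n\nQuery: <the user question>"
margin_prompt = "\n\nSummarize the information above that is relevant to the query."

margins, prefix = [], ""
for chunk in chunks:
    prefix += chunk
    # The growing prefix is re-sent each time; the shared part should hit the
    # prefix cache, so only the new chunk and the margin prompt are prefilled.
    out = llm.generate(prefix + query + margin_prompt, params)
    margins.append(out[0].outputs[0].text)

# Final request: full context plus the accumulated margins plus the query.
final = llm.generate(prefix + "\n".join(margins) + query, params)
print(final[0].outputs[0].text)
```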

How do I get $5,000? (joke)

melisa-writer (Author) commented

Thank you!

Indeed, Automatic Prefix Caching could be used to simulate the WiM algorithm. However, there are a few issues with this:

  1. Different tokenization:

The real issue is that you get different tokenization when you send the text "A gentle breeze stirred" followed by "the leaves as children" than when you send "A gentle breeze stirred the leaves as children" in one piece. To really apply WiM by exploiting Prefix Caching, you would need to send multiple requests trimmed at the exact points where you want the tokenizer to break the text (see the tokenizer sketch after this list). But that means sending multiple requests to vLLM (which is what you do when you "simulate it"), which takes much longer to process than prefill-generate-prefill-generate done within a single request.

  2. High-workload KV-cache eviction:

In a high-workload scenario the KV cache would be evicted, and on a cache miss it would be recomputed. But is that a problem? It may be: the whole point of WiM is to reuse the partially prefilled KV cache (in other terms, the KV prefixes).

  3. Storing customer data:

Automatic Prefix Caching means user data is stored in RAM. We would rather turn that feature off due to compliance issues.
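
A small sketch of point 1, using a Hugging Face tokenizer (the tokenizer name is an illustrative assumption):

```python
# Illustration of the tokenization-boundary problem from point 1 above.
# The "gpt2" tokenizer is an illustrative choice; any BPE tokenizer shows it.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

whole  = tok.encode("A gentle breeze stirred the leaves as children")
part_a = tok.encode("A gentle breeze stirred")
part_b = tok.encode("the leaves as children")

# With typical BPE tokenizers the two halves merge differently than the full
# sentence (e.g. " the" vs. "the"), so part_a + part_b generally does not
# match the prefix of `whole`, and the cached prefix cannot be reused.
print(whole == part_a + part_b)  # usually False
```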

To summarise: good point about Automatic Prefix Caching; it can be used for prototyping. We still need something different for the production use case.

noooop (Contributor) commented Nov 1, 2024

Using prompt_token_ids as input can bypass the tokenizer.
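
For example, a sketch assuming vLLM's `TokensPrompt` input type (availability depends on the vLLM version) and an illustrative model name, so every request shares byte-identical prefix token IDs:

```python
# Sketch of bypassing the tokenizer by sending pre-tokenized input; assumes
# vllm.inputs.TokensPrompt is available (recent vLLM versions).
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
tok = llm.get_tokenizer()
params = SamplingParams(max_tokens=64)

chunk_ids = [tok.encode(c, add_special_tokens=False)
             for c in ("<chunk 1> ", "<chunk 2> ")]
margin_ids = tok.encode("\nSummarize the relevant information:", add_special_tokens=False)

prefix_ids: list[int] = []
for ids in chunk_ids:
    prefix_ids += ids
    # The prefix is the exact same token IDs on every request, so the cached
    # KV blocks can be reused regardless of how the text would re-tokenize.
    out = llm.generate(TokensPrompt(prompt_token_ids=prefix_ids + margin_ids), params)
    print(out[0].outputs[0].text)
```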

The previous margin-generation request is still in progress, so its KV cache is still on the GPU; you cannot miss it.

Prefix Caching is almost a perfect solution.
