Description
Motivation.
EAGLE and Ngram give speedup in different scenarios.
- Ngram is beneficial when the prompt contains ngrams that are useful for creating drafts, e.g., in editing tasks
- EAGLE is a drafting strategy that can propose draft tokens in any general situation
Right now, we can start vllm serve with either ngram OR EAGLE, so we need to decide beforehand which draft strategy to use. This means we have to deploy separate model instances in production if we want to support both editing requests and general requests. This leads to suboptimal GPU resource utilization, since the traffic mix between editing and non-editing tasks is not static.
Proposed Change.
If we combine ngram and EAGLE, we no longer need separate ngram and EAGLE deployments, and the instance can pick which strategy to use at any given timestep for a given request. Ngram is a model-free drafter whereas EAGLE needs a draft model, so the overhead of running both should not be much higher than the overhead of EAGLE alone. It could also improve overall AL (acceptance length) compared to using just EAGLE.
Draft merge strategy
For any given timestep and for a given seq in a batch:
- check whether there is any ngram match
  - If yes, we use the matched tokens as the draft tokens and skip EAGLE for this seq
  - If no, we add the seq to a batch to be processed by EAGLE
- run EAGLE if there is any seq that needs it
- verify the proposed tokens as usual using the target model and Rejection Sampler
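The drafting steps above can be sketched roughly as follows. This is a minimal illustration, not vLLM code: `propose_drafts`, `ngram_propose`, and `eagle_propose` are hypothetical names, and the two drafters are passed in as plain callables so the routing logic stands on its own. Verification by the target model is left out.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical drafter signatures, for illustration only (not vLLM APIs).
NgramDrafter = Callable[[Sequence[int]], Optional[List[int]]]
EagleDrafter = Callable[[List[Sequence[int]]], List[List[int]]]


def propose_drafts(
    batch: Dict[str, Sequence[int]],
    ngram_propose: NgramDrafter,
    eagle_propose: EagleDrafter,
) -> Dict[str, List[int]]:
    """Return draft tokens per request id, preferring ngram matches."""
    drafts: Dict[str, List[int]] = {}
    needs_eagle: List[str] = []

    for req_id, token_ids in batch.items():
        match = ngram_propose(token_ids)
        if match:
            # Ngram hit: use it as the draft and skip EAGLE for this seq.
            drafts[req_id] = match
        else:
            # No hit: defer this seq to the EAGLE batch.
            needs_eagle.append(req_id)

    # Run EAGLE once, and only if at least one seq needs it.
    if needs_eagle:
        eagle_out = eagle_propose([batch[r] for r in needs_eagle])
        for req_id, draft in zip(needs_eagle, eagle_out):
            drafts[req_id] = draft

    return drafts
```

Note that the returned drafts can have different lengths per seq (for example 5 tokens from ngram and 3 from EAGLE), which is what the verification step then has to cope with.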
The key insight is that the optimal K for EAGLE we have found in practice is ~3 (src1, src2), i.e., EAGLE proposes at most 3 tokens at a time. Ngram is usually run with K=5, but this can be set higher. If prompt_lookup_max and prompt_lookup_min are set correctly, an ngram match signals that the current sequence/dataset has a good chance of benefiting from ngram lookup at the current step. If no match is found, we defer to EAGLE to produce the draft. The merged strategy will have a different method name, such as ngram-eagle, and will not change the default behavior of ngram or eagle.
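To make the role of prompt_lookup_min and prompt_lookup_max concrete, here is a simplified sketch of a prompt-lookup match. It is an assumption-laden illustration rather than vLLM's actual ngram proposer: the function name `ngram_lookup`, the parameter `k`, and the "longest ngram first, most recent occurrence first" search order are all choices made for this example.

```python
from typing import List, Optional, Sequence


def ngram_lookup(
    token_ids: Sequence[int],
    prompt_lookup_min: int,
    prompt_lookup_max: int,
    k: int,
) -> Optional[List[int]]:
    """Match the trailing ngram against earlier context.

    Tries ngram sizes from prompt_lookup_max down to prompt_lookup_min.
    On a match, returns up to k tokens that followed the earlier
    occurrence; returns None when no ngram size matches.
    """
    total = len(token_ids)
    largest = min(prompt_lookup_max, total - 1)
    for n in range(largest, prompt_lookup_min - 1, -1):
        suffix = list(token_ids[total - n:])
        # Scan earlier windows, most recent first. The last valid start
        # leaves at least one token after the window to use as a draft.
        for start in range(total - n - 1, -1, -1):
            if list(token_ids[start:start + n]) == suffix:
                return list(token_ids[start + n:start + n + k])
    return None


# The trailing [1, 2] also appears at the start, so the tokens after it are proposed.
print(ngram_lookup([1, 2, 3, 4, 1, 2], prompt_lookup_min=2, prompt_lookup_max=3, k=5))
# No earlier occurrence of the trailing ngram: the seq would be deferred to EAGLE.
print(ngram_lookup([1, 2, 3, 4], prompt_lookup_min=2, prompt_lookup_max=3, k=5))
```

A larger prompt_lookup_min makes a match rarer but a stronger signal that the seq is in an editing-like regime, which is what the merged strategy relies on when it skips EAGLE.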
Feedback Period.
1 week
CC List.
Any Other Things.
One of the challenges is the impact on torch.compile and CUDA graphs, since the number of draft tokens can differ per seq during the verification step (K=~3 for EAGLE and K>=5 for ngram). I will see how it works out, but feel free to share any thoughts.
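To picture the shape problem, the sketch below pads ragged per-seq drafts to a fixed width and records the true lengths. This is only one conceivable way to present a static shape to a compiled graph, not a description of how vLLM handles it or of what this proposal will do; `pad_drafts` and `PAD_TOKEN_ID` are hypothetical names.

```python
from typing import List, Tuple

PAD_TOKEN_ID = -1  # hypothetical placeholder, not a real vocabulary id


def pad_drafts(
    drafts: List[List[int]], max_k: int
) -> Tuple[List[List[int]], List[int]]:
    """Pad each draft to max_k tokens and return (padded, true_lengths)."""
    padded: List[List[int]] = []
    lengths: List[int] = []
    for draft in drafts:
        draft = draft[:max_k]  # truncate drafts longer than max_k
        lengths.append(len(draft))
        padded.append(draft + [PAD_TOKEN_ID] * (max_k - len(draft)))
    return padded, lengths


# An EAGLE draft (K=3) and an ngram draft (K=5) in the same batch.
padded, lengths = pad_drafts([[7, 7, 7], [9, 9, 9, 9, 9]], max_k=5)
```

The trade-off with this kind of padding is wasted verification compute on the padded positions, so whether it is acceptable would need to be measured.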