[RFC]: [Spec Decode] Combine Ngram and EAGLE #18633

@ekagra-ranjan

Motivation.

EAGLE and ngram drafting give speedups in different scenarios.

  • Ngram is beneficial when the prompt contains ngrams that are likely to reappear in the output and can therefore serve as drafts, e.g., in editing tasks.
  • EAGLE is a model-based drafting strategy that can propose draft tokens in any general situation.

Right now, vllm serve can be started with either ngram or EAGLE, but not both, so the draft strategy has to be chosen up front. Supporting both editing requests and general requests in production therefore means deploying separate model instances. Because the mix of editing and non-editing traffic is not static, this leads to suboptimal GPU resource utilization.

Proposed Change.

If we combine ngram and EAGLE, we no longer need separate ngram and EAGLE deployments: a single instance can pick which strategy to use at any given timestep for a given request. Ngram is a model-free drafter, whereas EAGLE needs a draft model, so running both should add little overhead on top of EAGLE alone. The combination could also improve the overall acceptance length (AL) compared to using EAGLE alone.

Draft merge strategy

For any given timestep and for a given seq in a batch:

  • Check whether there is an ngram match.
    • If yes, use the matched tokens as the draft and skip EAGLE for this seq.
    • If no, add the seq to a batch to be processed by EAGLE.
  • Run EAGLE if any seq needs it.
  • Verify the proposed tokens as usual using the target model and the Rejection Sampler.
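
The routing steps above can be sketched roughly as follows. This is only an illustration of the proposed control flow, not vLLM code: `propose_drafts`, `ngram_propose`, and `eagle_propose` are hypothetical names, and the two drafters are passed in as plain callables so the sketch stays self-contained.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical drafter signatures for illustration (not vLLM APIs).
# ngram_propose: token ids of one seq -> draft tokens, or None if no match.
# eagle_propose: token ids of several seqs -> one draft per seq, same order.
NgramDrafter = Callable[[Sequence[int]], Optional[List[int]]]
EagleDrafter = Callable[[List[Sequence[int]]], List[List[int]]]


def propose_drafts(
    batch: Dict[str, Sequence[int]],
    ngram_propose: NgramDrafter,
    eagle_propose: EagleDrafter,
) -> Dict[str, List[int]]:
    """Pick ngram drafts where a match exists, otherwise fall back to EAGLE."""
    drafts: Dict[str, List[int]] = {}
    needs_eagle: List[str] = []

    # Step 1: try the model-free ngram lookup for every seq.
    for seq_id, tokens in batch.items():
        match = ngram_propose(tokens)
        if match:
            drafts[seq_id] = match
        else:
            needs_eagle.append(seq_id)

    # Step 2: run EAGLE once, only on the seqs without an ngram match.
    if needs_eagle:
        eagle_out = eagle_propose([batch[seq_id] for seq_id in needs_eagle])
        for seq_id, draft in zip(needs_eagle, eagle_out):
            drafts[seq_id] = draft

    # Step 3 (not shown): verify all drafts with the target model.
    return drafts
```

Note that the returned drafts can have different lengths per seq, since the ngram and EAGLE drafters may use different K.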

The key insight is that the optimal K for EAGLE found in practice is ~3 (src1, src2), i.e., EAGLE proposes at most 3 tokens at a time. Ngram is usually run with K=5, and K can be set higher. If prompt_lookup_max and prompt_lookup_min are set appropriately, an ngram match signals that the current sequence/dataset at the current step has a good chance of benefiting from ngram lookup. If no match is found, we defer to EAGLE to produce the draft. The merged strategy will be exposed under a different method name, such as ngram-eagle, and will not change the default behavior of ngram or eagle.
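
To make the role of prompt_lookup_min and prompt_lookup_max concrete, here is a minimal sketch of an ngram lookup over token ids. It is not the vLLM implementation: the function name `ngram_lookup`, the preference for longer ngrams, and the choice of the most recent earlier match are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def ngram_lookup(
    tokens: Sequence[int],
    prompt_lookup_min: int,
    prompt_lookup_max: int,
    k: int,
) -> Optional[List[int]]:
    """Return up to k tokens that followed an earlier copy of the current
    suffix, or None if no ngram of an allowed length matches."""
    n_tokens = len(tokens)
    longest = min(prompt_lookup_max, n_tokens - 1)
    shortest = max(prompt_lookup_min, 1)

    # Assumption: try longer ngrams first, since a longer match is a
    # stronger signal that the continuation will repeat.
    for n in range(longest, shortest - 1, -1):
        suffix = list(tokens[n_tokens - n:])
        # Assumption: scan earlier positions from most recent to oldest.
        # The loop bound excludes the suffix's own position.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                return list(tokens[start + n:start + n + k])
    return None  # no match: this seq would be deferred to EAGLE
```

For example, with tokens `[1, 2, 3, 4, 5, 9, 1, 2, 3]`, prompt_lookup_min=2, prompt_lookup_max=3, and k=5, the suffix `[1, 2, 3]` matches at the start of the sequence and the draft is `[4, 5, 9, 1, 2]`. A larger prompt_lookup_min makes matches rarer but more trustworthy, which is what lets a match act as the signal for choosing ngram over EAGLE.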

Feedback Period.

1 week

CC List.

@LiuXiaoxuanPKU @WoosukKwon

Any Other Things.

One of the challenges is the impact on torch.compile and CUDA graphs: the number of draft tokens can differ per seq during the verification step, since K=~3 for EAGLE and K>=5 for ngram. I will see how it works out, but feel free to share any thoughts.

Labels: RFC, stale (over 90 days of inactivity)