
Lookahead decoding | forking + "appending" to child sequences. #1970

Closed
priyamtejaswin opened this issue Dec 7, 2023 · 5 comments

@priyamtejaswin

Hi @WoosukKwon and @zhuohan123,

Fantastic project!

I was taking a stab at implementing a version of greedy lookahead decoding. Given some candidate completions, I was trying to do the following (a rough sketch follows the list):

  1. Fork children from the parent sequence
  2. Append new tokens (from the candidates) to the child sequences
  3. Call step in the engine to parallelize the next-token prediction across candidates
  4. Verify and select the longest matching prefix
  5. Append this prefix to the parent sequence
  6. Discard all child sequences created in step 1
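
To make the intended loop concrete, here is a minimal, self-contained sketch. ToySequence and model_greedy_next are toy stand-ins (not vLLM's actual Sequence or engine objects), and verification runs sequentially here where step 3 would be a single batched engine.step() across all children:

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToySequence:
    token_ids: List[int] = field(default_factory=list)

    def fork(self) -> "ToySequence":
        # Step 1: fork a child that copies the parent's tokens.
        return ToySequence(token_ids=list(self.token_ids))

    def append(self, token_id: int) -> None:
        # Steps 2 and 5: append a token to this sequence.
        self.token_ids.append(token_id)


def greedy_lookahead_step(
    parent: ToySequence,
    candidates: List[List[int]],
    model_greedy_next: Callable[[List[int]], int],
) -> int:
    """Verify each candidate continuation against greedy decoding and
    append the longest matching prefix to the parent. Returns the number
    of accepted tokens."""
    best_accepted: List[int] = []
    for cand in candidates:
        child = parent.fork()                       # step 1
        accepted: List[int] = []
        for proposed in cand:
            # Step 3 (sequential here): what would greedy decoding emit
            # next, given the child's current tokens?
            actual = model_greedy_next(child.token_ids)
            if actual != proposed:                  # step 4: verify
                break
            accepted.append(proposed)
            child.append(proposed)                  # step 2
        if len(accepted) > len(best_accepted):
            best_accepted = accepted
        # Step 6: the child is simply discarded.
    for tok in best_accepted:                       # step 5
        parent.append(tok)
    return len(best_accepted)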

I had a question about the behavior of Sequence.append_token_id, and its implications on the future engine steps.

vllm/vllm/sequence.py

Lines 159 to 167 in 24f60a5

def append_token_id(
    self,
    token_id: int,
    logprobs: Dict[int, float],
) -> None:
    assert token_id in logprobs
    self._append_tokens_to_blocks([token_id])
    self.output_logprobs.append(logprobs)
    self.data.append_token_id(token_id, logprobs[token_id])

From the looks of it, if I append a token here, it should add the token to the appropriate blocks.

But when I try this in practice, I get a different output. Suppose the LLM was generating

912 -> 442 -> 42

I intervene after it has generated 912, append 442 using .append_token_id, and then call step(). But I see:

912 -> 442 -> 10

Seeding is not the problem -- I have accounted for that.
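
For reference, the intervention looks roughly like this. This is a hedged sketch against the vLLM version I was experimenting with; reaching into engine.scheduler.running to grab the live Sequence is my own hack rather than a supported API, and the model, prompt, and dummy 0.0 logprob are placeholders:

from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))
engine.add_request("0", "The answer is", SamplingParams(temperature=0.0, max_tokens=8))

engine.step()                            # generates 912 in my example

seq_group = engine.scheduler.running[0]  # assumes a single running request
seq = seq_group.get_seqs()[0]            # assumes a single sequence in it

forced_token = 442
# append_token_id asserts that token_id is a key of logprobs, so pass a
# dummy logprob for the forced token.
seq.append_token_id(forced_token, {forced_token: 0.0})

outputs = engine.step()                  # should now continue after 442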

Tagging some folks who had previously participated in lookahead/speculative decoding discussions.
@simon-mo @LiuXiaoxuanPKU @skrider @beginlner

@priyamtejaswin
Author

I found the problem. This is fixed.

I'll leave the issue open in case someone has thoughts on this approach. Now that the outputs match (with and without the interventions), I'll try to finish a draft and see if it works.

@learning-chip

> I'll try to finish a draft and see if it works.

Hi @priyamtejaswin, do you have a draft already? I am also interested in taking this further (especially Lookahead + PagedAttention).

@cadedaniel
Collaborator

I believe once #2188 is merged you can add Lookahead as the proposer, since verification of tokens is the same.

@creatorrr

Lookahead decoding now supports both FlashAttention and sampling.

https://github.com/hao-ai-lab/LookaheadDecoding

@hmellor
Collaborator

hmellor commented Aug 28, 2024

Closing this as a duplicate of #1742.

The work @cadedaniel mentioned has been completed and the discussion for this feature is more active in the issue I linked above.

@hmellor closed this as not planned (duplicate) on Aug 28, 2024