
Fixes the error in num_accepted_tokens calculation in reject sampler #2658

Closed

Conversation

sighingnow
Contributor

Given the accepted mask vector [True, False, True, True], num_accepted_tokens should be 1 instead of 3, since the last two tokens must be rejected once the 2nd token is rejected.

Thus, accepted.sum() is incorrect for num_accepted_tokens.

The reject sampler was merged in #2336, so @cadedaniel could you please help confirm this bugfix? Thanks!
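The difference between the two counting rules can be sketched in plain Python (the helper names are illustrative, not vLLM's actual implementation, which operates on torch tensors):

```python
from itertools import takewhile

def naive_accepted(mask):
    # Counts every True in the mask, as accepted.sum() does,
    # regardless of rejections earlier in the sequence.
    return sum(mask)

def prefix_accepted(mask):
    # Counts accepted tokens only up to the first rejection:
    # once a token is rejected, everything after it is discarded.
    return sum(1 for _ in takewhile(bool, mask))

mask = [True, False, True, True]
print(naive_accepted(mask))   # 3
print(prefix_accepted(mask))  # 1
```

The PR argues that the prefix-aware count is the right value for num_accepted_tokens; the maintainer's reply below explains why the naive count is intentional.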

Signed-off-by: Tao He <sighingnow@gmail.com>
@Yard1 Yard1 requested a review from cadedaniel January 30, 2024 12:34
@LiuXiaoxuanPKU
Collaborator

LiuXiaoxuanPKU commented Jan 30, 2024

@LiuXiaoxuanPKU LiuXiaoxuanPKU self-assigned this Jan 30, 2024
@cadedaniel
Collaborator

Hi @sighingnow and @LiuXiaoxuanPKU. Thanks for investigating this behavior and your interest in speculative decoding.

This is actually by design. To run speculative decoding in production, we want an empirical measurement of the quality of our draft model. We formulate this as the number of draft tokens that would be accepted by rejection sampling, assuming each token's prefix is also accepted. This gives us a good real-world proxy for how aligned the draft and target models are.

For the top-level metric on how many tokens are emitted by speculative decoding, one should use num_emitted_tokens. This is what should be used to model the reduction in user-perceived inter-token-latency (ITL), as ITL is approximately step_time / num_tokens_per_step.
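The distinction between the two metrics can be sketched in plain Python (count_metrics is a hypothetical helper for illustration, not vLLM's API):

```python
def count_metrics(accepted_mask):
    """accepted_mask: per-draft-token accept/reject decisions for one step."""
    # num_accepted_tokens: every individually accepted draft token,
    # assuming its prefix was accepted -- a draft-model quality proxy.
    num_accepted = sum(accepted_mask)
    # num_emitted_tokens: the accepted prefix actually kept, plus exactly
    # one token sampled from the target model (a recovered or bonus token).
    prefix = 0
    for ok in accepted_mask:
        if not ok:
            break
        prefix += 1
    num_emitted = prefix + 1
    return num_accepted, num_emitted

# Mask from the discussion above:
print(count_metrics([True, False, True, True]))  # (3, 2)
```

Under this formulation, num_emitted_tokens per step feeds directly into the ITL approximation step_time / num_tokens_per_step, while num_accepted_tokens tracks draft/target alignment.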

@sighingnow
Contributor Author

> num_accepted_tokens should be 4 here because 8 is a bonus token so it does not count.

Here, the 1 in the first sequence is a token generated by the target model rather than drafted by the draft model. Note that speculative decoding always produces at least 1 token per step, even when all draft tokens are rejected.

> For the top-level metric on how many tokens are emitted by speculative decoding, one should use num_emitted_tokens. This is what should be used to model the reduction in user-perceived inter-token-latency (ITL), as ITL is approximately step_time / num_tokens_per_step.

I see. Thanks for making it clear.

@sighingnow sighingnow closed this Jan 31, 2024
@sighingnow sighingnow deleted the ht/fixes-num-accepted-tokens branch January 31, 2024 02:21