Update max_num_tokens value when specdec is enabled #34671
shaharmor98 wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Shahar Mor <smor@nvidia.com>
Code Review
This pull request correctly identifies the need to adjust max_num_tokens in GPUModelRunner when speculative decoding is active to prevent potential buffer overflows. The adjustment for parallel drafting appears correct. However, for the serial drafting case, the implementation only accounts for one extra token per sequence, which is insufficient if num_speculative_tokens is greater than one. I've provided a suggestion to correct this by consistently using num_speculative_tokens for the calculation, which will ensure the buffer is correctly sized for both serial and parallel drafting scenarios.
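The sizing suggested in this review can be sketched as a small helper; the function name and signature below are illustrative assumptions, not vLLM's actual API:

```python
# Minimal sketch of the review's suggested buffer sizing. The helper name
# `adjusted_max_num_tokens` is hypothetical; the point is that each sequence
# may carry up to `num_speculative_tokens` draft tokens, so the buffer grows
# by `num_speculative_tokens * max_num_seqs` in both the serial and parallel
# drafting cases.

def adjusted_max_num_tokens(
    max_num_tokens: int,
    max_num_seqs: int,
    num_speculative_tokens: int,
) -> int:
    return max_num_tokens + num_speculative_tokens * max_num_seqs
```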
benchislett left a comment
Please add a short comment explaining why this is necessary. See my reply to the bot comment for context.
Otherwise LGTM
Purpose
Following the Unified Parallel Drafting PR, the `max_num_tokens` field in `gpu_model_runner.py` hasn't been adjusted. When speculative decoding is enabled (draft model or Eagle), the scheduler increases `max_num_batched_tokens` to account for speculative tokens, but `GPUModelRunner.max_num_tokens` was not reflecting this adjustment. This could cause mismatches between what the scheduler sends and what the model runner expects.

This fix increases `max_num_tokens` by `max_num_seqs` (or `num_speculative_tokens * max_num_seqs` when parallel drafting is used) when a draft model or Eagle speculative decoding is active.
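As a standalone sketch of the adjustment described above: the field names `max_num_batched_tokens`, `max_num_seqs`, and `num_speculative_tokens` follow vLLM's config objects, but `parallel_drafting` is a hypothetical flag standing in for however the runner detects parallel drafting.

```python
# Sketch of the PR's sizing logic as submitted, not the actual
# GPUModelRunner code.

def compute_max_num_tokens(
    max_num_batched_tokens: int,
    max_num_seqs: int,
    num_speculative_tokens: int,
    parallel_drafting: bool,
    specdec_enabled: bool,
) -> int:
    if not specdec_enabled:
        return max_num_batched_tokens
    if parallel_drafting:
        # Parallel drafting: every sequence may carry all its draft tokens.
        extra = num_speculative_tokens * max_num_seqs
    else:
        # Serial drafting: one extra token per sequence (the case the bot
        # review above argues should also use num_speculative_tokens).
        extra = max_num_seqs
    return max_num_batched_tokens + extra
```

For example, with `max_num_batched_tokens=8192`, `max_num_seqs=256`, and `num_speculative_tokens=3`, the serial path yields 8448 tokens while parallel drafting yields 8960.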