Update max_num_tokens value when specdec is enabled #34671

Open
shaharmor98 wants to merge 1 commit into vllm-project:main from shaharmor98:bugfix/fix-max-tokens-init-value

Conversation


@shaharmor98 (Contributor) commented Feb 17, 2026

Purpose

Following the Unified Parallel Drafting PR, the max_num_tokens field in gpu_model_runner.py was not adjusted. When speculative decoding is enabled (draft model or Eagle), the scheduler increases max_num_batched_tokens to account for speculative tokens, but GPUModelRunner.max_num_tokens did not reflect this adjustment. This could cause mismatches between the number of tokens the scheduler sends and the number the model runner expects.

This fix increases max_num_tokens by max_num_seqs (or by num_speculative_tokens * max_num_seqs when parallel drafting is used) whenever a draft model or Eagle speculative decoding is active.
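
A minimal sketch of the sizing rule described above. This is illustrative only: the helper name and its standalone form are assumptions (the actual change lands inline in gpu_model_runner.py), while max_num_batched_tokens, max_num_seqs, and num_speculative_tokens follow vLLM's scheduler and speculative config naming.

```python
def adjusted_max_num_tokens(
    max_num_batched_tokens: int,
    max_num_seqs: int,
    num_speculative_tokens: int,
    parallel_drafting: bool,
) -> int:
    """Mirror the scheduler's speculative-token headroom in the model runner.

    Hypothetical helper for illustration of the PR's arithmetic.
    """
    if parallel_drafting:
        # Parallel drafting can add up to num_speculative_tokens extra
        # tokens per sequence in a single scheduling step.
        return max_num_batched_tokens + num_speculative_tokens * max_num_seqs
    # Serial drafting adds one extra token per sequence.
    return max_num_batched_tokens + max_num_seqs
```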


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Shahar Mor <smor@nvidia.com>
@mergify (bot) added the v1 label Feb 17, 2026

@gemini-code-assist (bot) left a comment


Code Review

This pull request correctly identifies the need to adjust max_num_tokens in GPUModelRunner when speculative decoding is active to prevent potential buffer overflows. The adjustment for parallel drafting appears correct. However, for the serial drafting case, the implementation only accounts for one extra token per sequence, which is insufficient if num_speculative_tokens is greater than one. I've provided a suggestion to correct this by consistently using num_speculative_tokens for the calculation, which will ensure the buffer is correctly sized for both serial and parallel drafting scenarios.
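
For context, a sketch of what the bot's suggestion amounts to, reusing the hypothetical helper above: drop the branch and always reserve num_speculative_tokens of headroom per sequence.

```python
def adjusted_max_num_tokens_suggested(
    max_num_batched_tokens: int,
    max_num_seqs: int,
    num_speculative_tokens: int,
) -> int:
    # Size the buffer the same way for serial and parallel drafting, so
    # the serial path is not undersized when num_speculative_tokens > 1.
    return max_num_batched_tokens + num_speculative_tokens * max_num_seqs
```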


@benchislett (Collaborator) left a comment


Please add a short comment explaining why this is necessary. See my reply to the bot comment for context.

Otherwise LGTM
