Update max_num_tokens value when specdec is enabled #34671
shaharmor98 wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Shahar Mor <smor@nvidia.com>
Code Review
This pull request correctly identifies the need to adjust max_num_tokens in GPUModelRunner when speculative decoding is active to prevent potential buffer overflows. The adjustment for parallel drafting appears correct. However, for the serial drafting case, the implementation only accounts for one extra token per sequence, which is insufficient if num_speculative_tokens is greater than one. I've provided a suggestion to correct this by consistently using num_speculative_tokens for the calculation, which will ensure the buffer is correctly sized for both serial and parallel drafting scenarios.
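The sizing suggested in this review can be sketched as a small helper; the function name and signature below are illustrative assumptions, not vLLM's actual API:

```python
# Minimal sketch of the review's suggested buffer sizing. The helper name
# `adjusted_max_num_tokens` is hypothetical; the point is that each sequence
# may carry up to `num_speculative_tokens` draft tokens, so the buffer grows
# by `num_speculative_tokens * max_num_seqs` in both the serial and parallel
# drafting cases.

def adjusted_max_num_tokens(
    max_num_tokens: int,
    max_num_seqs: int,
    num_speculative_tokens: int,
) -> int:
    return max_num_tokens + num_speculative_tokens * max_num_seqs
```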
benchislett left a comment
Please add a short comment explaining why this is necessary. See my reply to the bot comment for context.
Otherwise LGTM
Purpose
Following the Unified Parallel Drafting PR, the `max_num_tokens` field in `gpu_model_runner.py` hasn't been adjusted. When speculative decoding is enabled (draft model or Eagle), the scheduler increases `max_num_batched_tokens` to account for speculative tokens, but `GPUModelRunner.max_num_tokens` was not reflecting this adjustment. This could cause mismatches between what the scheduler sends and what the model runner expects.

This fix increases `max_num_tokens` by `max_num_seqs` (or `num_speculative_tokens * max_num_seqs` when parallel drafting is used) when a draft model or Eagle speculative decoding is active.
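As a standalone sketch of the adjustment described above: the field names `max_num_batched_tokens`, `max_num_seqs`, and `num_speculative_tokens` follow vLLM's config objects, but `parallel_drafting` is a hypothetical flag standing in for however the runner detects parallel drafting.

```python
# Sketch of the PR's sizing logic as submitted, not the actual
# GPUModelRunner code.

def compute_max_num_tokens(
    max_num_batched_tokens: int,
    max_num_seqs: int,
    num_speculative_tokens: int,
    parallel_drafting: bool,
    specdec_enabled: bool,
) -> int:
    if not specdec_enabled:
        return max_num_batched_tokens
    if parallel_drafting:
        # Parallel drafting: every sequence may carry all its draft tokens.
        extra = num_speculative_tokens * max_num_seqs
    else:
        # Serial drafting: one extra token per sequence (the case the bot
        # review above argues should also use num_speculative_tokens).
        extra = max_num_seqs
    return max_num_batched_tokens + extra
```

For example, with `max_num_batched_tokens=8192`, `max_num_seqs=256`, and `num_speculative_tokens=3`, the serial path yields 8448 tokens while parallel drafting yields 8960.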