Integrate token-in route with vLLM v0.12.0 #1444

Merged
mikasenghaas merged 47 commits into upgrade-vllm from upgrade-vllm-with-tok-in
Dec 17, 2025

Conversation

@mikasenghaas mikasenghaas commented Dec 17, 2025

Contains recent changes from main, the token-in PR (#1422), and the migration to vLLM v0.12.0 for this endpoint, plus some general cleanup that should make future maintenance easier.


Note

Adds a token-in chat completions endpoint to the vLLM server and promotes model.experimental.lora to model.lora, updating code, configs, and validations.

  • Inference (vLLM server):
    • Add /v1/chat/completions/tokens endpoint with OpenAIServingChatWithTokens handler and wiring (router, validation, streaming, error handling).
    • Patch init_app_state to register token-in handler; delegate worker proc via vLLM’s run_api_server_worker_proc.
    • Adjust InferenceConfig: auto-set api_server_count to dp and force 1 when enable_lora=true.
  • Orchestrator:
    • When trajectory_strategy="interleaved", enable token prompts in environments to avoid retokenization discrepancies.
  • Trainer/Config Refactor:
    • Move model.experimental.lora to model.lora; remove ExperimentalConfig; update all usages (validators, model setup, ckpt/weight broadcast paths) in RL and SFT trainers and rl.py.
    • Update CI/example TOMLs to [trainer.model.lora] and related fields.
  • Docs:
    • Update trajectories guidance; streamline around chat template behavior.
  • Dependencies/Configs:
    • Bump verifiers revision; minor example/config tweaks (env IDs, W&B names, parallel settings).
  • Changelog:
    • Document model.lora move out of experimental.

Written by Cursor Bugbot for commit 1e512cc. This will update automatically on new commits.

@mikasenghaas mikasenghaas changed the base branch from main to upgrade-vllm December 17, 2025 16:41
@mikasenghaas mikasenghaas marked this pull request as ready for review December 17, 2025 16:57

if self.enable_lora:
    self.api_server_count = 1  # LoRA requires only one API server
return self
Bug: LoRA forces incompatible API server count

auto_setup_api_server_count first sets api_server_count from parallel.dp, but then unconditionally forces api_server_count = 1 when enable_lora is true. This can produce an inconsistent configuration (e.g., parallel.dp > 1 with only one API server), which likely breaks the intended DP setup and contradicts the LoRA-in-DP support implied elsewhere.
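One way to surface this conflict (a sketch only, not the project's actual fix; the function name and message are hypothetical) is to fail loudly at config time instead of silently collapsing the server count:

```python
def check_lora_api_server_count(enable_lora: bool, dp: int, api_server_count: int) -> int:
    # Hypothetical guard: rather than silently forcing one API server,
    # reject configurations where LoRA and a multi-server DP setup conflict.
    if enable_lora and api_server_count != 1:
        raise ValueError(
            f"enable_lora=True supports only api_server_count=1, but got "
            f"api_server_count={api_server_count} with dp={dp}; "
            "set api_server_count=1 explicitly or disable LoRA"
        )
    return api_server_count


print(check_lora_api_server_count(False, 4, 4))  # 4
```

This trades convenience for explicitness: the user sees why their DP setup was reduced rather than discovering a single API server at runtime.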


Member Author


This is forced by vLLM, but we should actually check; it may no longer be an issue with v0.12.

@mikasenghaas mikasenghaas merged commit 4051ad9 into upgrade-vllm Dec 17, 2025
10 checks passed
samsja pushed a commit that referenced this pull request Dec 18, 2025
* dont use enum setter for logprobs mode

* fix: stale imports

* update to torch 2.9

* init_app_state doesnt take vllm config anymore somehow

* use runnable because of CUDAGraphWrapper

* vllm now uses default seed 0

* fix import

* moe venv

* use mjun flash attn for torch 2.9 and up vllm version

* Revert "moe venv"

This reverts commit 8934ceb.

* remove some todos

* remove unused import

* Apply suggestions from code review

Signed-off-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com>

* set very high max cpu loras to patch around areal lora hack

* make flash attn optional and put uv sync extras everywhere

* Integrate token-in route with vLLM v0.12.0 (#1444)

* Move LoRA out of experimental section (#1440)

* Add Bug Bot instructions for changelog enforcement (#1441)

Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* duplicate chat completions endpoint into /generate

* serve chat with token in functionality

* use field to avoid misleading warning

* nicer error msg

* lock feature branch

* make use tokens prompt configurable

* use setter and print info

* bump

* include inference

* do not print warning log (logs all the time)

* bump

* bump + bring back warning log

* bump vf

* bump vf

* use dp=6 in wordle example

* no deepcopy and no warning

* do not tokenize on the server

* add field names so that tokens is cached and no warning of unrecognized field is shown

* bump vf

* auto install

* bump vf

* bump vf + set vllm tokenize method

* skip applying chat template

* Revert "skip applying chat template"

This reverts commit 43c6a2b.

* Revert "do not tokenize on the server"

This reverts commit 9182191.

* bring back log

* use route /v1/chat/completions/tokens

* fix log

* bump vf and make everything configurable

* bump and more informative log

* bump and make non-exact tokenization default

* use token prompts by default

* remove retokenization issue from docs

* rename class

* bump vf

* fix auto asc setup for lora

* bump vf

* bump vf

* bump vf

* bring back setter

* bump vf

* bump vf to latest prime-rl

* make custom routes v0.12.0 compatible

* monkey patch api server worker proc again to enable multi api server mode

---------

Co-authored-by: will brown <williambrown97@gmail.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>

---------

Signed-off-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com>
Co-authored-by: Mika Senghaas <mail@mikasenghaas.de>
Co-authored-by: will brown <williambrown97@gmail.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
