Integrate token-in route with vLLM v0.12.0 #1444
Merged
mikasenghaas merged 47 commits into upgrade-vllm on Dec 17, 2025
Conversation
```python
if self.enable_lora:
    self.api_server_count = 1  # LoRA requires only one API server
return self
```
Bug: LoRA forces incompatible API server count
auto_setup_api_server_count first enforces api_server_count >= parallel.dp but then unconditionally sets api_server_count = 1 when enable_lora is true. This can produce an inconsistent configuration (e.g., parallel.dp > 1 with only one API server), which likely breaks the intended DP setup and contradicts existing LoRA-in-DP support implied elsewhere.
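The ordering problem described above can be sketched as a config validator. This is a hypothetical reconstruction, not the actual prime-rl code: the class and field names (`InferenceConfig`, `parallel.dp`, `api_server_count`, `enable_lora`) are taken from the review comment, and raising instead of silently clamping is one possible fix.

```python
from dataclasses import dataclass, field


@dataclass
class ParallelConfig:
    dp: int = 1  # data-parallel degree


@dataclass
class InferenceConfig:
    parallel: ParallelConfig = field(default_factory=ParallelConfig)
    api_server_count: int = 1
    enable_lora: bool = False

    def __post_init__(self):
        self.auto_setup_api_server_count()

    def auto_setup_api_server_count(self):
        # Step 1: scale API servers with data parallelism.
        if self.api_server_count < self.parallel.dp:
            self.api_server_count = self.parallel.dp
        # Step 2: the reviewed code unconditionally clamps to 1 for LoRA,
        # silently undoing step 1. Raising instead surfaces the conflict.
        if self.enable_lora:
            if self.parallel.dp > 1:
                raise ValueError(
                    "enable_lora forces api_server_count=1, which is "
                    f"inconsistent with parallel.dp={self.parallel.dp}"
                )
            self.api_server_count = 1
        return self
```

With this variant, `InferenceConfig(parallel=ParallelConfig(dp=2), enable_lora=True)` fails loudly instead of producing a DP setup served by a single API server.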
Member
Author
This is forced by vLLM. But admittedly we should check; maybe it's no longer an issue with v0.12.
samsja pushed a commit that referenced this pull request on Dec 18, 2025
* dont use enum setter for logprobs mode
* fix: stale imports
* update to torch 2.9
* init_app_state doesnt take vllm config anymore somehow
* use runnable because of CUDAGraphWrapper
* vllm now uses default seed 0
* fix import
* moe venv
* use mjun flash attn for torch 2.9 and up vllm version
* Revert "moe venv" (reverts commit 8934ceb)
* remove some todos
* remove unused import
* Apply suggestions from code review (Signed-off-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com>)
* set very high max cpu loras to patch around areal lora hack
* make flash attn optional and put uv sync extras everywhere
* Integrate token-in route with vLLM v0.12.0 (#1444)
* Move LoRA out of experimental section (#1440)
* Add Bug Bot instructions for changelog enforcement (#1441) (Co-authored-by: Cursor Agent <cursoragent@cursor.com>)
* duplicate chat completions endpoint into /generate
* serve chat with token-in functionality
* use field to avoid misleading warning
* nicer error msg
* lock feature branch
* make use tokens prompt configurable
* use setter and print info
* bump
* include inference
* do not print warning log (logs all the time)
* bump
* bump + bring back warning log
* bump vf
* bump vf
* use dp=6 in wordle example
* no deepcopy and no warning
* do not tokenize on the server
* add field names so that tokens is cached and no warning of unrecognized field is shown
* bump vf
* auto install
* bump vf
* bump vf + set vllm tokenize method
* skip applying chat template
* Revert "skip applying chat template" (reverts commit 43c6a2b)
* Revert "do not tokenize on the server" (reverts commit 9182191)
* bring back log
* use route /v1/chat/completions/tokens
* fix log
* bump vf and make everything configurable
* bump and more informative log
* bump and make non-exact tokenization default
* use token prompts by default
* remove retokenization issue from docs
* rename class
* bump vf
* fix auto asc setup for lora
* bump vf
* bump vf
* bump vf
* bring back setter
* bump vf
* bump vf to latest prime-rl
* make custom routes v0.12.0 compatible
* monkey patch api server worker proc again to enable multi api server mode

Signed-off-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com>
Co-authored-by: Mika Senghaas <mail@mikasenghaas.de>
Co-authored-by: will brown <williambrown97@gmail.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Contains recent changes on main, the token-in PR (#1422), and the migration to vLLM v0.12.0 for this endpoint, plus some general cleanup that should make future maintenance easier.
Note
Adds a token-in chat completions endpoint to the vLLM server and promotes `model.experimental.lora` to `model.lora`, updating code, configs, and validations.

- New `/v1/chat/completions/tokens` endpoint with an `OpenAIServingChatWithTokens` handler and wiring (router, validation, streaming, error handling).
- `init_app_state` now registers the token-in handler; worker procs are delegated via vLLM's `run_api_server_worker_proc`.
- `InferenceConfig`: auto-set `api_server_count` to `dp`, and force `1` when `enable_lora=true`.
- With `trajectory_strategy="interleaved"`, enable token prompts in environments to avoid retokenization discrepancies.
- Rename `model.experimental.lora` to `model.lora`; remove `ExperimentalConfig`; update all usages (validators, model setup, ckpt/weight broadcast paths) in the RL and SFT trainers and `rl.py`.
- Configs updated to `[trainer.model.lora]` and related fields.
- Bump the `verifiers` revision; minor example/config tweaks (env IDs, W&B names, parallel settings).
- Docs updated for the `model.lora` move out of experimental.

Written by Cursor Bugbot for commit 1e512cc. This will update automatically on new commits.
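The retokenization discrepancy that token prompts avoid can be shown with a toy example. Nothing below is from the PR: the vocabulary and the `encode`/`decode` functions are made up, but they illustrate why decoding sampled tokens to text and re-encoding on the server can produce different token ids than the ones originally sampled.

```python
# Toy tokenizer with one merged token; greedy longest-match encoding means
# decode-then-re-encode need not round-trip to the same token ids.
VOCAB = {"ab": 0, "a": 1, "b": 2}
INV = {v: k for k, v in VOCAB.items()}


def encode(text):
    # Greedy longest-match: prefer the 2-char token "ab" when it fits.
    ids, i = [], 0
    while i < len(text):
        if text[i:i + 2] in VOCAB:
            ids.append(VOCAB[text[i:i + 2]])
            i += 2
        else:
            ids.append(VOCAB[text[i]])
            i += 1
    return ids


def decode(ids):
    return "".join(INV[i] for i in ids)


# Suppose the sampler emitted "a" then "b" as two separate tokens...
sampled = [VOCAB["a"], VOCAB["b"]]   # [1, 2]
# ...but re-encoding the decoded text merges them into the single token "ab".
reencoded = encode(decode(sampled))  # [0]
assert sampled != reencoded
# Sending token ids directly (token-in) sidesteps this mismatch entirely.
```

This is why the PR enables token prompts by default: the trainer's sampled ids reach the server unchanged instead of going through a lossy text round-trip.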