Conversation

Force-pushed from `b5c33f2` to `c56da9d`.
```python
if engine_prompts[0]["prompt_token_ids"] != request.tokens:
    logger.warning(
        "Prompt tokens provided in request do not match the engine prompt "
        "tokens. This may happen due to retokenization discrepancies in "
        "multi-turn conversations. Since you are using the "
        "/v1/chat/completions/tokens endpoint, we assume you want this "
        "behavior and use the provided prompt tokens. If this is undesired, "
        "use the standard /v1/chat/completions endpoint instead."
    )
    logger.debug(f"engine_prompt_tokens:\n{engine_prompts[0]['prompt_token_ids']}")
    logger.debug(f"request_tokens:\n{request.tokens}")

engine_prompts[0]["prompt_token_ids"] = request.tokens
```
If I'm understanding correctly, we process the chat completion request just to throw it away at the end and replace it with the passed-in token IDs?

What do we think about just directly using the passed token IDs and ignoring the processing entirely? It would make this simpler, and we wouldn't have to copy over the chat completion code every time we upgrade vLLM.
Yeah, I actually had this implemented at commit 43c6a2b but decided against it because:

- it wasn't much faster
- I cannot print the warning logs (which are actually quite a nice sanity check; e.g. if this log is shown on every request, it probably means something went wrong in the pre-tokenization. We should get this log sometimes, but not all the time)
- I don't think we can ignore the processing entirely; e.g. the handler depends on having access to `conversation`, which requires applying the chat template anyway. So intercepting the engine prompt after all processing has been done seemed like the easiest fix that handles all cases (e.g. we don't have to handle partial processing, worry about the Harmony code path, etc.)

If you have an idea for how to not override the whole method but get more of a "monkey-patch" type behavior, that'd be ideal. But I also feel like this might change in the new version anyway, since I think they quite heavily refactored the API server, so things will look quite different either way and we'll likely have to figure this out again in that new world.
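For what it's worth, the "monkey-patch" idea could be sketched roughly like this: wrap the existing method so the token override runs after all upstream processing, without copying its body. Everything below (`BaseServing`, `build_engine_prompt`) is a stand-in for the real vLLM classes, purely for illustration of the wrapping pattern, not the PR's actual implementation.

```python
# Sketch of wrapping an existing method instead of overriding it wholesale.
# BaseServing / build_engine_prompt are hypothetical stand-ins for vLLM's
# serving classes; only the decorator pattern itself is the point here.

class BaseServing:
    def build_engine_prompt(self, request):
        # stand-in for chat-template application + tokenization
        return {"prompt_token_ids": [1, 2, 3]}

def with_token_override(build):
    def wrapped(self, request):
        engine_prompt = build(self, request)  # run all upstream processing
        tokens = request.get("tokens")
        if tokens is not None and engine_prompt["prompt_token_ids"] != tokens:
            # the real server would log the retokenization warning here
            engine_prompt["prompt_token_ids"] = tokens
        return engine_prompt
    return wrapped

# patch in place; upstream code keeps calling build_engine_prompt as before
BaseServing.build_engine_prompt = with_token_override(BaseServing.build_engine_prompt)

serving = BaseServing()
result = serving.build_engine_prompt({"tokens": [9, 8, 7]})
print(result)  # {'prompt_token_ids': [9, 8, 7]}
```

The upside is that only the small override needs to be carried across vLLM upgrades; the downside, as noted above, is that it only works if there is a single seam where the engine prompt is produced.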
Force-pushed from `7d5264a` to `e627788`.
* add math python example
* use ac
* update rl instructinos
* use 8 gpus and no ac
* use offline filtering and val split because inference a bit too quick
* 12k seq len, 2k max tokens, 300 steps
* use hendrycks math with 512 tokens/turn
* zero completion on error
* log error rate
* update math python to use vf version
* fix arg
* fix completion mask type
* fix tests
* handle empty trajectories case
* add changelog
* bump vf
* log err rate and err distribution
* fix instance
* bump vf
* fix dropna only on err col
* also mask out first turn in interleaved mode
* handle skipped rollouts
This reverts commit c1aa492.
```python
if handler is None:
    return base(raw_request).create_error_response(
        message="The model does not support Chat Completions API"
    )
```
Bug: Error response not wrapped in `JSONResponse` with status code

When `handler` is `None`, the code returns `base(raw_request).create_error_response(...)` directly, which returns an `ErrorResponse` Pydantic model. This is inconsistent with lines 191-192, which properly wrap `ErrorResponse` in `JSONResponse(content=generator.model_dump(), status_code=generator.error.code)`. The direct return of `ErrorResponse` causes FastAPI to serialize it with a 200 OK status code instead of an appropriate error status code, making the error response appear successful to clients.
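The fix the bot describes mirrors the wrapping reportedly done elsewhere in the file. A minimal sketch of that pattern, with stand-in dataclasses in place of the real FastAPI `JSONResponse` and vLLM `ErrorResponse` types (the stand-ins only model the attributes the review comment mentions):

```python
# Stand-in types modeling the shapes referenced in the review comment; the
# real ErrorResponse is a Pydantic model and JSONResponse comes from FastAPI.
from dataclasses import dataclass, field

@dataclass
class ErrorInfo:
    code: int = 400
    message: str = ""

@dataclass
class ErrorResponse:
    error: ErrorInfo = field(default_factory=ErrorInfo)

    def model_dump(self):
        return {"error": {"code": self.error.code, "message": self.error.message}}

@dataclass
class JSONResponse:
    content: dict
    status_code: int = 200

def error_to_json_response(generator: ErrorResponse) -> JSONResponse:
    # wrap the error body and propagate its code as the HTTP status,
    # instead of returning the model directly (which would serialize as 200)
    return JSONResponse(content=generator.model_dump(),
                        status_code=generator.error.code)

resp = error_to_json_response(
    ErrorResponse(ErrorInfo(400, "The model does not support Chat Completions API"))
)
print(resp.status_code)  # 400
```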
This PR implements a custom token-in chat completions endpoint in our vLLM inference server. The server now additionally exposes `/v1/chat/completions/tokens`. It is basically a copy of the regular `/v1/chat/completions` endpoint, with the difference that it requires the request to contain a field `tokens`, which is the list of tokens used to build the engine prompt. This required two overrides:

- `ChatCompletionRequestWithTokens` extends `ChatCompletionRequest` with a `tokens` field
- `OpenAIServingChatWithTokens` extends `OpenAIServingChat` with a method `create_chat_completion_with_tokens`, which serves `/v1/chat/completions/tokens`

It also bumps `verifiers` to a commit including #626, which integrates the token-in endpoint into the multi-turn rollout flow.

Training Example
Before

After

In the `wordle` example, we observe significant reductions in KL mismatch.

Minimal Example
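A hedged sketch of what a request body for the new endpoint might look like, based only on the description above (a standard chat-completions payload plus the required `tokens` field); the model name and token IDs are made up, and the exact payload shape is an assumption:

```python
# Hypothetical request body for POST /v1/chat/completions/tokens.
# The `tokens` field carries the pre-tokenized engine prompt; all concrete
# values here are illustrative, not taken from the PR.
import json

payload = {
    "model": "my-model",  # hypothetical model name
    "messages": [{"role": "user", "content": "Hello"}],
    "tokens": [1, 15043, 2],  # pre-tokenized prompt ids (made up)
}

body = json.dumps(payload)
decoded = json.loads(body)
print(sorted(decoded.keys()))  # ['messages', 'model', 'tokens']
```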
GitHub Issue: #1421
Linear Issue: Resolves PRIMERL-243
Note

Adds a vLLM `/v1/chat/completions/tokens` endpoint for token-in requests and auto-sets `api_server_count=1` when LoRA is enabled; enables interleaved rollouts to use token prompts and bumps `verifiers`.

- Add `/v1/chat/completions/tokens` in `src/prime_rl/inference/vllm/server.py` using `OpenAIServingChatWithTokens` from `src/prime_rl/inference/vllm/serving_chat_with_tokens.py`.
- `src/prime_rl/inference/config.py`: Auto-set `api_server_count=1` when `enable_lora` is true; otherwise ensure `>= parallel.dp`.
- `CHANGELOG.md`: Document LoRA API server limitation.
- `src/prime_rl/orchestrator/orchestrator.py`: For `trajectory_strategy="interleaved"`, enable token prompts via `env.set_interleaved_rollouts(True)`.
- `docs/trajectories.md`: Update trajectory guidance; remove retokenization section, clarify chat template behavior.
- `configs/alphabet_sort/rl.toml`: Add W&B block; update env id; minor structure tweaks.
- `examples/wordle/rl.toml`: Set `inference.parallel.dp = 6`.
- `pyproject.toml` / `uv.lock`: Bump `verifiers` to `ca75d04` (v0.1.8.post2).

Written by Cursor Bugbot for commit beb2c29.
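The `api_server_count` rule from the summary could be sketched as a small standalone function (the function name and signature are hypothetical; the real logic lives in `src/prime_rl/inference/config.py`):

```python
def resolve_api_server_count(enable_lora: bool, requested: int, dp: int) -> int:
    """Sketch of the config rule described above: LoRA forces a single API
    server; otherwise there must be at least one server per data-parallel
    rank. Hypothetical helper, for illustration only."""
    if enable_lora:
        return 1
    return max(requested, dp)

print(resolve_api_server_count(True, 4, 2))   # 1 (LoRA forces one server)
print(resolve_api_server_count(False, 1, 4))  # 4 (bumped up to parallel.dp)
```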