Token-in vLLM endpoint#1422

Merged
mikasenghaas merged 46 commits into main from tok-in-out on Dec 17, 2025
Conversation

@mikasenghaas (Member) commented Dec 12, 2025

This PR adds a custom token-in chat completions endpoint to our vLLM inference server. The server now additionally exposes /v1/chat/completions/tokens. It is essentially a copy of the regular /v1/chat/completions endpoint, with the difference that it requires the request to contain a field tokens: the list of token IDs used to build the engine prompt. This required two overrides:

  • ChatCompletionRequestWithTokens extends ChatCompletionRequest with a tokens field
  • OpenAIServingChatWithTokens extends OpenAIServingChat by a method create_chat_completion_with_tokens
  • The endpoint is registered at /v1/chat/completions/tokens
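To make the shape of these overrides concrete, here is a minimal, hypothetical sketch using stdlib dataclasses in place of the actual vLLM pydantic models (the class names mirror the PR, but none of this is the real vLLM code):

```python
from dataclasses import dataclass, field

# Toy stand-in for vLLM's ChatCompletionRequest (not the real class).
@dataclass
class ChatCompletionRequest:
    model: str
    messages: list
    max_tokens: int = 1024

# The token-in request is the regular request plus a `tokens` field:
# the server uses these IDs as the engine prompt instead of
# re-tokenizing `messages`.
@dataclass
class ChatCompletionRequestWithTokens(ChatCompletionRequest):
    tokens: list = field(default_factory=list)

req = ChatCompletionRequestWithTokens(
    model="Qwen/Qwen3-4B-Instruct-2507",
    messages=[{"role": "user", "content": "hi"}],
    tokens=[151644, 872, 198],
)
print(req.tokens)
```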

Note that the inference server currently tokenizes the inputs redundantly. A commit in this PR skips this tokenization, but I did not see it speed anything up. I decided to remove it again because the current override is very lightweight/unintrusive and will be easy to maintain in the future.

It also bumps verifiers to a commit that includes #626, which integrates the token-in endpoint into the multi-turn rollout flow.

Training Example

uv run rl @ examples/alphabet_sort/rl.toml --max-steps 50

Before

Screenshot 2025-12-12 at 5 27 40 PM

After

Screenshot 2025-12-12 at 9 11 42 PM

In the wordle example, we observe a significant reduction in KL mismatch

Screenshot 2025-12-15 at 4 22 21 PM

Minimal Example

Screenshot 2025-12-15 at 4 33 49 PM
uv run inference --model.name Qwen/Qwen3-4B-Instruct-2507
from typing import cast

from httpx import Client
from openai import OpenAI
from openai.types.chat import ChatCompletion
from openai.types.chat.chat_completion_message_param import ChatCompletionMessageParam
from transformers import AutoTokenizer, PreTrainedTokenizer

model_name = "Qwen/Qwen3-4B-Instruct-2507"
add_generation_prompt = True
tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(model_name)
base_url = "http://localhost:8000"
oai_client = OpenAI(base_url=f"{base_url}/v1")
client = Client(base_url=base_url)

prompt: list[ChatCompletionMessageParam] = [{"role": "user", "content": "Hello, how are you?"}]

# Get prompt tokens from local tokenizer
local_prompt_tokens = tokenizer.apply_chat_template(cast(list[dict[str, str]], prompt), add_generation_prompt=add_generation_prompt)
print(f"✅ Got prompt tokens from local tokenizer:\n{local_prompt_tokens}")

# Ensure server is healthy
assert client.get(f"{base_url}/health").status_code == 200
print("✅ Checked server health")

# Get prompt tokens from remote vLLM server via client
response = client.post(
    f"{base_url}/tokenize",
    json={"model": model_name, "messages": prompt, "add_generation_prompt": add_generation_prompt},
)
remote_prompt_tokens = response.json()["tokens"]
print(f"✅ Got prompt tokens from remote vLLM server:\n{remote_prompt_tokens}")

# Get prompt/completion tokens via chat completions API
extra_body = dict(return_token_ids=True, prompt_logprobs=True)
chat_completion_args = dict(
    model=model_name, messages=prompt, max_tokens=1024, temperature=0.0, logprobs=True, extra_body=extra_body
)
response = oai_client.chat.completions.create(**chat_completion_args)  # type: ignore
response_dict = response.model_dump()
chat_completion_prompt_tokens = response_dict["prompt_token_ids"]
chat_completion_tokens = response_dict["choices"][0]["token_ids"]
print(
    f"✅ Got response from /v1/chat/completions\nPrompt: {chat_completion_prompt_tokens}\nCompletion: {chat_completion_tokens}"
)


# Get prompt/completion tokens via the chat completions w/ tokens API
extra_body = dict(return_token_ids=True, prompt_logprobs=True, tokens=local_prompt_tokens)
# IMPORTANT: We need to merge the extra_body into the request args (normally this happens inside the OAI client) for vLLM to have access to it
chat_completion_args.pop("extra_body")
chat_completion_with_tokens_args = {**chat_completion_args, **extra_body}
response = oai_client.post("/chat/completions/tokens", body=chat_completion_with_tokens_args, cast_to=ChatCompletion)
response_dict = response.model_dump()
token_endpoint_prompt_tokens = response_dict["prompt_token_ids"]
token_endpoint_completion_tokens = response_dict["choices"][0]["token_ids"]
print(
    f"✅ Got response from /v1/chat/completions/tokens\nPrompt: {token_endpoint_prompt_tokens}\nCompletion: {token_endpoint_completion_tokens}"
)

assert local_prompt_tokens == remote_prompt_tokens == chat_completion_prompt_tokens == token_endpoint_prompt_tokens, (
    "Prompt tokens do not match"
)
print(
    "✅ Prompt tokens match between local tokenizer, vLLM /tokenize, /v1/chat/completions, and /v1/chat/completions/tokens"
)

assert chat_completion_tokens == token_endpoint_completion_tokens, "Completion tokens do not match"
print("✅ Completion tokens match between /v1/chat/completions and /v1/chat/completions/tokens")

# New request
new_prompt: list[ChatCompletionMessageParam] = [{"role": "user", "content": "What is the capital of France?"}]
tokens = tokenizer.apply_chat_template(cast(list[dict[str, str]], new_prompt), add_generation_prompt=True)
response = oai_client.post(
    "/chat/completions/tokens", body={**chat_completion_with_tokens_args, "tokens": tokens}, cast_to=ChatCompletion
)
print(response.choices[0].message.content)

GitHub Issue: #1421
Linear Issue: Resolves PRIMERL-243


Note

Adds a vLLM /v1/chat/completions/tokens endpoint for token-in requests and auto-sets api_server_count=1 when LoRA is enabled; enables interleaved rollouts to use token prompts and bumps verifiers.

  • Inference Server (vLLM):
    • New endpoint: Serve token-in chat completions at /v1/chat/completions/tokens in src/prime_rl/inference/vllm/server.py using OpenAIServingChatWithTokens from src/prime_rl/inference/vllm/serving_chat_with_tokens.py.
    • Registers route with validation/streaming, loads chat template, and integrates with app state.
  • Config/Runtime:
    • src/prime_rl/inference/config.py: Auto-set api_server_count=1 when enable_lora is true; otherwise ensure >= parallel.dp.
    • CHANGELOG.md: Document LoRA API server limitation.
  • Orchestrator:
    • src/prime_rl/orchestrator/orchestrator.py: For trajectory_strategy="interleaved", enable token prompts via env.set_interleaved_rollouts(True).
  • Docs:
    • docs/trajectories.md: Update trajectory guidance; remove retokenization section, clarify chat template behavior.
  • Examples/Configs:
    • configs/alphabet_sort/rl.toml: Add W&B block; update env id; minor structure tweaks.
    • examples/wordle/rl.toml: Set inference.parallel.dp = 6.
  • Dependencies:
    • pyproject.toml/uv.lock: Bump verifiers to ca75d04 (v0.1.8.post2).
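The api_server_count rule in the Config/Runtime bullet above could be sketched roughly as follows (the function name and signature are assumptions for illustration; the real logic lives in src/prime_rl/inference/config.py):

```python
# Hypothetical sketch of the config rule, not the actual implementation.
def resolve_api_server_count(enable_lora: bool, requested: int, dp: int) -> int:
    if enable_lora:
        # LoRA is limited to a single API server process (see CHANGELOG.md).
        return 1
    # Otherwise ensure at least one API server per data-parallel replica.
    return max(requested, dp)

print(resolve_api_server_count(True, 4, 2))   # 1
print(resolve_api_server_count(False, 1, 4))  # 4
```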

Written by Cursor Bugbot for commit beb2c29.

@mikasenghaas mikasenghaas force-pushed the tok-in-out branch 4 times, most recently from b5c33f2 to c56da9d on December 15, 2025 07:54
@mikasenghaas mikasenghaas requested a review from samsja December 15, 2025 16:37
@mikasenghaas mikasenghaas marked this pull request as ready for review December 16, 2025 21:35
Comment on lines +124 to +131
if engine_prompts[0]["prompt_token_ids"] != request.tokens:
    logger.warning(
        "Prompt tokens provided in request do not match the engine prompt tokens. This may happen due to retokenization discrepancies in multi-turn conversations. Since you are using the /v1/chat/completions/tokens endpoint, we assume you want this behavior and use the provided prompt tokens. If this is undesired, use the standard /v1/chat/completions endpoint instead."
    )
    logger.debug(f"engine_prompt_tokens:\n{engine_prompts[0]['prompt_token_ids']}")
    logger.debug(f"request_tokens:\n{request.tokens}")

engine_prompts[0]["prompt_token_ids"] = request.tokens
Member
If I'm understanding correctly, we process the chat completion request just to throw it away at the end and replace it with the passed-in token IDs?

What do we think about just directly using the passed token IDs and ignoring the processing entirely? It would make things simpler, and we wouldn't have to copy over the chat completion code every time we upgrade vLLM.

@mikasenghaas (Member, Author) Dec 17, 2025

Yeah, I actually had this implemented at commit 43c6a2b but decided against it because:

  • it wasn't much faster
  • I cannot print the warning logs (which are actually quite a nice sanity check; e.g. if this log shows on every request, it probably means something went wrong in the pre-tokenization; we should see this log sometimes, but not all the time)
  • I don't think we can ignore the processing entirely; e.g. the handler depends on having access to conversation, which requires applying the chat template anyway, so intercepting the engine prompt after all processing has been done seemed like the easiest fix that would handle all cases (e.g. we don't have to handle partial processing, worry about the harmony code path, etc.)

If you have an idea on how to not override the whole method but instead get more of a "monkey-patch" type behavior, that would be ideal. But I also feel like this might change in the new version anyway, because I think they quite heavily refactored the API server, so things will look quite different either way and we will likely have to figure it out in that new world anyway.
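For illustration, the "monkey-patch" idea could look roughly like this on a toy class (this is purely a sketch; ToyServingChat and build_engine_prompt are made-up stand-ins, not the real vLLM OpenAIServingChat API):

```python
# Toy stand-in for the serving class; pretend tokenization only.
class ToyServingChat:
    def build_engine_prompt(self, messages):
        # Pretend this applies the chat template and tokenizes.
        return {"prompt_token_ids": [1, 2, 3]}

def patch_with_tokens(cls):
    """Wrap the original method; swap in caller-provided tokens afterwards."""
    original = cls.build_engine_prompt

    def patched(self, messages, tokens=None):
        engine_prompt = original(self, messages)
        if tokens is not None and engine_prompt["prompt_token_ids"] != tokens:
            # Retokenization mismatch: trust the caller-provided tokens.
            engine_prompt["prompt_token_ids"] = tokens
        return engine_prompt

    cls.build_engine_prompt = patched

patch_with_tokens(ToyServingChat)
print(ToyServingChat().build_engine_prompt([], tokens=[9, 8, 7]))
# → {'prompt_token_ids': [9, 8, 7]}
```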

@samsja (Member) left a comment:

lgtm

* add math python example

* use ac

* update rl instructions

* use 8 gpus and no ac

* use offline filtering and val split because inference a bit too quick

* 12k seq len, 2k max tokens, 300 steps

* use hendrycks math with 512 tokens/turn

* zero completion on error

* log error rate

* update math python to use vf version

* fix arg

* fix completion mask type

* fix tests

* handle empty trajectories case

* add changelog

* bump vf

* log err rate and err distribution

* fix instance

* bump vf

* fix dropna only on err col

* also mask out first turn in interleaved mode

* handle skipped rollouts
@mikasenghaas mikasenghaas merged commit 8dccae6 into main Dec 17, 2025
6 checks passed
if handler is None:
    return base(raw_request).create_error_response(
        message="The model does not support Chat Completions API"
    )

Bug: Error response not wrapped in JSONResponse with status code

When handler is None, the code returns base(raw_request).create_error_response(...) directly, which returns an ErrorResponse Pydantic model. This is inconsistent with lines 191-192 which properly wrap ErrorResponse in JSONResponse(content=generator.model_dump(), status_code=generator.error.code). The direct return of ErrorResponse causes FastAPI to serialize it with a 200 OK status code instead of an appropriate error status code, making the error response appear successful to clients.
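The suggested fix is to return the error payload with an explicit status code instead of the bare pydantic model. A sketch of the pattern, using a minimal stand-in for JSONResponse (the real one is fastapi.responses.JSONResponse; attribute names here follow the description above):

```python
# Minimal stand-in for fastapi.responses.JSONResponse, purely to
# illustrate the pattern: attach an explicit status code rather than
# returning the bare error model (which FastAPI serializes as 200 OK).
class JSONResponse:
    def __init__(self, content, status_code=200):
        self.content = content
        self.status_code = status_code

def make_error_response(error_dict):
    # Mirrors: JSONResponse(content=generator.model_dump(),
    #                       status_code=generator.error.code)
    return JSONResponse(content=error_dict, status_code=error_dict["error"]["code"])

resp = make_error_response(
    {"error": {"code": 400, "message": "The model does not support Chat Completions API"}}
)
print(resp.status_code)  # → 400
```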

