Support for token-in vLLM endpoint #626

Merged

mikasenghaas merged 84 commits into main from tok-in-out on Dec 16, 2025
Conversation

@mikasenghaas (Member) commented Dec 12, 2025

Description

This PR integrates the custom token-in /v1/chat/completions/tokens endpoint from PRIME-RL's inference server (introduced in #1422) with verifiers, so that PRIME-RL can do multi-turn RL without mismatches caused by retokenization.

The main changes are:

  • Make interleaved_rollouts (and any other extra env kwargs) configurable via vf-eval
  • If interleaved_rollouts is configured, get_model_response correctly sets up the prompt tokens, sampling args, and client to make a request to the custom endpoint

We decided on the following defaults for reliably building prompt tokens:

  • Use vLLM for tokenization (API server capacity can be scaled with --api-server-count so tokenization does not become a bottleneck)
  • Tokenize the env_response in isolation and compute suffix tokens (tokens the chat template inserts between messages but that the LLM never produced) once on dummy messages, caching the result for later use. This should be safe in 99.9% of cases.
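The suffix-token caching described above can be sketched as follows. This is an illustrative toy, not the PR's implementation: apply_chat_template and tokenize are stand-ins (the real code calls vLLM for tokenization), and the helper names are hypothetical.

```python
from functools import lru_cache

def apply_chat_template(messages: list[dict]) -> str:
    # Toy chat template: wraps each message in role tags.
    return "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)

def tokenize(text: str) -> list[str]:
    # Character-level "tokens" keep the sketch self-contained.
    return list(text)

@lru_cache(maxsize=None)
def cached_wrapper(role: str) -> tuple[list[str], list[str]]:
    # Compute the template's prefix/suffix tokens once on a dummy message,
    # using a marker to locate where the content lands, and cache the result.
    marker = "XCONTENTX"
    templated = apply_chat_template([{"role": role, "content": marker}])
    prefix, suffix = templated.split(marker)
    return tokenize(prefix), tokenize(suffix)

def env_response_tokens(message: dict) -> list[str]:
    # Tokenize the message content in isolation and splice in the cached
    # wrapper tokens, avoiding a full retokenization of the conversation.
    prefix, suffix = cached_wrapper(message["role"])
    return prefix + tokenize(message["content"]) + suffix
```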

Examples

Default behavior is unaffected, e.g. running math-python against the OpenAI API:

uv run vf-eval math-python -n1 -r1 -v 

To use the token-in endpoint, start a custom vLLM server from PRIME-RL:

uv run inference --model.name Qwen/Qwen3-4B-Instruct-2507 --enable-auto-tool-choice --tool-call-parser hermes --enable-log-requests
uv run vf-eval math-python -n1 -r1 -b http://localhost:8000/v1 -m Qwen/Qwen3-4B-Instruct-2507 -v -x '{"interleaved_rollouts": true}' 
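The -x value is a JSON object. A minimal sketch of how such a flag might be parsed (hypothetical helper, not the PR's actual code):

```python
import json

def parse_extra_env_kwargs(raw: str) -> dict:
    # Parse the JSON string passed on the command line, e.g.
    # '{"interleaved_rollouts": true}', rejecting non-object values.
    value = json.loads(raw)
    if not isinstance(value, dict):
        raise ValueError("--extra-env-kwargs must be a JSON object")
    return value
```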

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Adds interleaved rollouts using a vLLM token-in endpoint with pre-tokenized prompts, and introduces CLI-configurable extra environment kwargs applied at runtime.

  • Core (Environment):
    • Implement interleaved rollouts path in get_model_response using custom /v1/chat/completions/tokens with pre-tokenized prompt_ids and normalized sampling args.
    • Add overlong-prompt error handler decorator and refactor arg resolution/sampling normalization.
    • New setters: set_kwargs, set_interleaved_rollouts (with warning).
  • Token utilities:
    • New verifiers/utils/token_utils.py with tokenize_vllm, get_prompt_ids, and prepare_sampling_args_for_token_prompts (cached suffix handling, overlap logic, tokens client copy).
  • CLI/Config:
    • Add --extra-env-kwargs to vf-eval; plumb through EvalConfig.extra_env_kwargs and apply via vf_env.set_kwargs in run_evaluation.
  • EnvGroup:
    • Add set_interleaved_rollouts to propagate to sub-envs.
  • Types:
    • Make State.client and State.model required (non-optional).
  • Tests:
    • Update tests/test_eval_cli.py to include extra_env_kwargs arg and validate sampling args precedence.

Written by Cursor Bugbot for commit 75fa695.

@mikasenghaas mikasenghaas requested a review from snimu December 15, 2025 16:37
@mikasenghaas mikasenghaas marked this pull request as ready for review December 15, 2025 20:34
@snimu (Contributor) left a comment:

Looks really great to me :)

@willccbb willccbb marked this pull request as draft December 15, 2025 22:48
@willccbb (Member) commented:

@mikasenghaas Another thought -- I'm not sure how much sense it makes to have the tokenizer pool managed at the verifiers layer, seems spiritually similar to inference DP which is auto-managed by vLLM / prime-rl... ideally there is always only a single tokenizer endpoint available, and if replication is needed to manage load, this can be behind the endpoint

if config.extra_env_kwargs:
    logger.info(f"Setting extra environment kwargs: {config.extra_env_kwargs}")
    for k, v in config.extra_env_kwargs.items():
        setattr(vf_env, k, v)
Bug: EnvGroup sub-environments miss interleaved_rollouts propagation

Using setattr to set extra_env_kwargs bypasses the set_interleaved_rollouts method in EnvGroup. When an EnvGroup is loaded and interleaved_rollouts is set via extra_env_kwargs, only the group's attribute is updated, but sub-environments remain with interleaved_rollouts=False. Since EnvGroup.rollout() delegates to sub-environments, and each sub-environment's get_model_response checks its own self.interleaved_rollouts, the token-in feature silently won't work for EnvGroup environments.
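A sketch of the suggested direction: route extra env kwargs through setter methods when they exist, so an EnvGroup can propagate the flag to its sub-environments. The class and helper names below are simplified stand-ins for the verifiers types, not the actual fix.

```python
class Env:
    def __init__(self):
        self.interleaved_rollouts = False

    def set_interleaved_rollouts(self, value: bool) -> None:
        self.interleaved_rollouts = value

class EnvGroup(Env):
    def __init__(self, envs: list[Env]):
        super().__init__()
        self.envs = envs

    def set_interleaved_rollouts(self, value: bool) -> None:
        super().set_interleaved_rollouts(value)
        for env in self.envs:  # propagate to sub-envs, unlike bare setattr
            env.set_interleaved_rollouts(value)

def apply_extra_env_kwargs(vf_env: Env, kwargs: dict) -> None:
    # Prefer a set_<name> method when the env defines one; fall back to setattr.
    for k, v in kwargs.items():
        setter = getattr(vf_env, f"set_{k}", None)
        if callable(setter):
            setter(v)
        else:
            setattr(vf_env, k, v)
```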


@lru_cache(maxsize=None)
def get_tokens_client(client: AsyncOpenAI) -> AsyncOpenAI:
    logger.debug("Lazily copying OpenAI client for requests to /tokenize API")
    url_without_v1 = str(client.base_url).replace("/v1/", "")
Bug: URL manipulation fails without trailing slash

The replace("/v1/", "") operation only works when the base URL includes a trailing slash after /v1. If a user configures their vLLM server with base_url="http://localhost:8000/v1" (no trailing slash), the replacement doesn't match and the URL remains unchanged. The tokenize request would then be sent to /v1/tokenize instead of /tokenize, causing the request to fail with a confusing 404 or routing error.
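A slash-agnostic way to strip the /v1 segment, sketched as a standalone helper (name hypothetical):

```python
def strip_v1(base_url: str) -> str:
    # Drop any trailing slash first, then remove a trailing "/v1" segment,
    # so both ".../v1" and ".../v1/" map to the server root.
    return base_url.rstrip("/").removesuffix("/v1")
```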


cursor[bot] left a comment (marked as outdated).

@mikasenghaas mikasenghaas merged commit 4de7908 into main Dec 16, 2025
5 checks passed