RLM: Fix trajectory collision #786

Merged
snimu merged 15 commits into main from sebastian/rlm-sub-llm-call-path-2026-01-25
Jan 28, 2026

Conversation

@snimu (Contributor) commented Jan 25, 2026

Description

  • Fixes interleaved prompt-ID computation in RLMEnv (root cause of /tokenize empty-messages errors). Sub-LLM calls now go through Environment.get_model_response using a fake state with an empty trajectory and mirrored sampling_args/oai_tools, so prompt-ID computation can't be polluted. The old /chat/completions/tokens path and explicit tokenization are gone for sub-LLMs
  • include_sub_llm_in_trajectory default is False, and interleaving is explicitly disallowed when it’s True (guards in set_interleaved_rollouts and setup_state)
  • llm_batch is now strings‑only (enforced + documented). Non‑string prompts return an error message
  • Removed sub‑LLM message normalization and sampling‑arg normalization helpers; sampling args are now just defensively copied and normalization is handled by get_model_response
  • RLM‑secrets dataset updates: add example_id/task, remove eval dataset creation, README updated with eval guidance
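The fake-state approach from the first bullet could look roughly like the sketch below. All names here (get_model_response, the state keys, sub_llm_call) are assumptions for illustration, not the repository's actual API:

```python
import copy

# Hypothetical sketch of the fake-state approach: give each sub-LLM call an
# isolated state with an empty trajectory and deep-copied sampling args, so
# prompt-ID computation that hashes the trajectory cannot collide with (or
# mutate) the parent rollout. Real field/function names may differ.
def make_sub_llm_state(parent_state: dict) -> dict:
    """Build an isolated state for a sub-LLM call."""
    return {
        "trajectory": [],  # empty: nothing from the parent rollout leaks in
        "sampling_args": copy.deepcopy(parent_state.get("sampling_args", {})),
        "oai_tools": copy.deepcopy(parent_state.get("oai_tools")),
    }

def sub_llm_call(get_model_response, parent_state: dict, prompt: str):
    """Route a sub-LLM prompt through the normal chat path with a fake state."""
    fake_state = make_sub_llm_state(parent_state)
    messages = [{"role": "user", "content": prompt}]
    return get_model_response(messages, fake_state)
```

Deep-copying the sampling args matches the "defensively copied" wording above: even if the model-response path mutates them, the parent rollout's args stay intact.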

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Resolves sub-LLM rollout collisions and simplifies request paths; tightens llm_batch usage; and streamlines the rlm_secrets environment.

  • Sub-LLM calls now always go through get_model_response (chat path) with a fake state, ignoring interleaving; removed /chat/completions/tokens and explicit tokenization. Sampling args are mirrored; message/sampling normalization moved to module-level helpers.
  • include_sub_llm_in_trajectory defaults to False. Interleaved rollouts are explicitly disallowed when it’s True (guards in set_interleaved_rollouts and setup_state). Sub-LLM steps can be added to the trajectory with extras.is_sub_llm_call.
  • llm_batch accepts a list of strings only; docs and tool help updated accordingly.
  • rlm_secrets: dataset rows include example_id and task; removed eval split creation; load_environment no longer builds eval_dataset. README adds guidance to re-seed for eval.
  • Tests updated/added for chat-only sub-LLM path and arg normalization, prompt validation, new trajectory default/guards, and sub-LLM step recording.
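The strings-only contract on llm_batch could be enforced with a guard like the following sketch. The function shape, the injected run_sub_llm callable, and the error-message wording are assumptions, not the repo's exact implementation:

```python
# Hypothetical sketch: llm_batch accepts only a list of strings; any
# non-string prompt produces an error message in the results rather than
# raising, so one bad prompt does not abort the whole batch.
def llm_batch(prompts: list, run_sub_llm) -> list:
    """Run a batch of sub-LLM prompts; non-string entries return errors."""
    results = []
    for p in prompts:
        if not isinstance(p, str):
            results.append(
                "Error: llm_batch only accepts a list of strings; "
                f"got {type(p).__name__}"
            )
        else:
            results.append(run_sub_llm(p))  # placeholder for the chat-path call
    return results
```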

Written by Cursor Bugbot for commit 631b24a.

@snimu snimu changed the title Sebastian/rlm sub llm call path 2026 01 25 RLM: Fix trajectory collision Jan 25, 2026
@snimu snimu requested a review from willccbb January 25, 2026 20:41
@willccbb (Member) commented:

hmmm would it be possible to have this handled by overriding add_trajectory_step from MultiTurnEnv? maybe we still want to log those steps, but they shouldn't be in the main sequence in state['trajectory'] if they won't be used for training IMO

this feels like some of the RLM logic is creeping a bit too low into the stack (base Environment shouldn't know what an RLM is or have to think about it), and preparing the context for the API call should probably be handled by get_prompt_messages. If the RLM environment promises that get_prompt_messages only ever contains "increasing" sequences of messages on the subset of turns where a step will be added to state['trajectory'], then the old/current approach should work without changes I think?

In general though, I'm a bit skeptical about trying to shoehorn RLMs into the interleaved strategy. Ultimately we want a single "best of both worlds" strategy which is always TITO, always "just works" with get_prompt_messages, and aggressively interleaves until a message sequence forces a branch
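The override the reviewer suggests could look something like this sketch. The method name add_trajectory_step comes from the comment above; the state keys and the extras.is_sub_llm_call flag come from the PR description, but the exact signatures are assumptions:

```python
# Hypothetical sketch of the reviewer's suggestion: sub-LLM steps are still
# logged, but diverted into a side channel instead of state["trajectory"],
# so they never enter the main (trainable) sequence.
class RLMEnv:
    def add_trajectory_step(self, state: dict, step: dict) -> None:
        if step.get("extras", {}).get("is_sub_llm_call"):
            # keep the step for inspection/debugging, out of the training sequence
            state.setdefault("sub_llm_steps", []).append(step)
        else:
            state.setdefault("trajectory", []).append(step)
```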

@cursor (bot) left a comment
Cursor Bugbot has reviewed your changes and found 1 potential issue.

@snimu snimu merged commit 6ebb4e3 into main Jan 28, 2026
6 checks passed