RLM: Fix trajectory collision #786

Merged
snimu merged 15 commits into main from sebastian/rlm-sub-llm-call-path-2026-01-25
Jan 28, 2026

Conversation

@snimu (Contributor) commented Jan 25, 2026

Description

  • Fixes interleaved prompt-ID computation in RLMEnv (root cause of /tokenize empty-messages errors). Sub-LLM calls now go through Environment.get_model_response using a fake state with an empty trajectory and mirrored sampling_args/oai_tools, so prompt-ID computation can't be polluted. The old /chat/completions/tokens path and explicit tokenization are gone for sub-LLMs
  • include_sub_llm_in_trajectory default is False, and interleaving is explicitly disallowed when it’s True (guards in set_interleaved_rollouts and setup_state)
  • llm_batch is now strings‑only (enforced + documented). Non‑string prompts return an error message
  • Removed sub‑LLM message normalization and sampling‑arg normalization helpers; sampling args are now just defensively copied and normalization is handled by get_model_response
  • RLM‑secrets dataset updates: add example_id/task, remove eval dataset creation, README updated with eval guidance
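The fake-state approach from the first bullet could look roughly like the sketch below. All names here (get_model_response, the state keys, sub_llm_call) are assumptions for illustration, not the repository's actual API:

```python
import copy

# Hypothetical sketch of the fake-state approach: give each sub-LLM call an
# isolated state with an empty trajectory and deep-copied sampling args, so
# prompt-ID computation that hashes the trajectory cannot collide with (or
# mutate) the parent rollout. Real field/function names may differ.
def make_sub_llm_state(parent_state: dict) -> dict:
    """Build an isolated state for a sub-LLM call."""
    return {
        "trajectory": [],  # empty: nothing from the parent rollout leaks in
        "sampling_args": copy.deepcopy(parent_state.get("sampling_args", {})),
        "oai_tools": copy.deepcopy(parent_state.get("oai_tools")),
    }

def sub_llm_call(get_model_response, parent_state: dict, prompt: str):
    """Route a sub-LLM prompt through the normal chat path with a fake state."""
    fake_state = make_sub_llm_state(parent_state)
    messages = [{"role": "user", "content": prompt}]
    return get_model_response(messages, fake_state)
```

Deep-copying the sampling args matches the "defensively copied" wording above: even if the model-response path mutates them, the parent rollout's args stay intact.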

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Resolves sub-LLM rollout collisions and simplifies request paths; tightens llm_batch usage; and streamlines the rlm_secrets environment.

  • Sub-LLM calls now always go through get_model_response (chat path) with a fake state, ignoring interleaving; removed /chat/completions/tokens and explicit tokenization. Sampling args are mirrored; message/sampling normalization moved to module-level helpers.
  • include_sub_llm_in_trajectory defaults to False. Interleaved rollouts are explicitly disallowed when it’s True (guards in set_interleaved_rollouts and setup_state). Sub-LLM steps can be added to the trajectory with extras.is_sub_llm_call.
  • llm_batch accepts a list of strings only; docs and tool help updated accordingly.
  • rlm_secrets: dataset rows include example_id and task; removed eval split creation; load_environment no longer builds eval_dataset. README adds guidance to re-seed for eval.
  • Tests updated/added for chat-only sub-LLM path and arg normalization, prompt validation, new trajectory default/guards, and sub-LLM step recording.
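The strings-only contract on llm_batch could be enforced with a guard like the following sketch. The function shape, the injected run_sub_llm callable, and the error-message wording are assumptions, not the repo's exact implementation:

```python
# Hypothetical sketch: llm_batch accepts only a list of strings; any
# non-string prompt produces an error message in the results rather than
# raising, so one bad prompt does not abort the whole batch.
def llm_batch(prompts: list, run_sub_llm) -> list:
    """Run a batch of sub-LLM prompts; non-string entries return errors."""
    results = []
    for p in prompts:
        if not isinstance(p, str):
            results.append(
                "Error: llm_batch only accepts a list of strings; "
                f"got {type(p).__name__}"
            )
        else:
            results.append(run_sub_llm(p))  # placeholder for the chat-path call
    return results
```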

Written by Cursor Bugbot for commit 631b24a.

@snimu snimu changed the title Sebastian/rlm sub llm call path 2026 01 25 RLM: Fix trajectory collision Jan 25, 2026
@snimu snimu requested a review from willccbb January 25, 2026 20:41
@willccbb (Member) commented:

hmmm would it be possible to have this handled by overriding add_trajectory_step from MultiTurnEnv? maybe we still want to log those steps, but they shouldn't be in the main sequence in state['trajectory'] if they won't be used for training IMO

this feels like some of the RLM logic is creeping a bit too low into the stack (base Environment shouldn't know what an RLM is or have to think about it), and preparing the context for the API call should probably be handled by get_prompt_messages. If the RLM environment promises that get_prompt_messages only ever contains "increasing" sequences of messages on the subset of turns where a step will be added to state['trajectory'], then the old/current approach should work without changes I think?

In general though, I'm a bit skeptical about trying to shoehorn RLMs into the interleaved strategy. Ultimately we want a single "best of both worlds" strategy which is always TITO, always "just works" with get_prompt_messages, and aggressively interleaves until a message sequence forces a branch
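The override the reviewer suggests could look something like this sketch. The method name add_trajectory_step comes from the comment above; the state keys and the extras.is_sub_llm_call flag come from the PR description, but the exact signatures are assumptions:

```python
# Hypothetical sketch of the reviewer's suggestion: sub-LLM steps are still
# logged, but diverted into a side channel instead of state["trajectory"],
# so they never enter the main (trainable) sequence.
class RLMEnv:
    def add_trajectory_step(self, state: dict, step: dict) -> None:
        if step.get("extras", {}).get("is_sub_llm_call"):
            # keep the step for inspection/debugging, out of the training sequence
            state.setdefault("sub_llm_steps", []).append(step)
        else:
            state.setdefault("trajectory", []).append(step)
```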

@cursor (bot) left a comment
Cursor Bugbot has reviewed your changes and found 1 potential issue.

@snimu snimu merged commit 6ebb4e3 into main Jan 28, 2026
6 checks passed