created the rlm_secrets environment #763

Merged

snimu merged 6 commits into main from sebastian/rlm-secrets-env-2026-01-22 on Jan 22, 2026
Conversation


@snimu commented Jan 22, 2026

Description

Secrets: the basic idea of the eval is that the model must call functions in a specific order and pass information between them to solve a puzzle. The design forces the root model to use sub-LLMs, and those sub-LLMs to use tools; the secrets can also live in files, so that file access is tested as well.

  • Setup:
    • There's a bunch of files with random names, each containing a random UUID as content
    • The prompt tells the RLM that it must first find the correct order of the filenames (which are also randomized), then delete all files but the correct one, and answer with the position of the remaining file
  • Available functions:
    • To root-LLM:
      • decrypt_position(file_name: str, code: str) -> int | str
        • returns the position if the code was right
        • returns an error message if the code was wrong
        • internally, it just checks whether the code is the correct random string for file_name (and that file_name is one of the files) and returns the corresponding output
      • unveil_file_number(sorted_filenames: list[str]) -> int | str
        • if the filenames are passed in the correct order, it returns the position of the file that should remain
        • otherwise, it returns an error message
    • To sub-LLMs:
      • get_code_from_file_data(filename: str, filecontent: str) -> str
        • If filename exists and filecontent is its actual content, the function returns another random code, which the root-LLM can pass to decrypt_position along with the filename
        • Otherwise, it returns a random code generated on the fly that looks just like a correct code would, and thus still forces the root-LLM to call decrypt_position and check
  • Example rollout:
    • root-LLM uses bash to list the available files
    • it calls a sub-LLM with the filename and content (which it had to read via bash) and prompts it for the corresponding code
    • the sub-LLM calls get_code_from_file_data and returns the code
    • the root-LLM calls decrypt_position
    • if there's an error the root-LLM will try again, otherwise the position is clear
    • this is repeated with all the files
    • the root-LLM now knows the relative position of every file, so it calls unveil_file_number with the correctly ordered filenames; either it has to retry everything, or it gets a valid number
    • the root-LLM deletes all files but the correct one and answers with the file number in its final response
  • Which parts of the RLM are touched:
    • Tools on all levels
    • Sub-LLM calls
    • File operations: listing, reading, deleting
  • And yet it's a very simple environment, and Codex can probably one-shot the implementation, too
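To make the mechanics concrete, here is a minimal self-contained sketch of the setup and the three tools in plain Python. Everything here (the `make_puzzle` helper, the `state` dict, the code lengths) is illustrative, not the actual implementation in environments/rlm_secrets/rlm_secrets.py, which wires these tools through the verifiers API:

```python
import os
import random
import string
import uuid


def _random_code(length: int = 16) -> str:
    """A random alphanumeric string, used both for real codes and decoys."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


def make_puzzle(workdir: str, n_files: int = 4) -> dict:
    """Create n_files files with random names and UUID contents, and fix a
    hidden correct ordering, a per-file secret code, and the file to keep."""
    filenames = [_random_code(8) + ".txt" for _ in range(n_files)]
    for name in filenames:
        with open(os.path.join(workdir, name), "w") as f:
            f.write(str(uuid.uuid4()))
    order = filenames[:]
    random.shuffle(order)  # the hidden correct order
    return {
        "workdir": workdir,
        "codes": {name: _random_code() for name in filenames},
        "order": order,
        "keep_position": random.randrange(n_files),
    }


# --- sub-LLM tool ---
def get_code_from_file_data(state: dict, filename: str, filecontent: str) -> str:
    path = os.path.join(state["workdir"], filename)
    if os.path.exists(path) and open(path).read() == filecontent:
        return state["codes"][filename]  # the real code
    return _random_code()                # plausible-looking decoy


# --- root-LLM tools ---
def decrypt_position(state: dict, file_name: str, code: str):
    if state["codes"].get(file_name) == code:
        return state["order"].index(file_name)  # position in the hidden order
    return "error: wrong code or unknown file"


def unveil_file_number(state: dict, sorted_filenames: list):
    if sorted_filenames == state["order"]:
        return state["keep_position"]
    return "error: filenames are not in the correct order"
```

In a rollout, the root-LLM would list the files via bash, have sub-LLMs call get_code_from_file_data on each file, feed every returned code to decrypt_position, and finally call unveil_file_number with the reconstructed order.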

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Adds a new RLM evaluation environment focused on multi-turn tool use and file operations.

  • New environments/rlm_secrets/rlm_secrets.py: defines RLMSecretsEnv with root tools (decrypt_position, unveil_file_number), sub-LLM tool (get_code_from_file_data), filesystem setup of random .txt files, dataset builder, and reward functions (correct_answer, correct_filesystem_state).
  • Implements sub-LLM invocation via llm_batch with state injection for sub-tools and retains rollout filesystem for verification/cleanup.
  • New environments/rlm_secrets/README.md: docs for puzzle flow, tools, usage (uv run vf-eval rlm-secrets), config, and rewards.
  • New environments/rlm_secrets/pyproject.toml: package metadata, verifiers>=0.1.8 dependency, build config, and eval settings.
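The two reward functions named above could look roughly like this; the actual signatures follow the verifiers rubric API, so treat the argument names here as assumptions:

```python
import os


def correct_answer(completion: str, keep_position: int) -> float:
    """1.0 if the final answer mentions the correct file number.
    Hypothetical signature; the real function follows the verifiers API."""
    return 1.0 if str(keep_position) in completion.split() else 0.0


def correct_filesystem_state(workdir: str, keep_filename: str) -> float:
    """1.0 if exactly the one correct file remains after the rollout."""
    return 1.0 if set(os.listdir(workdir)) == {keep_filename} else 0.0
```

Splitting the reward this way lets the eval distinguish a model that merely guessed the right number from one that also performed the required file deletions.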

Written by Cursor Bugbot for commit 5295d0a. This will update automatically on new commits.

cursor[bot]

This comment was marked as outdated.


@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@snimu merged commit 177baaf into main on Jan 22, 2026
6 checks passed