Terminal Bench Integration into rLLM (Simplified) #205
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
PR Review: Terminal-Bench Terminus 1 Integration (rLLM)
Summary
This PR integrates Terminal Bench’s Terminus 1 agent into rLLM for reproducible terminal task evaluation. We reuse Terminal Bench internals (prompt templates, command execution, evaluation) and only replace the LLM interaction with any
RolloutEngine(OpenAI/VERL). The workflow returns a standard rLLMEpisodewith a single "terminus" trajectory containing step-by-step chat history and model responses.Requirements: Python ≥ 3.12,
pip install terminal-bench, andOPENAI_API_KEYset. Run the example with:Files
rllm/rllm/integrations/terminal_terminus_1.py
RLLMModel, a thin adapter subclassing Terminal-BenchTerminus.CommandBatchResponse.(Trajectory, TerminationReason)for packaging by the workflow.rllm/examples/terminal/terminus_workflow.py
TerminalTerminusWorkflowusingRLLMModeland Terminal-Bench’s terminal/session management.Episode.rllm/examples/terminus/run_terminus.py (example runner)
AgentWorkflowEngineusing theRolloutEngine.rllm/examples/terminus/prepare_terminal_data.py (dataset helper)
terminal_bench.dataset.Datasetand extracts{task_path, task_id, instruction}.