Terminal Bench Integration into rLLM (Simplified) #205

JasonWei05 · 2025-09-01T01:44:39Z

PR Review: Terminal-Bench Terminus 1 Integration (rLLM)

Summary

This PR integrates Terminal Bench’s Terminus 1 agent into rLLM for reproducible terminal task evaluation. We reuse Terminal Bench internals (prompt templates, command execution, evaluation) and only replace the LLM interaction with any RolloutEngine (OpenAI/VERL). The workflow returns a standard rLLM Episode with a single "terminus" trajectory containing step-by-step chat history and model responses.

Requirements: Python ≥ 3.12, pip install terminal-bench, and OPENAI_API_KEY set. Run the example with:

python examples/terminal/run_terminal.py

Files

rllm/rllm/integrations/terminal_terminus_1.py
- Adds RLLMModel, a thin adapter subclassing Terminal-Bench Terminus.
- Maintains TB Chat-style message history, formats JSON response schema, retries at most three times, parses into CommandBatchResponse.
- Runs the agent loop and returns (Trajectory, TerminationReason) for packaging by the workflow.
rllm/examples/terminal/terminus_workflow.py
- Adds TerminalTerminusWorkflow using RLLMModel and Terminal-Bench’s terminal/session management.
- Resets environment, runs the control loop to completion, evaluates by running Terminal-Bench tests, returns an Episode.
rllm/examples/terminus/run_terminus.py (example runner)
- Demonstrates running the full Terminal-Bench dataset with AgentWorkflowEngine using the RolloutEngine.
- Keeps the engine interchangeable.
rllm/examples/terminus/prepare_terminal_data.py (dataset helper)
- Loads Terminal-Bench datasets via terminal_bench.dataset.Dataset and extracts {task_path, task_id, instruction}.

jeffreysijuntan · 2025-09-04T06:17:00Z

Great work!

- Remove gaia.json dataset (users should download locally via scripts) - Remove custom calculator_tool.py (use strands_tools.calculator instead) - Remove backup files and output files (following clean PR practices) - Standardize calculator usage across all examples Aligns with PR rllm-org#205 clean practices - only essential code in repo. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

JasonWei05 added 4 commits September 1, 2025 01:22

Terminal Bench integration into rLLM with new workflow and integration.

0c83026

Fixed typo

6d2f6cf

Fixed coding style using pre-commit.

55f514a

Added README to examples/terminal/

5e3e47f

jeffreysijuntan merged commit 6562aab into rllm-org:v0.2 Sep 4, 2025
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Terminal Bench Integration into rLLM (Simplified) #205

Terminal Bench Integration into rLLM (Simplified) #205

Uh oh!

JasonWei05 commented Sep 1, 2025 •

edited

Loading

Uh oh!

jeffreysijuntan commented Sep 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Terminal Bench Integration into rLLM (Simplified) #205

Terminal Bench Integration into rLLM (Simplified) #205

Uh oh!

Conversation

JasonWei05 commented Sep 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Terminal-Bench Terminus 1 Integration (rLLM)

Summary

Files

Uh oh!

jeffreysijuntan commented Sep 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JasonWei05 commented Sep 1, 2025 •

edited

Loading