Skip to content

Conversation

@JasonWei05
Copy link
Contributor

@JasonWei05 JasonWei05 commented Sep 1, 2025

PR Review: Terminal-Bench Terminus 1 Integration (rLLM)

Summary

This PR integrates Terminal Bench’s Terminus 1 agent into rLLM for reproducible terminal task evaluation. We reuse Terminal Bench internals (prompt templates, command execution, evaluation) and only replace the LLM interaction with any RolloutEngine (OpenAI/VERL). The workflow returns a standard rLLM Episode with a single "terminus" trajectory containing step-by-step chat history and model responses.

Requirements: Python ≥ 3.12, pip install terminal-bench, and OPENAI_API_KEY set. Run the example with:

python examples/terminal/run_terminal.py

Files

  • rllm/rllm/integrations/terminal_terminus_1.py

    • Adds RLLMModel, a thin adapter subclassing Terminal-Bench Terminus.
    • Maintains TB Chat-style message history, formats JSON response schema, retries at most three times, parses into CommandBatchResponse.
    • Runs the agent loop and returns (Trajectory, TerminationReason) for packaging by the workflow.
  • rllm/examples/terminal/terminus_workflow.py

    • Adds TerminalTerminusWorkflow using RLLMModel and Terminal-Bench’s terminal/session management.
    • Resets environment, runs the control loop to completion, evaluates by running Terminal-Bench tests, returns an Episode.
  • rllm/examples/terminus/run_terminus.py (example runner)

    • Demonstrates running the full Terminal-Bench dataset with AgentWorkflowEngine using the RolloutEngine.
    • Keeps the engine interchangeable.
  • rllm/examples/terminus/prepare_terminal_data.py (dataset helper)

    • Loads Terminal-Bench datasets via terminal_bench.dataset.Dataset and extracts {task_path, task_id, instruction}.

@jeffreysijuntan
Copy link
Contributor

Great work!

@jeffreysijuntan jeffreysijuntan merged commit 6562aab into rllm-org:v0.2 Sep 4, 2025
1 check passed
yayashuxue added a commit to yayashuxue/rllm that referenced this pull request Sep 6, 2025
- Remove gaia.json dataset (users should download locally via scripts)
- Remove custom calculator_tool.py (use strands_tools.calculator instead)
- Remove backup files and output files (following clean PR practices)
- Standardize calculator usage across all examples

Aligns with PR rllm-org#205 clean practices - only essential code in repo.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
yayashuxue added a commit to yayashuxue/rllm that referenced this pull request Sep 6, 2025
- Remove gaia.json dataset (users should download locally via scripts)
- Remove custom calculator_tool.py (use strands_tools.calculator instead)
- Remove backup files and output files (following clean PR practices)
- Standardize calculator usage across all examples

Aligns with PR rllm-org#205 clean practices - only essential code in repo.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
yayashuxue added a commit to yayashuxue/rllm that referenced this pull request Sep 6, 2025
- Remove gaia.json dataset (users should download locally via scripts)
- Remove custom calculator_tool.py (use strands_tools.calculator instead)
- Remove backup files and output files (following clean PR practices)
- Standardize calculator usage across all examples

Aligns with PR rllm-org#205 clean practices - only essential code in repo.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants