
No eval shortcut #152


Merged
22 commits merged on Jun 13, 2025
Commits
91218b7
Disable auto_eval_on_rewrite
matheper Jun 10, 2025
8c55620
Use jinja2 to render system prompt
matheper Jun 10, 2025
c15846c
Load template from config or template file. Add base_agent tests.
matheper Jun 11, 2025
99ef814
Add jinja to requirements
matheper Jun 11, 2025
4a3cc9e
Render system prompt from agent + info data
matheper Jun 11, 2025
73ab884
Only load system prompt template from jinja2 file, not plain text.
matheper Jun 12, 2025
9e25dfe
Merge branch 'main' into no-eval-shortcut
matheper Jun 12, 2025
e4b0131
removed unused imports
matheper Jun 12, 2025
8085ece
Refactor system prompt template to use Jinja2 for rendering and add c…
matheper Jun 12, 2025
f3efe36
Add comments to optionally load a custom system prompt template in co…
matheper Jun 12, 2025
bd04e82
Disable auto_eval_on_rewrite for swe-smith
matheper Jun 12, 2025
088ee63
minor
xingdi-eric-yuan Jun 12, 2025
df9de01
Build default system prompt from dict. Add trim back
matheper Jun 12, 2025
6e271d9
fix test
matheper Jun 12, 2025
76bf08a
Simplify env instruction for better rendering
matheper Jun 12, 2025
cddf22a
removed BASE_SYSTEM_PROMPT_TEMPLATE
matheper Jun 12, 2025
cf35191
Update readme with instructions to use jinja
matheper Jun 12, 2025
b6e0b3b
Agent filter trims from middle by default
matheper Jun 12, 2025
b2a6800
Add human friendly system prompt template
matheper Jun 12, 2025
9a3fdf7
Add jinja templates to MANIFEST.in
matheper Jun 12, 2025
5859545
Merge pull request #155 from microsoft/default-dict-template
matheper Jun 12, 2025
b98a343
merge agents tests
matheper Jun 12, 2025
3 changes: 2 additions & 1 deletion MANIFEST.in
@@ -1 +1,2 @@
include debug_gym/envs/configs/*.yaml
include scripts/templates/*.jinja
73 changes: 67 additions & 6 deletions README.md
@@ -137,28 +137,89 @@ We provide a human mode that enables developers to manually interact with `debug

#### 3.3. Overriding Values in Config

`-p` is a handy way to override values defined in config. For example, the below command will run rewrite_agent agent on Aider with human mode (while in config file it specifies gpt-4o).
The `-p` flag is a handy way to override values defined in the config file. For example, the command below runs the debug_agent agent on Aider in human mode (even if the config file specifies gpt-4o). It also overrides the default system prompt (see below for more information).

python scripts/run.py scripts/config_aider.yaml --agent rewrite_agent -v -p rewrite_agent.llm_name="human"
python scripts/run.py scripts/config_aider.yaml \
    --agent debug_agent \
    -v \
    -p debug_agent.llm_name="human" \
    -p debug_agent.system_prompt_template_file="scripts/templates/human_friendly_system_prompt.jinja"

#### 3.4. Debugging a Custom Repository

#### 3.4. Customizing the System Prompt with Jinja Templates

`debug-gym` allows you to fully customize the system prompt by providing a [Jinja](https://jinja.palletsprojects.com/) template file. This enables you to control the format and content of the prompt sent to the LLM, making it easier to adapt the environment to your specific needs or research experiments.

To use a custom system prompt template, specify the path to your Jinja template file in your agent's configuration under `system_prompt_template_file`. For example:

```yaml
debug_agent:
  system_prompt_template_file: scripts/templates/custom_system_prompt.jinja
```

Alternatively, you can provide a custom template from the command line with `-p <agent>.system_prompt_template_file="<path/to/template.jinja>"` (see above).

Within your Jinja template, you have access to the `agent` and `info` objects, which provide all relevant context about the current environment and agent state.

#### Custom Jinja Filters

In addition to all [built-in Jinja filters](https://jinja.palletsprojects.com/en/stable/templates/#list-of-builtin-filters), two custom filters are available for use in your template:

- **`to_pretty_json`**: Converts a Python object to a pretty-printed JSON string. Useful for displaying structured data in a readable format.
```jinja
{{ info.tools | to_pretty_json }}
```
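Under the hood, this filter is a thin wrapper around Python's `json.dumps` (as added to `base_agent.py` in this PR); a minimal standalone sketch:

```python
import json

def to_pretty_json(value):
    # Pretty-print with 2-space indentation, preserving key order
    # (mirrors the agent's to_pretty_json filter)
    return json.dumps(value, indent=2, sort_keys=False)

print(to_pretty_json({"name": "pdb", "arguments": {"command": "b 10"}}))
```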

- **`trim_message`**: Trims a string to fit within a token or character limit, also filtering out non-UTF8 characters. This is helpful for ensuring that large outputs (such as directory trees or evaluation results) do not exceed the LLM's context window. The `trim_message` filter accepts the following arguments to control how messages are trimmed:
  - **`max_length`**: The maximum number of tokens to keep in the message. If the message exceeds this length, it will be trimmed.
  - **`max_length_percentage`**: Instead of specifying an absolute number, you can provide a percentage (e.g., `0.1` for 10%) of the LLM's context window. The message will be trimmed to fit within this percentage of the model's maximum context length.
  - **`where`**: Specifies where to trim the message if it exceeds the limit. The default is `"middle"`, which trims from the middle of the message. Other options are `"start"` and `"end"`.

```jinja
{{ info.dir_tree | trim_message(max_length_percentage=0.1, where="end") }}
```
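To illustrate the trimming semantics, here is a simplified character-based sketch; the real filter counts tokens via the LLM's tokenizer and also strips non-UTF8 characters, and the helper name `trim_chars` is invented for this illustration:

```python
def trim_chars(message, max_length, where="middle"):
    # Character-based stand-in for the token-based trim_message filter
    if len(message) <= max_length:
        return message
    marker = "..."
    keep = max_length - len(marker)
    if where == "end":
        return message[:keep] + marker
    if where == "start":
        return marker + message[-keep:]
    # default "middle": keep the head and tail, cut the center
    head = keep // 2
    tail = keep - head
    return message[:head] + marker + message[-tail:]

tree = "\n".join(f"src/module_{i}.py" for i in range(200))
print(len(trim_chars(tree, max_length=80, where="end")))  # 80
```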

#### Example Template

```jinja
System Prompt for Debug-Gym

Task: {{ agent.system_prompt }}

Instructions:
{{ info.instructions }}

Directory Tree:
{{ info.dir_tree | trim_message(max_length=1000) }}

Current Breakpoints:
{{ info.current_breakpoints | to_pretty_json }}

{% if agent.shortcut_features() %}
Shortcut Features:
{{ agent.shortcut_features() | to_pretty_json }}
{% endif %}
```


#### 3.5. Debugging a Custom Repository

Modify `scripts/config.yaml`, especially the `env_kwargs`, to set the path and entrypoint of the custom repository. We assume the repository contains a `.debugignore` file and a `.debugreadonly` file, which label files/folders that are invisible or read-only to the agent, respectively.
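As a purely hypothetical illustration (this page does not document the pattern syntax), such files might contain gitignore-style patterns:

```
# .debugignore — hidden from the agent (illustrative patterns)
.git/
__pycache__/
*.log

# .debugreadonly — visible but not editable (illustrative patterns)
tests/
```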

As an example, we provide a buggy pytorch code repository in `data/pytorch`.

python scripts/run.py scripts/config.yaml --agent <agent name>

#### 3.5. Debugging a Custom SWE-Smith Instance
#### 3.6. Debugging a Custom SWE-Smith Instance

[SWE-Smith](https://github.com/SWE-bench/SWE-smith) allows generating new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), one can override `-p base.env_kwargs.dataset_id=<dataset_id>` on the command line to run the agent on that dataset. For example, to run on a local dataset:

python scripts/run.py scripts/config_swesmith.yaml --agent <agent name> -p base.env_kwargs.dataset_id="path/to/local/dataset"

#### 3.6. Design Your Own Tool
#### 3.7. Design Your Own Tool
`debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new tools that are `debug-gym`-compatible, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).

#### 3.7. Analysis and Visualization
#### 3.8. Analysis and Visualization

We provide a set of scripts to help analyze the log files (e.g., the `.jsonl` files) generated by the agent.
- In the `analysis` folder, we provide the scripts used to generate the corresponding figures in our technical report.
171 changes: 110 additions & 61 deletions debug_gym/agents/base_agent.py
@@ -2,10 +2,10 @@
import os
import subprocess
import uuid
from collections import OrderedDict
from os.path import join as pjoin

import numpy as np
from jinja2 import Environment, Template

from debug_gym.agents.history_tracker import HistoryTracker, build_history_prompt
from debug_gym.agents.utils import trim
@@ -72,72 +72,120 @@ def parse_reasoning_model_response(self, response, reasoning_end_token):
response = response[reasoning_end:].strip()
return response

def build_system_prompt(self, info):
def calc_tokens_left(system_prompt: dict):
system_prompt = filter_non_utf8(
json.dumps(system_prompt, indent=2, sort_keys=False)
)
return self.llm.context_length - self.llm.count_tokens(system_prompt)

system_prompt = OrderedDict()
system_prompt["Overall task"] = self.system_prompt
system_prompt["Instructions"] = info.instructions
if self.llm.context_length is not None and self.llm.count_tokens is not None:
system_prompt["Repo directory tree"] = trim(
info.dir_tree,
min(
int(0.1 * self.llm.context_length), calc_tokens_left(system_prompt)
),
count_tokens=self.llm.count_tokens,
where="end",
def _auto_eval_on_rewrite(self):
"""Check if auto eval on rewrite is enabled."""
return self.config.get("env_kwargs", {}).get("auto_eval_on_rewrite", False)

def shortcut_features(self):
features = []
if self._auto_eval_on_rewrite():
features.append(
"After successful rewrites, the environment will automatically "
"call the Eval tool to evaluate the rewritten code. Therefore, "
"you do not need to call the Eval tool yourself. The evaluation "
"output will be updated automatically in the system prompt."
)
else:
system_prompt["Repo directory tree"] = info.dir_tree
system_prompt["Current breakpoints"] = info.current_breakpoints
if self.env.has_tool("pdb"):
if self.config.get("env_kwargs", {}).get("persistent_breakpoints"):
features.append(
"The environment will automatically restore existing breakpoints "
"when a new PDB session is started (e.g., after a rewrite)."
)
if self.config.get("env_kwargs", {}).get("auto_list"):
features.append(
"After every valid PDB tool calling, the environment will "
"automatically call the PDB tool again with a `list .` command, "
"which will show the code around the current frame."
)
return features

@staticmethod
def to_pretty_json(value):
"""Convert a value to a pretty JSON string."""
return json.dumps(value, indent=2, sort_keys=False)

if self.llm.context_length is not None and self.llm.count_tokens is not None:
system_prompt["Evaluation output of current code"] = trim(
def trim_message(
self,
message,
count_tokens=None,
max_length=None,
max_length_percentage=0,
where="middle",
):
"""Filter non utf8 and trim the message to fit within the token limit.
If the message exceeds the max_length, it will be trimmed to fit.
The `max_length` can be specified as an absolute value or a percentage
of the LLM's context length, if any."""
message = filter_non_utf8(message)
count_tokens = count_tokens or self.llm.count_tokens
max_length = (
max_length
or max_length_percentage * self.llm.context_length
or self.llm.context_length
)

if count_tokens is None or max_length is None or max_length <= 0:
return message
tokens = count_tokens(message)
if tokens > max_length:
return trim(message, max_length, count_tokens=count_tokens, where=where)
return message

def _load_system_prompt_template(self) -> Template | None:
"""Load system prompt template from config if specified and register custom filters.
If no template is specified, return None.
"""
system_prompt_template = self.config.get("system_prompt_template_file")
if system_prompt_template:
if not os.path.isfile(system_prompt_template):
error_msg = (
f"System prompt template file `{system_prompt_template}` not found."
)
self.logger.error(error_msg)
raise FileNotFoundError(error_msg)
with open(system_prompt_template, "r") as f:
system_prompt_template = f.read()
# Add custom filter to Jinja2 environment
env = Environment()
env.filters["to_pretty_json"] = self.to_pretty_json
env.filters["trim_message"] = self.trim_message
return env.from_string(system_prompt_template)
return None

def _default_system_prompt(self, info) -> str:
"""Return the default system prompt as pretty JSON.
Trimmed to fit within the token limit."""

system_prompt_dict = {
"Overall task": self.system_prompt,
"Instructions": info.instructions,
"Repo directory tree": self.trim_message(
info.dir_tree, max_length_percentage=0.1, where="end"
),
"Current breakpoints": info.current_breakpoints,
}

if self._auto_eval_on_rewrite():
system_prompt_dict["Evaluation output of current code"] = self.trim_message(
info.eval_observation.observation,
min(
int(0.8 * self.llm.context_length), calc_tokens_left(system_prompt)
),
count_tokens=self.llm.count_tokens,
max_length_percentage=0.8,
where="middle",
)
else:
system_prompt["Evaluation output of current code"] = (
info.eval_observation.observation
)

shortcut_features = []
if self.config.get("env_kwargs", {}).get("auto_eval_on_rewrite") is True:
shortcut_features.append(
"After successful rewrites, the environment will automatically call the Eval tool to evaluate the rewritten code. Therefore, you do not need to call the Eval tool yourself. The evaluation output will be updated automatically in the system prompt."
)
if self.config.get("env_kwargs", {}).get(
"persistent_breakpoints"
) is True and self.env.has_tool("pdb"):
shortcut_features.append(
"The environment will automatically restore existing breakpoints when a new PDB session is started (e.g., after a rewrite)."
)
if self.config.get("env_kwargs", {}).get(
"auto_list"
) is True and self.env.has_tool("pdb"):
shortcut_features.append(
"After every valid PDB tool calling, the environment will automatically call the PDB tool again with a `list .` command, which will show the code around the current frame."
)
if len(shortcut_features) > 0:
system_prompt["Shortcut features"] = shortcut_features
shortcut_features = self.shortcut_features()
if shortcut_features:
system_prompt_dict["Shortcut features"] = shortcut_features

system_prompt = filter_non_utf8(
json.dumps(system_prompt, indent=2, sort_keys=False)
)
messages = [
{
"role": "system",
"content": system_prompt,
}
]
return self.to_pretty_json(system_prompt_dict)

def build_system_prompt(self, info):
"""Build system prompt using jinja template from config or default template."""
system_prompt_template = self._load_system_prompt_template()
if system_prompt_template is not None:
system_prompt = system_prompt_template.render(agent=self, info=info)
else:
system_prompt = self._default_system_prompt(info)
messages = [{"role": "system", "content": filter_non_utf8(system_prompt)}]
return messages

def build_question_prompt(self):
@@ -146,7 +194,8 @@ def build_question_prompt(self):
return messages

def build_prompt(self, info):
messages = self.build_system_prompt(info)
messages = []
messages.extend(self.build_system_prompt(info))
messages.extend(self.build_history_prompt())
messages.extend(self.build_question_prompt())
return messages
1 change: 0 additions & 1 deletion debug_gym/agents/debug_agent.py
@@ -1,5 +1,4 @@
from debug_gym.agents.base_agent import BaseAgent, register_agent
from debug_gym.llms.base import LLM


@register_agent
4 changes: 0 additions & 4 deletions debug_gym/agents/solution_agent.py
@@ -1,8 +1,4 @@
import subprocess

from debug_gym.agents.base_agent import BaseAgent, register_agent
from debug_gym.gym.envs.swe_bench import SWEBenchEnv
from debug_gym.gym.envs.swe_smith import SWESmithEnv
from debug_gym.gym.tools.tool import ToolCall


7 changes: 2 additions & 5 deletions debug_gym/gym/envs/aider.py
@@ -12,11 +12,8 @@ class AiderBenchmarkEnv(RepoEnv):
REPO_PATH = Path.joinpath(Path.home(), ".cache", "debug_gym", "exercism")

@property
def instructions(self):
return {
**super().instructions,
"Problem description": self.current_sample["instructions"],
}
def instructions(self) -> str:
return self.current_sample["instructions"]

def __init__(self, entrypoint: str = "python -m pytest -s .", **kwargs):
super().__init__(entrypoint=entrypoint, **kwargs)
5 changes: 2 additions & 3 deletions debug_gym/gym/envs/env.py
@@ -272,9 +272,8 @@ def cleanup_workspace(self):
self.tempdir.cleanup()

@property
def instructions(self):
_instruction = {}
return _instruction
def instructions(self) -> str:
return ""

def display_files(self):
msg = (
7 changes: 2 additions & 5 deletions debug_gym/gym/envs/mini_nightmare.py
@@ -22,11 +22,8 @@ class MiniNightmareEnv(RepoEnv):
]

@property
def instructions(self):
return {
**super().instructions,
"Problem description": self.current_sample["instructions"],
}
def instructions(self) -> str:
return self.current_sample["instructions"]

def __init__(self, entrypoint: str = "python -m pytest -s test.py", **kwargs):
super().__init__(entrypoint=entrypoint, **kwargs)
7 changes: 2 additions & 5 deletions debug_gym/gym/envs/swe_bench.py
@@ -51,11 +51,8 @@ def __init__(
self.test_directives = []

@property
def instructions(self):
return {
**super().instructions,
"Problem description": self.ds_row["problem_statement"],
}
def instructions(self) -> str:
return self.ds_row["problem_statement"]

def load_dataset(self):
self.ds = datasets.load_dataset(self.dataset_id)[self.split]
3 changes: 2 additions & 1 deletion requirements.txt
@@ -13,4 +13,5 @@ docker
swebench==4.0.3
swesmith
prompt_toolkit
anthropic>=0.49.0
jinja2
4 changes: 3 additions & 1 deletion scripts/config.yaml
@@ -8,7 +8,7 @@ base:
"dir_tree_depth": 1,
"run_timeout": 10,
# shortcut features
"auto_eval_on_rewrite": True, # The environment will automatically call the Eval tool after a successful rewrite. If this is set to True, the agent does not need to call the Eval tool itself.
"auto_eval_on_rewrite": False, # The environment will automatically call the Eval tool after a successful rewrite. If this is set to True, the agent does not need to call the Eval tool itself.
"persistent_breakpoints": True, # The environemnt will keep a set of breakpoint states across PDB sessions. When a new PDB session is started, the environment will automatically load the breakpoints from the previous session.
"auto_list": True, # The environment will automatically call `list .` via the PDB tool after every pdb tool call, which will show the code around the current frame.
}
@@ -33,6 +33,8 @@ base:
save_patch: True
log_prompt_response_pairs: True
reset_prompt_history_after_rewrite: True
# Optionally load a custom system prompt template from a file.
# system_prompt_template_file: "scripts/templates/system_prompt.jinja"

rewrite_agent:
tools: ["view", "rewrite", "eval"]