
Commit 951f7a1

No eval shortcut (#152)
* Disable auto_eval_on_rewrite
* Use jinja2 to render system prompt
* Load template from config or template file. Add base_agent tests.
* Add jinja to requirements
* Render system prompt from agent + info data
* Only load system prompt template from jinja2 file, not plain text. Raises FileNotFoundError if system_prompt_template_file not found
* removed unused imports
* Refactor system prompt template to use Jinja2 for rendering and add custom JSON filter
* Add comments to optionally load a custom system prompt template in configuration files
* Disable auto_eval_on_rewrite for swe-smith
* minor
* Build default system prompt from dict. Add trim back
* fix test
* Simplify env instruction for better rendering
* removed BASE_SYSTEM_PROMPT_TEMPLATE
* Update readme with instructions to use jinja
* Agent filter trims from middle by default
* Add human friendly system prompt template
* Add jinja templates to MANIFEST.in
* merge agents tests

Co-authored-by: Xingdi (Eric) Yuan <xingdi-eric-yuan@users.noreply.github.com>
1 parent 8246c98 commit 951f7a1

23 files changed, +475 -200 lines changed

MANIFEST.in

Lines changed: 2 additions & 1 deletion
@@ -1 +1,2 @@
-include debug_gym/envs/configs/*.yaml
+include debug_gym/envs/configs/*.yaml
+include scripts/templates/*.jinja

README.md

Lines changed: 67 additions & 6 deletions
@@ -137,28 +137,89 @@ We provide a human mode that enables developers to manually interact with `debug
 
 #### 3.3. Overriding Values in Config
 
-`-p` is a handy way to override values defined in config. For example, the below command will run rewrite_agent agent on Aider with human mode (while in config file it specifies gpt-4o).
+The `-p` flag is a handy way to override values defined in the config file. For example, the command below will run the debug_agent agent on Aider with human mode (even if the config file specifies gpt-4o). The command also overrides the default system prompt (see below for more information).
 
-python scripts/run.py scripts/config_aider.yaml --agent rewrite_agent -v -p rewrite_agent.llm_name="human"
+python scripts/run.py scripts/config_aider.yaml \
+    --agent debug_agent \
+    -v \
+    -p debug_agent.llm_name="human" \
+    -p debug_agent.system_prompt_template_file="scripts/templates/human_friendly_system_prompt.jinja"
 
-#### 3.4. Debugging a Custom Repository
+
+#### 3.4. Customizing the System Prompt with Jinja Templates
+
+`debug-gym` allows you to fully customize the system prompt by providing a [Jinja](https://jinja.palletsprojects.com/) template file. This enables you to control the format and content of the prompt sent to the LLM, making it easier to adapt the environment to your specific needs or research experiments.
+
+To use a custom system prompt template, specify the path to your Jinja template file in your agent's configuration under `system_prompt_template_file`. For example:
+
+```yaml
+debug_agent:
+  system_prompt_template_file: scripts/templates/custom_system_prompt.jinja
+```
+
+Alternatively, you can provide a custom template from the command line with `-p <agent>.system_prompt_template_file="<path/to/template.jinja>"` (see above).
+
+Within your Jinja template, you have access to the `agent` and `info` objects, which provide all relevant context about the current environment and agent state.
+
+#### Custom Jinja Filters
+
+In addition to all [built-in Jinja filters](https://jinja.palletsprojects.com/en/stable/templates/#list-of-builtin-filters), two custom filters are available for use in your template:
+
+- **`to_pretty_json`**: Converts a Python object to a pretty-printed JSON string. Useful for displaying structured data in a readable format.
+  ```jinja
+  {{ info.tools | to_pretty_json }}
+  ```
+
+- **`trim_message`**: Trims a string to fit within a token or character limit, also filtering out non-UTF8 characters. This is helpful for ensuring that large outputs (such as directory trees or evaluation results) do not exceed the LLM's context window. The `trim_message` filter accepts the following arguments to control how messages are trimmed:
+  - **`max_length`**: The maximum number of tokens to keep in the message. If the message exceeds this length, it will be trimmed.
+  - **`max_length_percentage`**: Instead of specifying an absolute number, you can provide a percentage (e.g., `0.1` for 10%) of the LLM's context window. The message will be trimmed to fit within this percentage of the model's maximum context length.
+  - **`where`**: Specifies where to trim the message if it exceeds the limit. The default is `"middle"`, which trims from the middle of the message. Other options are `start` or `end`.
+
+  ```jinja
+  {{ info.dir_tree | trim_message(max_length_percentage=0.1, where="end") }}
+  ```
+
+#### Example Template
+
+```jinja
+System Prompt for Debug-Gym
+
+Task: {{ agent.system_prompt }}
+
+Instructions:
+{{ info.instructions }}
+
+Directory Tree:
+{{ info.dir_tree | trim_message(max_length=1000) }}
+
+Current Breakpoints:
+{{ info.current_breakpoints | to_pretty_json }}
+
+{% if agent.shortcut_features() %}
+Shortcut Features:
+{{ agent.shortcut_features() | to_pretty_json }}
+{% endif %}
+```
+
+
+#### 3.5. Debugging a Custom Repository
 
 Modify `scripts/config.yaml`, especially the `env_kwargs` to set the path and entrypoint of the custom repository. We assume there is a `.debugignore` file and a `.debugreadonly` within the repository that labels files/folders that are not seen or not editable, respectively.
 
 As an example, we provide a buggy pytorch code repository in `data/pytorch`.
 
 python scripts/run.py scripts/config.yaml --agent <agent name>
 
-#### 3.5. Debugging a Custom SWE-Smith Instance
+#### 3.6. Debugging a Custom SWE-Smith Instance
 
 [SWE-Smith](https://github.com/SWE-bench/SWE-smith) allows generating new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), one can override `-p base.env_kwargs.dataset_id=<dataset_id>` on the command line to run the agent on that dataset. For example, to run on a local dataset:
 
 python scripts/run.py scripts/config_swesmith.yaml --agent <agent name> -p base.env_kwargs.dataset_id="path/to/local/dataset"
 
-#### 3.6. Design Your Own Tool
+#### 3.7. Design Your Own Tool
 `debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new tools that are `debug-gym`-compatible, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).
 
-#### 3.7. Analysis and Visualization
+#### 3.8. Analysis and Visualization
 
 We provide a set of scripts to help analyze the log files (e.g., the `.jsonl` files) generated by the agent.
 - In the `analysis` folder, we provide the scripts used to generate the corresponding figures in our technical report.
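
The README additions above describe rendering the system prompt from a Jinja template with two custom filters. If you want to preview what a rendered prompt would look like outside of an agent run, a minimal sketch along the following lines can help. It is illustrative only: the filters are re-created here with simplified, character-based logic (the real `trim_message` counts tokens via the LLM), the 4000-character budget is an arbitrary placeholder for the model's context length, and the `agent`/`info` objects are throwaway `SimpleNamespace` stand-ins rather than the real classes.

```python
import json
from types import SimpleNamespace

from jinja2 import Environment


def to_pretty_json(value):
    # mirrors the behavior described in the README: pretty-printed JSON
    return json.dumps(value, indent=2, sort_keys=False)


def trim_message(message, max_length=None, max_length_percentage=0, where="middle"):
    # crude character-based stand-in; the real filter counts tokens via the LLM
    limit = max_length or int(max_length_percentage * 4000) or 4000  # placeholder budget
    if len(message) <= limit:
        return message
    if where == "end":
        return message[:limit] + "..."
    if where == "start":
        return "..." + message[-limit:]
    half = limit // 2
    return message[:half] + "\n...\n" + message[-half:]


env = Environment()
env.filters["to_pretty_json"] = to_pretty_json
env.filters["trim_message"] = trim_message

# Hypothetical stand-ins for the real agent instance and environment info object.
agent = SimpleNamespace(system_prompt="Fix the failing tests.", shortcut_features=lambda: [])
info = SimpleNamespace(
    instructions="Reproduce the bug, then rewrite the faulty function.",
    dir_tree="repo/\n  main.py\n  tests/",
    current_breakpoints=[],
)

# Swap in whichever template file you actually have, e.g. the
# scripts/templates/human_friendly_system_prompt.jinja added by this commit.
with open("scripts/templates/custom_system_prompt.jinja") as f:
    template = env.from_string(f.read())

print(template.render(agent=agent, info=info))
```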

debug_gym/agents/base_agent.py

Lines changed: 110 additions & 61 deletions
@@ -2,10 +2,10 @@
 import os
 import subprocess
 import uuid
-from collections import OrderedDict
 from os.path import join as pjoin
 
 import numpy as np
+from jinja2 import Environment, Template
 
 from debug_gym.agents.history_tracker import HistoryTracker, build_history_prompt
 from debug_gym.agents.utils import trim
@@ -72,72 +72,120 @@ def parse_reasoning_model_response(self, response, reasoning_end_token):
         response = response[reasoning_end:].strip()
         return response
 
-    def build_system_prompt(self, info):
-        def calc_tokens_left(system_prompt: dict):
-            system_prompt = filter_non_utf8(
-                json.dumps(system_prompt, indent=2, sort_keys=False)
-            )
-            return self.llm.context_length - self.llm.count_tokens(system_prompt)
-
-        system_prompt = OrderedDict()
-        system_prompt["Overall task"] = self.system_prompt
-        system_prompt["Instructions"] = info.instructions
-        if self.llm.context_length is not None and self.llm.count_tokens is not None:
-            system_prompt["Repo directory tree"] = trim(
-                info.dir_tree,
-                min(
-                    int(0.1 * self.llm.context_length), calc_tokens_left(system_prompt)
-                ),
-                count_tokens=self.llm.count_tokens,
-                where="end",
+    def _auto_eval_on_rewrite(self):
+        """Check if auto eval on rewrite is enabled."""
+        return self.config.get("env_kwargs", {}).get("auto_eval_on_rewrite", False)
+
+    def shortcut_features(self):
+        features = []
+        if self._auto_eval_on_rewrite():
+            features.append(
+                "After successful rewrites, the environment will automatically "
+                "call the Eval tool to evaluate the rewritten code. Therefore, "
+                "you do not need to call the Eval tool yourself. The evaluation "
+                "output will be updated automatically in the system prompt."
             )
-        else:
-            system_prompt["Repo directory tree"] = info.dir_tree
-        system_prompt["Current breakpoints"] = info.current_breakpoints
+        if self.env.has_tool("pdb"):
+            if self.config.get("env_kwargs", {}).get("persistent_breakpoints"):
+                features.append(
+                    "The environment will automatically restore existing breakpoints "
+                    "when a new PDB session is started (e.g., after a rewrite)."
+                )
+            if self.config.get("env_kwargs", {}).get("auto_list"):
+                features.append(
+                    "After every valid PDB tool calling, the environment will "
+                    "automatically call the PDB tool again with a `list .` command, "
+                    "which will show the code around the current frame."
+                )
+        return features
+
+    @staticmethod
+    def to_pretty_json(value):
+        """Convert a value to a pretty JSON string."""
+        return json.dumps(value, indent=2, sort_keys=False)
 
-        if self.llm.context_length is not None and self.llm.count_tokens is not None:
-            system_prompt["Evaluation output of current code"] = trim(
+    def trim_message(
+        self,
+        message,
+        count_tokens=None,
+        max_length=None,
+        max_length_percentage=0,
+        where="middle",
+    ):
+        """Filter non utf8 and trim the message to fit within the token limit.
+        If the message exceeds the max_length, it will be trimmed to fit.
+        The `max_length` can be specified as an absolute value or a percentage
+        of the LLM's context length, if any."""
+        message = filter_non_utf8(message)
+        count_tokens = count_tokens or self.llm.count_tokens
+        max_length = (
+            max_length
+            or max_length_percentage * self.llm.context_length
+            or self.llm.context_length
+        )
+
+        if count_tokens is None or max_length is None or max_length <= 0:
+            return message
+        tokens = count_tokens(message)
+        if tokens > max_length:
+            return trim(message, max_length, count_tokens=count_tokens, where=where)
+        return message
+
+    def _load_system_prompt_template(self) -> Template | None:
+        """Load system prompt template from config if specified and register custom filters.
+        If no template is specified, return None.
+        """
+        system_prompt_template = self.config.get("system_prompt_template_file")
+        if system_prompt_template:
+            if not os.path.isfile(system_prompt_template):
+                error_msg = (
+                    f"System prompt template file `{system_prompt_template}` not found."
+                )
+                self.logger.error(error_msg)
+                raise FileNotFoundError(error_msg)
+            with open(system_prompt_template, "r") as f:
+                system_prompt_template = f.read()
+            # Add custom filter to Jinja2 environment
+            env = Environment()
+            env.filters["to_pretty_json"] = self.to_pretty_json
+            env.filters["trim_message"] = self.trim_message
+            return env.from_string(system_prompt_template)
+        return None
+
+    def _default_system_prompt(self, info) -> str:
+        """Return the default system prompt as pretty JSON.
+        Trimmed to fit within the token limit."""
+
+        system_prompt_dict = {
+            "Overall task": self.system_prompt,
+            "Instructions": info.instructions,
+            "Repo directory tree": self.trim_message(
+                info.dir_tree, max_length_percentage=0.1, where="end"
+            ),
+            "Current breakpoints": info.current_breakpoints,
+        }
+
+        if self._auto_eval_on_rewrite():
+            system_prompt_dict["Evaluation output of current code"] = self.trim_message(
                 info.eval_observation.observation,
-                min(
-                    int(0.8 * self.llm.context_length), calc_tokens_left(system_prompt)
-                ),
-                count_tokens=self.llm.count_tokens,
+                max_length_percentage=0.8,
                 where="middle",
             )
-        else:
-            system_prompt["Evaluation output of current code"] = (
-                info.eval_observation.observation
-            )
 
-        shortcut_features = []
-        if self.config.get("env_kwargs", {}).get("auto_eval_on_rewrite") is True:
-            shortcut_features.append(
-                "After successful rewrites, the environment will automatically call the Eval tool to evaluate the rewritten code. Therefore, you do not need to call the Eval tool yourself. The evaluation output will be updated automatically in the system prompt."
-            )
-        if self.config.get("env_kwargs", {}).get(
-            "persistent_breakpoints"
-        ) is True and self.env.has_tool("pdb"):
-            shortcut_features.append(
-                "The environment will automatically restore existing breakpoints when a new PDB session is started (e.g., after a rewrite)."
-            )
-        if self.config.get("env_kwargs", {}).get(
-            "auto_list"
-        ) is True and self.env.has_tool("pdb"):
-            shortcut_features.append(
-                "After every valid PDB tool calling, the environment will automatically call the PDB tool again with a `list .` command, which will show the code around the current frame."
-            )
-        if len(shortcut_features) > 0:
-            system_prompt["Shortcut features"] = shortcut_features
+        shortcut_features = self.shortcut_features()
+        if shortcut_features:
+            system_prompt_dict["Shortcut features"] = shortcut_features
 
-        system_prompt = filter_non_utf8(
-            json.dumps(system_prompt, indent=2, sort_keys=False)
-        )
-        messages = [
-            {
-                "role": "system",
-                "content": system_prompt,
-            }
-        ]
+        return self.to_pretty_json(system_prompt_dict)
+
+    def build_system_prompt(self, info):
+        """Build system prompt using jinja template from config or default template."""
+        system_prompt_template = self._load_system_prompt_template()
+        if system_prompt_template is not None:
+            system_prompt = system_prompt_template.render(agent=self, info=info)
+        else:
+            system_prompt = self._default_system_prompt(info)
+        messages = [{"role": "system", "content": filter_non_utf8(system_prompt)}]
         return messages
 
     def build_question_prompt(self):
@@ -146,7 +194,8 @@ def build_question_prompt(self):
         return messages
 
     def build_prompt(self, info):
-        messages = self.build_system_prompt(info)
+        messages = []
+        messages.extend(self.build_system_prompt(info))
         messages.extend(self.build_history_prompt())
         messages.extend(self.build_question_prompt())
         return messages
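
A note on the new `trim_message` helper added above: it resolves its limit through a chain of fallbacks, where an explicit `max_length` wins, otherwise `max_length_percentage` of the model's context window is used, otherwise the full context window. A standalone sketch of just that resolution logic, with a hypothetical `context_length` value standing in for the LLM object:

```python
def resolve_max_length(context_length, max_length=None, max_length_percentage=0):
    """Sketch of the fallback chain used by BaseAgent.trim_message."""
    # explicit absolute limit > percentage of the context window > whole context window
    return max_length or max_length_percentage * context_length or context_length


# with a hypothetical 128k-token context window:
print(resolve_max_length(128_000))                             # 128000 (no limit given)
print(resolve_max_length(128_000, max_length=1_000))           # 1000   (absolute limit wins)
print(resolve_max_length(128_000, max_length_percentage=0.1))  # 12800.0 (10% of the window)
```

Because the chain uses `or`, leaving `max_length_percentage` at its default of `0` falls through to the full context length, which matches the default behavior in the diff.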

debug_gym/agents/debug_agent.py

Lines changed: 0 additions & 1 deletion
@@ -1,5 +1,4 @@
 from debug_gym.agents.base_agent import BaseAgent, register_agent
-from debug_gym.llms.base import LLM
 
 
 @register_agent

debug_gym/agents/solution_agent.py

Lines changed: 0 additions & 4 deletions
@@ -1,8 +1,4 @@
-import subprocess
-
 from debug_gym.agents.base_agent import BaseAgent, register_agent
-from debug_gym.gym.envs.swe_bench import SWEBenchEnv
-from debug_gym.gym.envs.swe_smith import SWESmithEnv
 from debug_gym.gym.tools.tool import ToolCall
 
 

debug_gym/gym/envs/aider.py

Lines changed: 2 additions & 5 deletions
@@ -12,11 +12,8 @@ class AiderBenchmarkEnv(RepoEnv):
     REPO_PATH = Path.joinpath(Path.home(), ".cache", "debug_gym", "exercism")
 
     @property
-    def instructions(self):
-        return {
-            **super().instructions,
-            "Problem description": self.current_sample["instructions"],
-        }
+    def instructions(self) -> str:
+        return self.current_sample["instructions"]
 
     def __init__(self, entrypoint: str = "python -m pytest -s .", **kwargs):
         super().__init__(entrypoint=entrypoint, **kwargs)

debug_gym/gym/envs/env.py

Lines changed: 2 additions & 3 deletions
@@ -272,9 +272,8 @@ def cleanup_workspace(self):
         self.tempdir.cleanup()
 
     @property
-    def instructions(self):
-        _instruction = {}
-        return _instruction
+    def instructions(self) -> str:
+        return ""
 
     def display_files(self):
         msg = (

debug_gym/gym/envs/mini_nightmare.py

Lines changed: 2 additions & 5 deletions
@@ -22,11 +22,8 @@ class MiniNightmareEnv(RepoEnv):
     ]
 
     @property
-    def instructions(self):
-        return {
-            **super().instructions,
-            "Problem description": self.current_sample["instructions"],
-        }
+    def instructions(self) -> str:
+        return self.current_sample["instructions"]
 
     def __init__(self, entrypoint: str = "python -m pytest -s test.py", **kwargs):
         super().__init__(entrypoint=entrypoint, **kwargs)

debug_gym/gym/envs/swe_bench.py

Lines changed: 2 additions & 5 deletions
@@ -51,11 +51,8 @@ def __init__(
         self.test_directives = []
 
     @property
-    def instructions(self):
-        return {
-            **super().instructions,
-            "Problem description": self.ds_row["problem_statement"],
-        }
+    def instructions(self) -> str:
+        return self.ds_row["problem_statement"]
 
     def load_dataset(self):
         self.ds = datasets.load_dataset(self.dataset_id)[self.split]

requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -13,4 +13,5 @@ docker
 swebench==4.0.3
 swesmith
 prompt_toolkit
-anthropic>=0.49.0
+anthropic>=0.49.0
+jinja2
