
No eval shortcut #152


Merged
22 commits merged on Jun 13, 2025
Commits
91218b7
Disable auto_eval_on_rewrite
matheper Jun 10, 2025
8c55620
Use jinja2 to render system prompt
matheper Jun 10, 2025
c15846c
Load template from config or template file. Add base_agent tests.
matheper Jun 11, 2025
99ef814
Add jinja to requirements
matheper Jun 11, 2025
4a3cc9e
Render system prompt from agent + info data
matheper Jun 11, 2025
73ab884
Only load system prompt template from jinja2 file, not plain text.
matheper Jun 12, 2025
9e25dfe
Merge branch 'main' into no-eval-shortcut
matheper Jun 12, 2025
e4b0131
removed unused imports
matheper Jun 12, 2025
8085ece
Refactor system prompt template to use Jinja2 for rendering and add c…
matheper Jun 12, 2025
f3efe36
Add comments to optionally load a custom system prompt template in co…
matheper Jun 12, 2025
bd04e82
Disable auto_eval_on_rewrite for swe-smith
matheper Jun 12, 2025
088ee63
minor
xingdi-eric-yuan Jun 12, 2025
df9de01
Build default system prompt from dict. Add trim back
matheper Jun 12, 2025
6e271d9
fix test
matheper Jun 12, 2025
76bf08a
Simplify env instruction for better rendering
matheper Jun 12, 2025
cddf22a
removed BASE_SYSTEM_PROMPT_TEMPLATE
matheper Jun 12, 2025
cf35191
Update readme with instructions to use jinja
matheper Jun 12, 2025
b6e0b3b
Agent filter trims from middle by default
matheper Jun 12, 2025
b2a6800
Add human friendly system prompt template
matheper Jun 12, 2025
9a3fdf7
Add jinja templates to MANIFEST.in
matheper Jun 12, 2025
5859545
Merge pull request #155 from microsoft/default-dict-template
matheper Jun 12, 2025
b98a343
merge agents tests
matheper Jun 12, 2025
3 changes: 2 additions & 1 deletion MANIFEST.in
@@ -1 +1,2 @@
include debug_gym/envs/configs/*.yaml
include scripts/templates/*.jinja
73 changes: 67 additions & 6 deletions README.md
@@ -137,28 +137,89 @@ We provide a human mode that enables developers to manually interact with `debug

#### 3.3. Overriding Values in Config

`-p` is a handy way to override values defined in config. For example, the below command will run rewrite_agent agent on Aider with human mode (while in config file it specifies gpt-4o).
The `-p` flag is a handy way to override values defined in the config file. For example, the command below runs the debug_agent agent on Aider in human mode (even if the config file specifies gpt-4o). It also overrides the default system prompt (see below for more information).

python scripts/run.py scripts/config_aider.yaml --agent rewrite_agent -v -p rewrite_agent.llm_name="human"
python scripts/run.py scripts/config_aider.yaml \
    --agent debug_agent \
    -v \
    -p debug_agent.llm_name="human" \
    -p debug_agent.system_prompt_template_file="scripts/templates/human_friendly_system_prompt.jinja"

#### 3.4. Debugging a Custom Repository

#### 3.4. Customizing the System Prompt with Jinja Templates

`debug-gym` allows you to fully customize the system prompt by providing a [Jinja](https://jinja.palletsprojects.com/) template file. This enables you to control the format and content of the prompt sent to the LLM, making it easier to adapt the environment to your specific needs or research experiments.

To use a custom system prompt template, specify the path to your Jinja template file in your agent's configuration under `system_prompt_template_file`. For example:

```yaml
debug_agent:
  system_prompt_template_file: scripts/templates/custom_system_prompt.jinja
```

Alternatively, you can provide a custom template from the command line with `-p <agent>.system_prompt_template_file="<path/to/template.jinja>"` (see above).

Within your Jinja template, you have access to the `agent` and `info` objects, which provide all relevant context about the current environment and agent state.

#### Custom Jinja Filters

In addition to all [built-in Jinja filters](https://jinja.palletsprojects.com/en/stable/templates/#list-of-builtin-filters), two custom filters are available for use in your template:

- **`to_pretty_json`**: Converts a Python object to a pretty-printed JSON string. Useful for displaying structured data in a readable format.
```jinja
{{ info.tools | to_pretty_json }}
```
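Under the hood, this filter is a thin wrapper around Python's `json.dumps` (as added to `base_agent.py` in this PR); a minimal standalone sketch:

```python
import json

def to_pretty_json(value):
    # Pretty-print with 2-space indentation, preserving key order
    # (mirrors the agent's to_pretty_json filter)
    return json.dumps(value, indent=2, sort_keys=False)

print(to_pretty_json({"name": "pdb", "arguments": {"command": "b 10"}}))
```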

- **`trim_message`**: Trims a string to fit within a token or character limit, also filtering out non-UTF8 characters. This is helpful for ensuring that large outputs (such as directory trees or evaluation results) do not exceed the LLM's context window. The `trim_message` filter accepts the following arguments to control how messages are trimmed:
  - **`max_length`**: The maximum number of tokens to keep in the message. If the message exceeds this length, it will be trimmed.
  - **`max_length_percentage`**: Instead of specifying an absolute number, you can provide a percentage (e.g., `0.1` for 10%) of the LLM's context window. The message will be trimmed to fit within this percentage of the model's maximum context length.
  - **`where`**: Specifies where to trim the message if it exceeds the limit. The default is `"middle"`, which trims from the middle of the message. Other options are `"start"` and `"end"`.

```jinja
{{ info.dir_tree | trim_message(max_length_percentage=0.1, where="end") }}
```
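To illustrate the trimming semantics, here is a simplified character-based sketch; the real filter counts tokens via the LLM's tokenizer and also strips non-UTF8 characters, and the helper name `trim_chars` is invented for this illustration:

```python
def trim_chars(message, max_length, where="middle"):
    # Character-based stand-in for the token-based trim_message filter
    if len(message) <= max_length:
        return message
    marker = "..."
    keep = max_length - len(marker)
    if where == "end":
        return message[:keep] + marker
    if where == "start":
        return marker + message[-keep:]
    # default "middle": keep the head and tail, cut the center
    head = keep // 2
    tail = keep - head
    return message[:head] + marker + message[-tail:]

tree = "\n".join(f"src/module_{i}.py" for i in range(200))
print(len(trim_chars(tree, max_length=80, where="end")))  # 80
```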

#### Example Template

```jinja
System Prompt for Debug-Gym

Task: {{ agent.system_prompt }}

Instructions:
{{ info.instructions }}

Directory Tree:
{{ info.dir_tree | trim_message(max_length=1000) }}

Current Breakpoints:
{{ info.current_breakpoints | to_pretty_json }}

{% if agent.shortcut_features() %}
Shortcut Features:
{{ agent.shortcut_features() | to_pretty_json }}
{% endif %}
```


#### 3.5. Debugging a Custom Repository

Modify `scripts/config.yaml`, especially the `env_kwargs`, to set the path and entrypoint of the custom repository. We assume the repository contains a `.debugignore` file and a `.debugreadonly` file, which label files/folders that are invisible or read-only to the agent, respectively.
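As a purely hypothetical illustration (this page does not document the pattern syntax), such files might contain gitignore-style patterns:

```
# .debugignore — hidden from the agent (illustrative patterns)
.git/
__pycache__/
*.log

# .debugreadonly — visible but not editable (illustrative patterns)
tests/
```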

As an example, we provide a buggy pytorch code repository in `data/pytorch`.

python scripts/run.py scripts/config.yaml --agent <agent name>

#### 3.5. Debugging a Custom SWE-Smith Instance
#### 3.6. Debugging a Custom SWE-Smith Instance

[SWE-Smith](https://github.com/SWE-bench/SWE-smith) allows generating new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), one can override `-p base.env_kwargs.dataset_id=<dataset_id>` on the command line to run the agent on that dataset. For example, to run on a local dataset:

python scripts/run.py scripts/config_swesmith.yaml --agent <agent name> -p base.env_kwargs.dataset_id="path/to/local/dataset"

#### 3.6. Design Your Own Tool
#### 3.7. Design Your Own Tool
`debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new tools that are `debug-gym`-compatible, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).

#### 3.7. Analysis and Visualization
#### 3.8. Analysis and Visualization

We provide a set of scripts to help analyze the log files (e.g., the `.jsonl` files) generated by the agent.
- In the `analysis` folder, we provide the scripts used to generate the corresponding figures in our technical report.
171 changes: 110 additions & 61 deletions debug_gym/agents/base_agent.py
@@ -2,10 +2,10 @@
import os
import subprocess
import uuid
from collections import OrderedDict
from os.path import join as pjoin

import numpy as np
from jinja2 import Environment, Template

from debug_gym.agents.history_tracker import HistoryTracker, build_history_prompt
from debug_gym.agents.utils import trim
@@ -72,72 +72,120 @@ def parse_reasoning_model_response(self, response, reasoning_end_token):
response = response[reasoning_end:].strip()
return response

def build_system_prompt(self, info):
def calc_tokens_left(system_prompt: dict):
system_prompt = filter_non_utf8(
json.dumps(system_prompt, indent=2, sort_keys=False)
)
return self.llm.context_length - self.llm.count_tokens(system_prompt)

system_prompt = OrderedDict()
system_prompt["Overall task"] = self.system_prompt
system_prompt["Instructions"] = info.instructions
if self.llm.context_length is not None and self.llm.count_tokens is not None:
system_prompt["Repo directory tree"] = trim(
info.dir_tree,
min(
int(0.1 * self.llm.context_length), calc_tokens_left(system_prompt)
),
count_tokens=self.llm.count_tokens,
where="end",
def _auto_eval_on_rewrite(self):
"""Check if auto eval on rewrite is enabled."""
return self.config.get("env_kwargs", {}).get("auto_eval_on_rewrite", False)

def shortcut_features(self):
features = []
if self._auto_eval_on_rewrite():
features.append(
"After successful rewrites, the environment will automatically "
"call the Eval tool to evaluate the rewritten code. Therefore, "
"you do not need to call the Eval tool yourself. The evaluation "
"output will be updated automatically in the system prompt."
)
else:
system_prompt["Repo directory tree"] = info.dir_tree
system_prompt["Current breakpoints"] = info.current_breakpoints
if self.env.has_tool("pdb"):
if self.config.get("env_kwargs", {}).get("persistent_breakpoints"):
features.append(
"The environment will automatically restore existing breakpoints "
"when a new PDB session is started (e.g., after a rewrite)."
)
if self.config.get("env_kwargs", {}).get("auto_list"):
features.append(
"After every valid PDB tool calling, the environment will "
"automatically call the PDB tool again with a `list .` command, "
"which will show the code around the current frame."
)
return features

@staticmethod
def to_pretty_json(value):
"""Convert a value to a pretty JSON string."""
return json.dumps(value, indent=2, sort_keys=False)

if self.llm.context_length is not None and self.llm.count_tokens is not None:
system_prompt["Evaluation output of current code"] = trim(
def trim_message(
self,
message,
count_tokens=None,
max_length=None,
max_length_percentage=0,
where="middle",
):
"""Filter non utf8 and trim the message to fit within the token limit.
If the message exceeds the max_length, it will be trimmed to fit.
The `max_length` can be specified as an absolute value or a percentage
of the LLM's context length, if any."""
message = filter_non_utf8(message)
count_tokens = count_tokens or self.llm.count_tokens
max_length = (
max_length
or max_length_percentage * self.llm.context_length
or self.llm.context_length
)

if count_tokens is None or max_length is None or max_length <= 0:
return message
tokens = count_tokens(message)
if tokens > max_length:
return trim(message, max_length, count_tokens=count_tokens, where=where)
return message

def _load_system_prompt_template(self) -> Template | None:
"""Load system prompt template from config if specified and register custom filters.
If no template is specified, return None.
"""
system_prompt_template = self.config.get("system_prompt_template_file")
if system_prompt_template:
if not os.path.isfile(system_prompt_template):
error_msg = (
f"System prompt template file `{system_prompt_template}` not found."
)
self.logger.error(error_msg)
raise FileNotFoundError(error_msg)
with open(system_prompt_template, "r") as f:
system_prompt_template = f.read()
# Add custom filter to Jinja2 environment
env = Environment()
env.filters["to_pretty_json"] = self.to_pretty_json
env.filters["trim_message"] = self.trim_message
return env.from_string(system_prompt_template)
return None

def _default_system_prompt(self, info) -> str:
"""Return the default system prompt as pretty JSON.
Trimmed to fit within the token limit."""

system_prompt_dict = {
"Overall task": self.system_prompt,
"Instructions": info.instructions,
"Repo directory tree": self.trim_message(
info.dir_tree, max_length_percentage=0.1, where="end"
),
"Current breakpoints": info.current_breakpoints,
}

if self._auto_eval_on_rewrite():
system_prompt_dict["Evaluation output of current code"] = self.trim_message(
info.eval_observation.observation,
min(
int(0.8 * self.llm.context_length), calc_tokens_left(system_prompt)
),
count_tokens=self.llm.count_tokens,
max_length_percentage=0.8,
where="middle",
)
else:
system_prompt["Evaluation output of current code"] = (
info.eval_observation.observation
)

shortcut_features = []
if self.config.get("env_kwargs", {}).get("auto_eval_on_rewrite") is True:
shortcut_features.append(
"After successful rewrites, the environment will automatically call the Eval tool to evaluate the rewritten code. Therefore, you do not need to call the Eval tool yourself. The evaluation output will be updated automatically in the system prompt."
)
if self.config.get("env_kwargs", {}).get(
"persistent_breakpoints"
) is True and self.env.has_tool("pdb"):
shortcut_features.append(
"The environment will automatically restore existing breakpoints when a new PDB session is started (e.g., after a rewrite)."
)
if self.config.get("env_kwargs", {}).get(
"auto_list"
) is True and self.env.has_tool("pdb"):
shortcut_features.append(
"After every valid PDB tool calling, the environment will automatically call the PDB tool again with a `list .` command, which will show the code around the current frame."
)
if len(shortcut_features) > 0:
system_prompt["Shortcut features"] = shortcut_features
shortcut_features = self.shortcut_features()
if shortcut_features:
system_prompt_dict["Shortcut features"] = shortcut_features

system_prompt = filter_non_utf8(
json.dumps(system_prompt, indent=2, sort_keys=False)
)
messages = [
{
"role": "system",
"content": system_prompt,
}
]
return self.to_pretty_json(system_prompt_dict)

def build_system_prompt(self, info):
"""Build system prompt using jinja template from config or default template."""
system_prompt_template = self._load_system_prompt_template()
if system_prompt_template is not None:
system_prompt = system_prompt_template.render(agent=self, info=info)
else:
system_prompt = self._default_system_prompt(info)
messages = [{"role": "system", "content": filter_non_utf8(system_prompt)}]
return messages

def build_question_prompt(self):
@@ -146,7 +194,8 @@ def build_question_prompt(self):
return messages

def build_prompt(self, info):
messages = self.build_system_prompt(info)
messages = []
messages.extend(self.build_system_prompt(info))
messages.extend(self.build_history_prompt())
messages.extend(self.build_question_prompt())
return messages
1 change: 0 additions & 1 deletion debug_gym/agents/debug_agent.py
@@ -1,5 +1,4 @@
from debug_gym.agents.base_agent import BaseAgent, register_agent
from debug_gym.llms.base import LLM


@register_agent
4 changes: 0 additions & 4 deletions debug_gym/agents/solution_agent.py
@@ -1,8 +1,4 @@
import subprocess

from debug_gym.agents.base_agent import BaseAgent, register_agent
from debug_gym.gym.envs.swe_bench import SWEBenchEnv
from debug_gym.gym.envs.swe_smith import SWESmithEnv
from debug_gym.gym.tools.tool import ToolCall


7 changes: 2 additions & 5 deletions debug_gym/gym/envs/aider.py
@@ -12,11 +12,8 @@ class AiderBenchmarkEnv(RepoEnv):
REPO_PATH = Path.joinpath(Path.home(), ".cache", "debug_gym", "exercism")

@property
def instructions(self):
return {
**super().instructions,
"Problem description": self.current_sample["instructions"],
}
def instructions(self) -> str:
return self.current_sample["instructions"]

def __init__(self, entrypoint: str = "python -m pytest -s .", **kwargs):
super().__init__(entrypoint=entrypoint, **kwargs)
5 changes: 2 additions & 3 deletions debug_gym/gym/envs/env.py
@@ -272,9 +272,8 @@ def cleanup_workspace(self):
self.tempdir.cleanup()

@property
def instructions(self):
_instruction = {}
return _instruction
def instructions(self) -> str:
return ""

def display_files(self):
msg = (
7 changes: 2 additions & 5 deletions debug_gym/gym/envs/mini_nightmare.py
@@ -22,11 +22,8 @@ class MiniNightmareEnv(RepoEnv):
]

@property
def instructions(self):
return {
**super().instructions,
"Problem description": self.current_sample["instructions"],
}
def instructions(self) -> str:
return self.current_sample["instructions"]

def __init__(self, entrypoint: str = "python -m pytest -s test.py", **kwargs):
super().__init__(entrypoint=entrypoint, **kwargs)
7 changes: 2 additions & 5 deletions debug_gym/gym/envs/swe_bench.py
@@ -51,11 +51,8 @@ def __init__(
self.test_directives = []

@property
def instructions(self):
return {
**super().instructions,
"Problem description": self.ds_row["problem_statement"],
}
def instructions(self) -> str:
return self.ds_row["problem_statement"]

def load_dataset(self):
self.ds = datasets.load_dataset(self.dataset_id)[self.split]
3 changes: 2 additions & 1 deletion requirements.txt
@@ -13,4 +13,5 @@ docker
swebench==4.0.3
swesmith
prompt_toolkit
anthropic>=0.49.0
jinja2
4 changes: 3 additions & 1 deletion scripts/config.yaml
@@ -8,7 +8,7 @@ base:
"dir_tree_depth": 1,
"run_timeout": 10,
# shortcut features
"auto_eval_on_rewrite": True, # The environment will automatically call the Eval tool after a successful rewrite. If this is set to True, the agent does not need to call the Eval tool itself.
"auto_eval_on_rewrite": False, # The environment will automatically call the Eval tool after a successful rewrite. If this is set to True, the agent does not need to call the Eval tool itself.
"persistent_breakpoints": True, # The environemnt will keep a set of breakpoint states across PDB sessions. When a new PDB session is started, the environment will automatically load the breakpoints from the previous session.
"auto_list": True, # The environment will automatically call `list .` via the PDB tool after every pdb tool call, which will show the code around the current frame.
}
@@ -33,6 +33,8 @@ base:
save_patch: True
log_prompt_response_pairs: True
reset_prompt_history_after_rewrite: True
# Optionally load a custom system prompt template from a file.
# system_prompt_template_file: "scripts/templates/system_prompt.jinja"

rewrite_agent:
tools: ["view", "rewrite", "eval"]