8 changes: 8 additions & 0 deletions configs/eval/debug.toml
@@ -0,0 +1,8 @@
model = "openai/gpt-5-mini"

[[eval]]
env_id = "primeintellect/wiki-search"


[[eval]]
env_id = "primeintellect/math-python"
3 changes: 3 additions & 0 deletions configs/eval/minimal.toml
@@ -0,0 +1,3 @@
[[eval]]
env_id = "gsm8k"

101 changes: 98 additions & 3 deletions docs/evaluation.md
@@ -12,6 +12,9 @@ This section explains how to run evaluations with Verifiers environments. See [E
- [Concurrency](#concurrency)
- [Output and Saving](#output-and-saving)
- [Environment Defaults](#environment-defaults)
- [Multi-Environment Evaluation](#multi-environment-evaluation)
- [TOML Configuration](#toml-configuration)
- [Configuration Precedence](#configuration-precedence)

Use `prime eval` to execute rollouts against any OpenAI-compatible model and report aggregate metrics.

@@ -32,12 +35,16 @@ prime eval run my-env -m gpt-4.1-mini -n 10

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `env_id` | (positional) | — | Environment module name (e.g., `my-env` or `gsm8k`) |
| `env_id_or_path` | (positional) | — | Environment ID(s) or path to TOML config |
| `--env-args` | `-a` | `{}` | JSON object passed to `load_environment()` |
| `--extra-env-kwargs` | `-x` | `{}` | JSON object passed to environment constructor |
| `--env-dir-path` | `-p` | `./environments` | Base path for saving output files |

The `env_id` is converted to a Python module name (`my-env` → `my_env`) and imported. The module must be installed (via `vf-install` or `uv pip install`).
The positional argument accepts two formats:
- **Single environment**: `gsm8k` — evaluates one environment
- **TOML config path**: `configs/eval/benchmark.toml` — evaluates multiple environments defined in the config file

Environment IDs are converted to Python module names (`my-env` → `my_env`) and imported. Modules must be installed (via `prime env install` or `uv pip install`).
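
For example, using the `gsm8k` environment and an illustrative config path:

```bash
# Run a single installed environment
prime eval run gsm8k

# Run every environment defined in a TOML config
prime eval run configs/eval/benchmark.toml
```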

The `--env-args` flag passes arguments to your `load_environment()` function:
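
A minimal sketch (the `difficulty` and `split` keys are assumptions borrowed from the TOML example below; valid keys depend on the environment's `load_environment()` signature):

```bash
prime eval run math-python -a '{"difficulty": "hard", "split": "test"}'
```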

@@ -152,4 +159,92 @@ num_examples = 100
rollouts_per_example = 5
```

These defaults are used when flags aren't explicitly provided. Priority order: CLI flags → environment defaults → global defaults.
These defaults are used when higher-priority sources don't specify a value. The full priority order is:

1. TOML per-environment settings (when using a config file)
2. TOML global settings (when using a config file)
3. CLI flags (used only when no config file is given)
4. Environment defaults (from `pyproject.toml`)
5. Built-in defaults (`num_examples=5`, `rollouts_per_example=3`)

See [Configuration Precedence](#configuration-precedence) for more details on multi-environment evaluation.

## Multi-Environment Evaluation

You can evaluate multiple environments using `prime eval` with a TOML configuration file. This is useful for running comprehensive benchmark suites.

### TOML Configuration

For multi-environment evals or fine-grained control over settings, use a TOML configuration file. When using a config file, CLI arguments are ignored.

```bash
prime eval run configs/eval/my-benchmark.toml
```

The TOML file uses `[[eval]]` sections to define each evaluation. You can also specify global defaults at the top:

```toml
# configs/eval/my-benchmark.toml

# Global defaults (optional)
model = "openai/gpt-4.1-mini"
num_examples = 50

[[eval]]
env_id = "gsm8k"
num_examples = 100 # overrides global default
rollouts_per_example = 5

[[eval]]
env_id = "alphabet-sort"
# Uses global num_examples (50)
rollouts_per_example = 3

[[eval]]
env_id = "math-python"
# Uses global defaults and built-in defaults for unspecified values
```

A minimal config requires only a single `[[eval]]` section:

```toml
[[eval]]
env_id = "gsm8k"
```

Each `[[eval]]` section must contain an `env_id` field. All other fields are optional:

| Field | Type | Description |
|-------|------|-------------|
| `env_id` | string | **Required.** Environment module name |
| `env_args` | table | Arguments passed to `load_environment()` |
| `num_examples` | integer | Number of dataset examples to evaluate |
| `rollouts_per_example` | integer | Rollouts per example |
| `extra_env_kwargs` | table | Arguments passed to environment constructor |
| `model` | string | Model to evaluate |

Example with `env_args`:

```toml
[[eval]]
env_id = "math-python"
num_examples = 50

[eval.env_args]
difficulty = "hard"
split = "test"
```

### Configuration Precedence

When using a **config file**, CLI arguments are ignored. Settings are resolved as:

1. **TOML per-eval settings** — Values specified in `[[eval]]` sections
2. **TOML global settings** — Values at the top of the config file
3. **Environment defaults** — Values from the environment's `pyproject.toml`
4. **Built-in defaults** — (`num_examples=5`, `rollouts_per_example=3`)

When using **CLI only** (no config file), settings are resolved as:

1. **CLI arguments** — Flags passed on the command line
2. **Environment defaults** — Values from the environment's `pyproject.toml`
3. **Built-in defaults** — (`num_examples=5`, `rollouts_per_example=3`)
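
As a sketch of how resolution plays out with a config file (assuming neither environment ships defaults in its `pyproject.toml`):

```toml
model = "openai/gpt-4.1-mini"  # TOML global setting
num_examples = 50              # TOML global setting

[[eval]]
env_id = "gsm8k"
num_examples = 100  # per-eval setting wins: 100 examples
# rollouts_per_example is unset everywhere: built-in default of 3 applies

[[eval]]
env_id = "math-python"
# num_examples falls through to the TOML global: 50 examples
```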