8 changes: 8 additions & 0 deletions configs/eval/debug.toml
@@ -0,0 +1,8 @@
model = "openai/gpt-5-mini"

[[eval]]
env_id = "primeintellect/wiki-search"


[[eval]]
env_id = "primeintellect/math-python"
3 changes: 3 additions & 0 deletions configs/eval/minimal.toml
@@ -0,0 +1,3 @@
[[eval]]
env_id = "gsm8k"

101 changes: 98 additions & 3 deletions docs/evaluation.md
@@ -12,6 +12,9 @@ This section explains how to run evaluations with Verifiers environments. See [E
- [Concurrency](#concurrency)
- [Output and Saving](#output-and-saving)
- [Environment Defaults](#environment-defaults)
- [Multi-Environment Evaluation](#multi-environment-evaluation)
- [TOML Configuration](#toml-configuration)
- [Configuration Precedence](#configuration-precedence)

Use `prime eval` to execute rollouts against any OpenAI-compatible model and report aggregate metrics.

@@ -32,12 +35,16 @@ prime eval run my-env -m gpt-4.1-mini -n 10

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `env_id` | (positional) | — | Environment module name (e.g., `my-env` or `gsm8k`) |
| `env_id_or_path` | (positional) | — | Environment ID(s) or path to TOML config |
| `--env-args` | `-a` | `{}` | JSON object passed to `load_environment()` |
| `--extra-env-kwargs` | `-x` | `{}` | JSON object passed to environment constructor |
| `--env-dir-path` | `-p` | `./environments` | Base path for saving output files |

The `env_id` is converted to a Python module name (`my-env` → `my_env`) and imported. The module must be installed (via `vf-install` or `uv pip install`).
The positional argument accepts two formats:
- **Single environment**: `gsm8k` — evaluates one environment
- **TOML config path**: `configs/eval/benchmark.toml` — evaluates multiple environments defined in the config file

Environment IDs are converted to Python module names (`my-env` → `my_env`) and imported. Modules must be installed (via `prime env install` or `uv pip install`).
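
For example, using the `gsm8k` environment and an illustrative config path:

```bash
# Run a single installed environment
prime eval run gsm8k

# Run every environment defined in a TOML config
prime eval run configs/eval/benchmark.toml
```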

The `--env-args` flag passes arguments to your `load_environment()` function:
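
A minimal sketch (the `difficulty` and `split` keys are assumptions borrowed from the TOML example below; valid keys depend on the environment's `load_environment()` signature):

```bash
prime eval run math-python -a '{"difficulty": "hard", "split": "test"}'
```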

@@ -152,4 +159,92 @@ num_examples = 100
rollouts_per_example = 5
```

These defaults are used when flags aren't explicitly provided. Priority order: CLI flags → environment defaults → global defaults.
These defaults are used when higher-priority sources don't specify a value. The full priority order is:

1. TOML per-environment settings (when using a config file)
2. TOML global settings (when using a config file)
3. CLI flags (used only when no config file is given)
4. Environment defaults (from `pyproject.toml`)
5. Built-in defaults (`num_examples=5`, `rollouts_per_example=3`)

See [Configuration Precedence](#configuration-precedence) for more details on multi-environment evaluation.

## Multi-Environment Evaluation

You can evaluate multiple environments using `prime eval` with a TOML configuration file. This is useful for running comprehensive benchmark suites.

### TOML Configuration

For multi-environment evals or fine-grained control over settings, use a TOML configuration file. When using a config file, CLI arguments are ignored.

```bash
prime eval run configs/eval/my-benchmark.toml
```

The TOML file uses `[[eval]]` sections to define each evaluation. You can also specify global defaults at the top:

```toml
# configs/eval/my-benchmark.toml

# Global defaults (optional)
model = "openai/gpt-4.1-mini"
num_examples = 50

[[eval]]
env_id = "gsm8k"
num_examples = 100 # overrides global default
rollouts_per_example = 5

[[eval]]
env_id = "alphabet-sort"
# Uses global num_examples (50)
rollouts_per_example = 3

[[eval]]
env_id = "math-python"
# Uses global defaults and built-in defaults for unspecified values
```

A minimal config requires only a single `[[eval]]` section:

```toml
[[eval]]
env_id = "gsm8k"
```

Each `[[eval]]` section must contain an `env_id` field. All other fields are optional:

| Field | Type | Description |
|-------|------|-------------|
| `env_id` | string | **Required.** Environment module name |
| `env_args` | table | Arguments passed to `load_environment()` |
| `num_examples` | integer | Number of dataset examples to evaluate |
| `rollouts_per_example` | integer | Rollouts per example |
| `extra_env_kwargs` | table | Arguments passed to environment constructor |
| `model` | string | Model to evaluate |

Example with `env_args`:

```toml
[[eval]]
env_id = "math-python"
num_examples = 50

[eval.env_args]
difficulty = "hard"
split = "test"
```

### Configuration Precedence

When using a **config file**, CLI arguments are ignored. Settings are resolved as:

1. **TOML per-eval settings** — Values specified in `[[eval]]` sections
2. **TOML global settings** — Values at the top of the config file
3. **Environment defaults** — Values from the environment's `pyproject.toml`
4. **Built-in defaults** — (`num_examples=5`, `rollouts_per_example=3`)

When using **CLI only** (no config file), settings are resolved as:

1. **CLI arguments** — Flags passed on the command line
2. **Environment defaults** — Values from the environment's `pyproject.toml`
3. **Built-in defaults** — (`num_examples=5`, `rollouts_per_example=3`)
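
As a sketch of how resolution plays out with a config file (assuming neither environment ships defaults in its `pyproject.toml`):

```toml
model = "openai/gpt-4.1-mini"  # TOML global setting
num_examples = 50              # TOML global setting

[[eval]]
env_id = "gsm8k"
num_examples = 100  # per-eval setting wins: 100 examples
# rollouts_per_example is unset everywhere: built-in default of 3 applies

[[eval]]
env_id = "math-python"
# num_examples falls through to the TOML global: 50 examples
```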