Merged
23 commits
07d75d5
eval improvement Phase 1: Add filelock and xxhash dependencies to pyp…
JoyboyBrian Feb 21, 2026
fa0ea8c
eval enhancement phase 2: Enhance EvalRunner with batch processing ca…
JoyboyBrian Feb 21, 2026
533098d
eval enhancement phase 3: Implemented 'run_batch' method for concurre…
JoyboyBrian Feb 21, 2026
36e410f
eval improvement phase 4: Implement rubric cache migration and enhanc…
JoyboyBrian Feb 21, 2026
d3b05f3
eval enhancement phase 5: Introduce improved error handling in cache …
JoyboyBrian Feb 21, 2026
82419ca
Enhance evaluation framework: Introduce retry mechanism for failed ru…
JoyboyBrian Feb 21, 2026
09bfe76
Enhance evaluation framework: Add dataset fingerprint verification to…
JoyboyBrian Feb 21, 2026
c84c866
Enhance EvalCommand output: Add user tips for cache re-running and lo…
JoyboyBrian Feb 21, 2026
76fabda
Enhance module fingerprinting and progress tracking: Introduce suppor…
JoyboyBrian Feb 21, 2026
c344747
Refine EvalOrchestrator's retry mechanism: Update handling of complet…
JoyboyBrian Feb 21, 2026
e157719
Enhance LLM client and evaluation framework: Add properties for API k…
JoyboyBrian Feb 22, 2026
e4ba238
Refactor error handling in cache operations: Replace print statements…
JoyboyBrian Feb 23, 2026
deff10f
Refactor cache functions for improved clarity and consistency: Rename…
JoyboyBrian Feb 23, 2026
8c2127b
Refactor logging in session and cache management: Replace print state…
JoyboyBrian Feb 23, 2026
711d111
Enhance eval functionality and caching: Update documentation to inclu…
JoyboyBrian Feb 23, 2026
695f1c5
Enhance CLI cache management: Introduce new commands for listing and …
JoyboyBrian Feb 23, 2026
6886c9a
format
JoyboyBrian Feb 23, 2026
7ecc181
fix pyright type checking issue
JoyboyBrian Feb 23, 2026
cf3d3f1
address pyright type checking issues
JoyboyBrian Feb 23, 2026
847974d
Improve error handling and logging in evaluation components: Enhance …
JoyboyBrian Feb 23, 2026
b66af47
Enable address reuse for local server socket: Add socket option to al…
JoyboyBrian Feb 23, 2026
2a9b8e4
address cubic review
JoyboyBrian Feb 23, 2026
af05cf5
fix
JoyboyBrian Feb 23, 2026
63 changes: 61 additions & 2 deletions docs/cli.md
Expand Up @@ -77,7 +77,7 @@ See [Test Mode](./test-mode.md) for full documentation on dataset format, intera

### osmosis eval

Evaluate trained models with custom eval functions and pass@k metrics. Works with both Local Rollout (MCP tools) and Remote Rollout (RolloutAgentLoop) agents.
Evaluate trained models with custom eval functions and pass@k metrics. Works with both Local Rollout (MCP tools) and Remote Rollout (RolloutAgentLoop) agents. Results are automatically cached to disk so interrupted evaluations can be resumed.

```bash
osmosis eval -m server:agent_loop -d data.jsonl \
Expand All @@ -86,9 +86,68 @@ osmosis eval -m server:agent_loop -d data.jsonl \

osmosis eval --mcp ./mcp -d data.jsonl \
--eval-fn rewards:compute_reward --model openai/gpt-5-mini

# Resume automatically — re-run the same command after interruption
# Force fresh start, discarding cached results
osmosis eval -m server:agent_loop -d data.jsonl \
--eval-fn rewards:compute_reward --model my-model --fresh

# Re-run only failed runs from a previous evaluation
osmosis eval -m server:agent_loop -d data.jsonl \
--eval-fn rewards:compute_reward --model my-model --retry-failed

# Save conversation logs alongside results
osmosis eval -m server:agent_loop -d data.jsonl \
--eval-fn rewards:compute_reward --model my-model --log-samples
```

**Additional Options:**

| Option | Description |
|--------|-------------|
| `--fresh` | Force restart, discarding cached results |
| `--retry-failed` | Re-execute only failed runs (mutually exclusive with `--fresh`) |
| `--log-samples` | Save full conversation messages to JSONL |
| `--output-path DIR` | Write results to structured directory |

### osmosis eval cache

Manage the eval result cache.

```bash
# Print the cache root directory path
osmosis eval cache dir

# List cached evaluations
osmosis eval cache ls
osmosis eval cache ls --model gpt-4 --status completed

# Remove cached evaluations
osmosis eval cache rm <task_id>
osmosis eval cache rm --all --yes
osmosis eval cache rm --status in_progress --yes
```

See [Eval Mode](./eval-mode.md) for full documentation on eval functions, pass@k metrics, and output formats.
**`osmosis eval cache ls` options:**

| Option | Description |
|--------|-------------|
| `--model NAME` | Filter by model name (case-insensitive substring) |
| `--dataset NAME` | Filter by dataset path (case-insensitive substring) |
| `--status STATUS` | Filter by status (`in_progress` or `completed`) |

**`osmosis eval cache rm` options:**

| Option | Description |
|--------|-------------|
| `TASK_ID` | Task ID of the cache entry to delete (no confirmation) |
| `--all` | Delete all cached evaluations |
| `--model NAME` | Filter by model name (case-insensitive substring) |
| `--dataset NAME` | Filter by dataset path (case-insensitive substring) |
| `--status STATUS` | Filter by status (`in_progress` or `completed`) |
| `-y`, `--yes` | Skip confirmation prompt for batch deletions |

See [Eval Mode](./eval-mode.md) for full documentation on eval functions, pass@k metrics, caching, and output formats.
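The `--model`, `--dataset`, and `--status` filters described in the tables above can be sketched as follows. This is only an illustration of case-insensitive substring matching; `CacheEntry` and `filter_entries` are hypothetical names, not part of the osmosis API, and treating `--status` as an exact match is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    # Hypothetical record describing one cached evaluation.
    task_id: str
    model: str
    dataset: str
    status: str  # "in_progress" or "completed"


def filter_entries(
    entries: list[CacheEntry],
    model: Optional[str] = None,
    dataset: Optional[str] = None,
    status: Optional[str] = None,
) -> list[CacheEntry]:
    """Keep entries that satisfy every filter that was provided."""

    def matches(entry: CacheEntry) -> bool:
        # Model and dataset filters are case-insensitive substring matches.
        if model is not None and model.lower() not in entry.model.lower():
            return False
        if dataset is not None and dataset.lower() not in entry.dataset.lower():
            return False
        # Assumption: the status filter is an exact match.
        if status is not None and entry.status != status:
            return False
        return True

    return [entry for entry in entries if matches(entry)]
```

Combining filters narrows the selection, which is why a batch `rm` with filters asks for confirmation unless `--yes` is passed.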

## Remote Rollout Server

11 changes: 11 additions & 0 deletions docs/configuration.md
Expand Up @@ -40,6 +40,17 @@ Server-side configuration for the RolloutServer process.
|----------|------|---------|-------|-------------|
| `OSMOSIS_ROLLOUT_MAX_METADATA_SIZE_BYTES` | `int` | `1048576` (1 MB) | 1024 -- 104857600 (100 MB) | Maximum allowed size for rollout metadata in bytes. |

### Eval Cache Settings

These environment variables control the behavior of `osmosis eval` result caching.

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `OSMOSIS_CACHE_DIR` | `str` | - | Override the eval cache root directory. When set, cache files are stored under `$OSMOSIS_CACHE_DIR/eval/`. |
| `OSMOSIS_EVAL_LOCK_TIMEOUT` | `int` | `30` | Timeout in seconds for acquiring the cache file lock. If another eval with the same config is running, the process waits up to this duration before failing. Must be a positive integer. |

When `OSMOSIS_CACHE_DIR` is not set, the cache follows the XDG Base Directory convention: `$XDG_CACHE_HOME/osmosis/eval/` (defaults to `~/.cache/osmosis/eval/`).
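The lookup order described above can be sketched as a small function. This is a minimal illustration, not the SDK's code: `resolve_eval_cache_root` is a hypothetical name, and treating an empty variable as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_eval_cache_root(env: Mapping[str, str], home: Path) -> Path:
    """Return the eval cache root for the given environment mapping."""
    # 1. An explicit override wins: $OSMOSIS_CACHE_DIR/eval/
    override = env.get("OSMOSIS_CACHE_DIR")
    if override:  # assumption: an empty string counts as "not set"
        return Path(override) / "eval"
    # 2. Otherwise follow XDG: $XDG_CACHE_HOME/osmosis/eval/
    xdg = env.get("XDG_CACHE_HOME")
    base = Path(xdg) if xdg else home / ".cache"
    # 3. With neither variable set this yields ~/.cache/osmosis/eval/
    return base / "osmosis" / "eval"


if __name__ == "__main__":
    print(resolve_eval_cache_root(os.environ, Path.home()))
```

Passing the environment and home directory as arguments keeps the function easy to test without touching the real environment.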

## Programmatic Configuration

Configuration is organized into three Pydantic Settings classes. When the `pydantic-settings` package is installed (included in the `server` extra), these classes automatically read from environment variables and `.env` files. Without `pydantic-settings`, they fall back to plain Pydantic models that only accept values passed programmatically:
153 changes: 152 additions & 1 deletion docs/eval-mode.md
Expand Up @@ -80,10 +80,143 @@ Formula: `pass@k = 1 - C(n-c, k) / C(n, k)` where `n` = total runs, `c` = passin
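The pass@k formula quoted in the hunk header above can be checked with a few lines of Python. This is a sketch, assuming `n` is the total number of runs for a row and `c` is the number of passing runs (the header line is truncated, so that reading is an assumption); `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0
    # whenever fewer than k runs failed.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 runs of which 3 pass, `pass_at_k(10, 3, 1)` evaluates `1 - C(7, 1) / C(10, 1) = 1 - 7/10`.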

---

## Result Caching & Resume

Eval mode automatically caches results to disk so evaluations can be **interrupted and resumed** without losing progress. This is especially useful for long-running evaluations with many rows or multiple runs per row.

### How It Works

1. When an evaluation starts, a **task ID** is computed from the full configuration (model, dataset, eval functions, parameters, and source code fingerprints).
2. Results are written to a JSON cache file under `~/.cache/osmosis/eval/` (or `$OSMOSIS_CACHE_DIR/eval/`), organized by `{model}/{dataset}/`.
3. If the evaluation is interrupted (Ctrl+C, SIGTERM, or crash), re-running the **same command** automatically resumes from where it left off.
4. A file lock prevents concurrent evaluations with the same configuration from conflicting.
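Step 1 can be illustrated with a deterministic hash over the configuration. This is a sketch under stated assumptions: the real hashing scheme is not shown in these docs (the commit list mentions an `xxhash` dependency), so the example substitutes SHA-256 from the standard library, and `compute_task_id` and the 12-character length are hypothetical.

```python
import hashlib
import json


def compute_task_id(config: dict, length: int = 12) -> str:
    """Derive a short, stable identifier from an eval configuration."""
    # Serialize with sorted keys so that key order does not affect the hash.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Truncating keeps the ID readable; the shorter the ID, the more
    # likely two different configurations collide.
    return digest[:length]
```

Because the ID depends only on the configuration, re-running the same command maps to the same cache file, which is what makes automatic resume possible.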

### Cache Invalidation

The cache is keyed on a fingerprint that includes:

- **Module source code** — changes to your agent's `.py` file (or entire package directory for packages, or MCP directory) invalidate the cache.
- **Eval function source code** — changes to any eval function's source file invalidate the cache.
- **Dataset content** — the dataset file is fingerprinted; any modification is detected.
- **All CLI parameters** — model, base URL, `--n`, `--max-turns`, `--pass-threshold`, `--temperature`, `--max-tokens`, `--offset`, `--limit`.

If any of these change, a new cache entry is created automatically.
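Source fingerprinting of a single file or a package directory might look like the sketch below. It is an assumption-laden illustration: `fingerprint_source` is a hypothetical helper, SHA-256 stands in for whatever hash the SDK uses, and only `.py` files are considered for directories.

```python
import hashlib
from pathlib import Path


def fingerprint_source(path: Path) -> str:
    """Hash a single .py file, or every .py file under a directory."""
    single_file = path.is_file()
    # Sort directory listings so the result does not depend on OS ordering.
    files = [path] if single_file else sorted(path.rglob("*.py"))
    hasher = hashlib.sha256()
    for file in files:
        # Include the file name so that renames also change the fingerprint.
        name = file.name if single_file else file.relative_to(path).as_posix()
        hasher.update(name.encode("utf-8"))
        hasher.update(b"\0")
        hasher.update(file.read_bytes())
    return hasher.hexdigest()
```

A fingerprint like this only sees the files it reads, which is why changes in imported third-party libraries go unnoticed.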

> **Note**: Module fingerprinting covers the agent's own source file (or package directory). Changes to external dependencies (e.g., a library your agent imports) are **not** detected. Use `--fresh` to force a clean restart when external imports change.

### Resuming After Interruption

Simply re-run the exact same command:

```bash
# First run — interrupted at row 50/100
osmosis eval -m server:agent_loop -d data.jsonl --eval-fn rewards:score --model my-model
# ^C (interrupted)

# Second run — resumes from row 50
osmosis eval -m server:agent_loop -d data.jsonl --eval-fn rewards:score --model my-model
```

When resuming, the CLI displays:

```
Resuming eval (50/100 runs completed)
Note: Module fingerprint covers file server.py. External dependency changes
are not detected. Use --fresh if you changed external imports.
```

### Dataset Integrity

During evaluation, the dataset file is periodically re-checked (every 100 runs or 5 minutes) to detect modifications. If the dataset changes mid-evaluation, the run stops with an error:

```
Error: Dataset was modified during evaluation. Results may be inconsistent. Use --fresh to restart.
```

If a completed evaluation is loaded from cache but the dataset has since changed, a warning is displayed alongside the cached results.
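The "every 100 runs or 5 minutes" schedule can be sketched as a small policy object. This is illustrative only: `DatasetRecheckPolicy` is a hypothetical name, and triggering on whichever threshold is reached first is an assumption about how the two limits combine.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecheckPolicy:
    every_runs: int = 100
    every_seconds: float = 300.0
    runs_since_check: int = 0
    last_check_time: float = 0.0

    def record_run(self, now: float) -> bool:
        """Count one finished run; return True when a re-check is due."""
        self.runs_since_check += 1
        due = (
            self.runs_since_check >= self.every_runs
            or now - self.last_check_time >= self.every_seconds
        )
        if due:
            # Reset both counters once the caller is told to re-check.
            self.runs_since_check = 0
            self.last_check_time = now
        return due
```

When `record_run` returns `True`, the caller would recompute the dataset fingerprint and compare it with the one stored at the start of the evaluation.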

### Cache Management

```bash
# Print the cache root directory path
osmosis eval cache dir

# List all cached evaluations
osmosis eval cache ls

# List with filters
osmosis eval cache ls --model gpt-4
osmosis eval cache ls --status in_progress
osmosis eval cache ls --dataset my_data

# Remove a specific cached evaluation by task ID
osmosis eval cache rm <task_id>

# Remove all cached evaluations (with confirmation prompt)
osmosis eval cache rm --all

# Remove with filters (skip confirmation with -y)
osmosis eval cache rm --status in_progress --yes
osmosis eval cache rm --model gpt-4 --yes

# Force a fresh evaluation, discarding cached results
osmosis eval -m server:agent_loop -d data.jsonl --eval-fn rewards:score --model my-model --fresh

# Re-run only failed runs from a previous evaluation
osmosis eval -m server:agent_loop -d data.jsonl --eval-fn rewards:score --model my-model --retry-failed
```

The `--fresh` flag backs up existing cache files (with a `.backup.{timestamp}` suffix) before creating a new cache.

`--fresh` and `--retry-failed` are mutually exclusive.
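Mutually exclusive flags like these are commonly declared with `argparse`. The sketch below shows the pattern in isolation; whether the osmosis CLI is built on `argparse` is not stated here, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="eval-sketch")
    # argparse rejects any command line that sets both flags in the group.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--fresh", action="store_true",
                       help="discard cached results and start over")
    group.add_argument("--retry-failed", action="store_true",
                       help="re-execute only failed runs")
    return parser
```

Passing both flags makes `parse_args` print a usage error and exit with a non-zero status.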

---

## Conversation Logging

Use `--log-samples` to save the full conversation messages for each run to a JSONL file alongside the cache:

```bash
osmosis eval -m server:agent_loop -d data.jsonl --eval-fn rewards:score \
--model my-model --log-samples
```

Each line in the samples file is a JSON object containing `row_index`, `run_index`, `model_tag` (if using baseline comparison), and the full `messages` array.
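A samples file with this shape can be read back with the standard library. The sketch below groups records by `row_index`; `load_samples_by_row` is a hypothetical helper, and it relies only on the field names listed above.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_samples_by_row(path: Path) -> dict[int, list[dict]]:
    """Read a samples JSONL file and group the records by row_index."""
    by_row: dict[int, list[dict]] = defaultdict(list)
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            sample = json.loads(line)
            by_row[sample["row_index"]].append(sample)
    return dict(by_row)
```

Each record in the returned lists still carries its `run_index` and `messages`, so individual runs for a row can be inspected side by side.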

> **Note**: When resuming a previously interrupted evaluation, only new runs are logged. Prior runs from the cache do not have their messages retroactively saved. Use `--fresh --log-samples` if you need complete logs for all runs.

---

## Structured Output

Use `--output-path` to write results to a structured directory:

```bash
osmosis eval -m server:agent_loop -d data.jsonl --eval-fn rewards:score \
--model my-model --output-path ./results
```

This creates:

```
results/
{model}/
{dataset}/
results_{timestamp}_{task_id}.json
samples_{timestamp}_{task_id}.jsonl # if --log-samples is used
```

The results JSON uses the same schema as the internal cache file, with `status` always set to `"completed"`.

The legacy `-o`/`--output` flag is still supported and writes a single JSON file with the original nested format.
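The directory layout produced by `--output-path` can be expressed as a path-building helper. This is a sketch: `structured_result_paths` is a hypothetical name, the timestamp format is not specified in these docs, and how model names containing `/` are normalized is not shown, so the example passes already-normalized strings.

```python
from pathlib import Path


def structured_result_paths(
    output_dir: str, model: str, dataset: str, timestamp: str, task_id: str
) -> tuple[Path, Path]:
    """Return (results_path, samples_path) following the documented layout."""
    base = Path(output_dir) / model / dataset
    results = base / f"results_{timestamp}_{task_id}.json"
    # The samples file only exists when --log-samples is used.
    samples = base / f"samples_{timestamp}_{task_id}.jsonl"
    return results, samples
```

Sharing the `{timestamp}_{task_id}` suffix between the two files makes it easy to pair a results file with its conversation log.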

---

## CLI Reference

```
osmosis eval [OPTIONS]
osmosis eval cache dir
```

### Required Options
Expand Down Expand Up @@ -130,21 +263,38 @@ osmosis eval [OPTIONS]
| `--limit N` | all | Maximum rows to benchmark |
| `--offset N` | `0` | Number of rows to skip |

### Cache & Resume Options

| Option | Description |
|--------|-------------|
| `--fresh` | Force restart evaluation from scratch, discarding any cached results (backs up existing cache) |
| `--retry-failed` | Re-execute only failed runs from a previous evaluation. Mutually exclusive with `--fresh`. |

### Output Options

| Option | Description |
|--------|-------------|
| `-o, --output FILE` | Write results to JSON file |
| `-o, --output FILE` | Write results to JSON file (legacy format) |
| `--output-path DIR` | Write results to structured directory (`{model}/{dataset}/results_{ts}_{id}.json`) |
| `--log-samples` | Save full conversation messages to a JSONL file alongside the cache |
| `-q, --quiet` | Suppress progress output |
| `--debug` | Enable debug logging |

### Cache Management Subcommands

| Command | Description |
|---------|-------------|
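| `osmosis eval cache dir` | Print the cache root directory path |
| `osmosis eval cache ls` | List cached evaluations |
| `osmosis eval cache rm` | Remove cached evaluations |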

---

## Exceptions

| Exception | Description |
|-----------|-------------|
| `EvalFnError` | Eval function loading, signature detection, or execution error |
| `TimeoutError` | Another evaluation with the same config is already running (lock contention) |
| `RuntimeError` | Cache version mismatch, config hash collision, or dataset fingerprint mismatch |

All other exceptions are shared with [Test Mode](./test-mode.md#exceptions).

Expand All @@ -154,4 +304,5 @@ All other exceptions are shared with [Test Mode](./test-mode.md#exceptions).

- [Test Mode](./test-mode.md) -- Test agent logic with external LLMs
- [Dataset Format](./datasets.md) -- Supported formats and required columns
- [Configuration](./configuration.md) -- Environment variables including cache settings
- [LiteLLM Providers](https://docs.litellm.ai/docs/providers) -- Supported LLM providers
64 changes: 64 additions & 0 deletions docs/troubleshooting.md
Expand Up @@ -195,6 +195,70 @@ object with at least a `name`.
authentication failure, budget exhausted). The batch aborts early instead of
retrying each row. Fix the underlying credential or connectivity issue and re-run.

## Eval Cache Errors

### Lock contention

```
TimeoutError: Another eval with the same config is already running.
```

This means another `osmosis eval` process with the same configuration is already
holding the cache lock. Wait for it to finish, or increase the timeout:

```bash
export OSMOSIS_EVAL_LOCK_TIMEOUT=120 # seconds (default: 30)
```
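Reading and validating this variable could look like the sketch below. It assumes the documented default of 30 seconds and the "positive integer" rule; `parse_lock_timeout` is a hypothetical helper, and raising `ValueError` on bad input is an assumption about error handling.

```python
from typing import Mapping


def parse_lock_timeout(env: Mapping[str, str], default: int = 30) -> int:
    """Return the lock timeout in seconds from OSMOSIS_EVAL_LOCK_TIMEOUT."""
    raw = env.get("OSMOSIS_EVAL_LOCK_TIMEOUT")
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"OSMOSIS_EVAL_LOCK_TIMEOUT must be an integer, got {raw!r}"
        ) from None
    if value <= 0:
        raise ValueError("OSMOSIS_EVAL_LOCK_TIMEOUT must be a positive integer")
    return value
```

In practice the function would be called with `os.environ`; taking a mapping keeps it testable.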

If the other process crashed, its lock on the `.lock` file is released
automatically when that process exits. If a stale lock persists anyway,
you can delete the lock file manually:

```bash
# Find the cache directory
osmosis eval cache dir
# Delete the stale lock file
rm ~/.cache/osmosis/eval/{model}/{dataset}/{task_id}.lock
```

### Dataset changed during evaluation

```
Error: Dataset was modified during evaluation. Results may be inconsistent. Use --fresh to restart.
```

The dataset file was modified while the evaluation was in progress. Re-run with
`--fresh` to start from scratch with the current dataset.

### Dataset changed since cached evaluation

```
Warning: Dataset file has changed since this eval completed.
```

A completed evaluation was loaded from cache, but the dataset file has been
modified since that evaluation finished. The displayed results reflect the
original dataset. Use `--fresh` to re-evaluate with the current dataset.

### Cache version mismatch

```
RuntimeError: Cache file created by a newer version of osmosis (vN).
```

The cache file was created by a newer version of the SDK. Upgrade osmosis or
use `--fresh` to discard the incompatible cache and start a new evaluation.

### Config hash collision

```
RuntimeError: Cache file belongs to a different eval configuration
```

Extremely rare: two different configurations produced the same short task ID.
Re-run with a slightly different parameter (e.g., add `--temperature 0.7`)
or manually delete the conflicting cache file.

## Remote Rollout Errors

### AgentLoopValidationError
1 change: 1 addition & 0 deletions osmosis_ai/auth/local_server.py
Expand Up @@ -208,6 +208,7 @@ def find_available_port() -> int | None:
for port in range(LOCAL_SERVER_PORT_START, LOCAL_SERVER_PORT_END + 1):
try:
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("localhost", port))
return port
except OSError: