overhaul saving outputs by mikasenghaas · Pull Request #774 · PrimeIntellect-ai/verifiers

mikasenghaas · 2026-01-23T15:46:10Z

Description

This PR overhauls the utilities and types around saving output results. It tries to solve the following two common pain points:

Previously, we first built a Dataset, which often ran into (avoidable) PyArrow type issues: the type inference looks at the first row to determine a column's type, and saving fails when a later row differs (quite often the case for e.g. info).
If any exception occurred during saving, the entire save failed, losing a lot of valuable data.

We refactor GenerateOutputs to simply contain a list of serialized states, called RolloutOutput, and overhaul the saving and eval utilities to reflect the new format and make the logic more robust:

GenerateOutputs is now in row-order and contains all (finished) states
Add a new default serializer which gracefully handles types that are not JSON-serializable by default (e.g. Pydantic models, date, time, paths, exceptions, etc.) + add tests for it
We save results.jsonl line by line and catch exceptions, so only the rollouts that raise errors are skipped
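As a rough illustration of the serializer idea above, here is a minimal sketch of a `default=` hook for `json.dumps`. The name `to_jsonable` and the exact fallback rules are assumptions made for this example; this is not the serializer added in this PR.

```python
import json
from datetime import date, datetime, time
from pathlib import PurePath, PurePosixPath


def to_jsonable(value):
    """Hypothetical json.dumps fallback: convert common non-JSON types."""
    # Pydantic v2 models expose model_dump(); duck-typed so pydantic stays optional.
    if callable(getattr(value, "model_dump", None)):
        return value.model_dump()
    # Dates and times become ISO 8601 strings.
    if isinstance(value, (datetime, date, time)):
        return value.isoformat()
    # Paths become plain strings.
    if isinstance(value, PurePath):
        return str(value)
    # Anything else (exceptions, clients, ...) is stored as its repr,
    # so serialization never raises for an unknown type.
    return repr(value)


row = {
    "when": date(2026, 1, 23),
    "path": PurePosixPath("outputs") / "results.jsonl",
    "error": ValueError("boom"),
}
line = json.dumps(row, default=to_jsonable)
```

`json.dumps` only calls the hook for objects it cannot encode itself, and it re-encodes whatever the hook returns, so nested values inside a converted object go through the same fallback.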

class RolloutOutput(dict):
    """Serialized output from a rollout (mirrors RolloutInput).

    A dict subclass that allows typed access to known fields while supporting
    arbitrary additional fields from state_columns. All values must be
    JSON-serializable.

    Required fields: example_id, task, prompt, completion, reward, timing,
                     is_completed, is_truncated, metrics
    Optional fields: answer, info, error, stop_condition, trajectory, oai_tools
    Additional fields: arbitrary serializable state_columns
    """

    # Required fields
    example_id: int
    task: str
    prompt: Messages | None
    completion: Messages | None
    reward: float
    timing: RolloutTiming
    is_completed: bool
    is_truncated: bool
    metrics: dict[str, float]
    # Optional fields
    answer: str
    info: Info
    error: str | None
    stop_condition: str | None
    trajectory: list["TrajectoryStep"]
    oai_tools: list["ChatCompletionToolParam"]

class GenerateOutputs(TypedDict):
    states: list[RolloutOutput]
    metadata: GenerateMetadata

We save exactly the same keys in metadata.json and each row in results.jsonl, making this PR fully backwards compatible.
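To make the row-order layout concrete, the sketch below converts a small column-oriented table into a list of per-rollout dicts and reads from it. The sample data is invented and only uses field names from RolloutOutput above; it does not call any verifiers API.

```python
from statistics import mean

# Column-oriented layout: one list per field (illustrative data only).
columns = {
    "example_id": [0, 1, 2],
    "reward": [1.0, 0.0, 1.0],
    "error": [None, "TimeoutError()", None],
}

# Row-oriented layout: one dict per rollout, as in a list of RolloutOutput.
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]

# Consumers iterate over rows instead of indexing parallel arrays.
avg_reward = mean(row["reward"] for row in rows)
failed_ids = [row["example_id"] for row in rows if row["error"] is not None]
```

With rows, each rollout carries its own fields, so a rollout with a differently shaped info value does not affect any other row.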

Misc Changes

Add fixtures make_input, make_state, make_output, make_metadata to reduce redundancy in tests
Deprecate ProcessedOutputs (unused)

Example

The metadata and results have the exact same format as before

uv run vf-eval gsm8k -n5 -r3 -s

We can now safely save the full trajectory (including the raw OAI responses)

uv run vf-eval gsm8k -n5 -r3 -s -C trajectory

And even save non-serializable objects like clients which previously crashed runs

uv run vf-eval gsm8k -n5 -r3 -s -C client

Another eval run that crashed before on a PyArrow issue (and motivated this PR) but works now is tau2-bench

uv run vf-eval tau2-bench -n -1 -r 2 -s -R

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

Note

Introduces a new results format and robust save pipeline centered on RolloutOutput.

Replace GenerateOutputs fields with outputs: list[RolloutOutput]; add typed RolloutOutput and update docs
New save utilities (GenerateOutputsBuilder, make_serializable, save_generate_outputs) for incremental, line-by-line JSONL saving with graceful handling of non-JSON types and partial failures
Refactor Environment.generate() to build/sort outputs incrementally, update callbacks to use serialized outputs
Update adapters and utilities (GEPA adapter, RL orchestrator, eval display, dataset builders) to consume outputs instead of per-field arrays
Add fixtures (make_input, make_state, make_output, make_metadata) and rewrite tests to the new schema; remove deprecated ProcessedOutputs
Extend metadata with tools; improve dataset/save helpers to read from serialized outputs
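The line-by-line saving with partial-failure handling summarized above can be sketched as follows. `save_jsonl` is a hypothetical helper written for this example; it is not the `save_generate_outputs` function from this PR, whose signature is not shown here.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def save_jsonl(rows, path):
    """Write one JSON object per line, skipping rows that fail to serialize.

    Returns the number of rows written.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for index, row in enumerate(rows):
            try:
                # Serialize first so a failing row never leaves a partial line.
                line = json.dumps(row)
            except (TypeError, ValueError) as exc:
                logger.warning("Skipping row %d: %r", index, exc)
                continue
            f.write(line + "\n")
            written += 1
    return written


rows = [
    {"example_id": 0, "reward": 1.0, "info": {"difficulty": "easy"}},
    {"example_id": 1, "reward": 0.0, "info": {"bad": {1, 2}}},  # a set is not JSON-serializable
    {"example_id": 2, "reward": 0.5, "info": {"tags": ["a", "b"]}},  # a differently shaped info is fine
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "results.jsonl"
    n_written = save_jsonl(rows, out_path)
    saved = [json.loads(s) for s in out_path.read_text(encoding="utf-8").splitlines()]
```

Because every row is encoded independently, there is no shared schema to infer, and one bad row costs only that row rather than the whole file.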

Written by Cursor Bugbot for commit 7db388b.

mikasenghaas · 2026-01-23T15:47:13Z

@cursoragent review

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

mikasenghaas added 6 commits January 25, 2026 12:13

do not use dataset for local saving

c8515d0

correctly convert generate outputs

dc6e80e

rewrite generate outputs to row-order

ff76495

only save once

d2352c5

repr errors in saved outputs

b6fe5ef

move to save_utils

d089d2c

mikasenghaas force-pushed the overhaul-results-saving branch from b9f35f5 to d089d2c on January 25, 2026 12:16

mikasenghaas added 21 commits January 25, 2026 12:23

try..except around saving utils

51f3adc

flatten metrics

c5da446

pop answer and info if not present

86201b2

some more comments

713acb9

deprecate ProcessedOutputs

005b15f

just use state instead of rollout output

151b1e0

make generic serialization util

481f346

simplify

a09bd0c

opus tests

07a1dfa

add tests

128edd7

fix tests

952454f

add fixtures for input, metadata and state

1f387c3

use conftests across whole test suite

115e3a4

fix ty

dccdd0c

remove debug

9a51331

mini fix

fc0bab6

fully backwards compatible

7f195f4

undo shallow copy

c3c6503

revert unnecessary changes

d359f98

remove test

bf21b63

remove debug scripts

9a1f2d2

mikasenghaas requested a review from willccbb January 26, 2026 15:23

mikasenghaas marked this pull request as ready for review January 26, 2026 15:23

cursor bot reviewed Jan 26, 2026

verifiers/utils/eval_utils.py (resolved)

verifiers/utils/save_utils.py (outdated, resolved)

verifiers/gepa/adapter.py (outdated, resolved)

mikasenghaas added 4 commits January 26, 2026 15:35

retain null if no error

64d7e0c

handle empty list

850b9d7

bring back quiet_datasets()

72f7ded

fix index error

8404b05

mikasenghaas requested a review from rasdani January 26, 2026 15:52

cursor bot reviewed Jan 26, 2026

verifiers/rl/trainer/orchestrator.py (outdated, resolved)

docs/reference.md (resolved)

verifiers/utils/eval_utils.py (resolved)

mikasenghaas added 2 commits January 26, 2026 16:23

fix reward,prompt,completion and metrics collection

096735a

update docs

0fcd330

cursor bot reviewed Jan 26, 2026

tests/test_singleturn_env.py (outdated, resolved)

mikasenghaas mentioned this pull request Jan 26, 2026

anthropic client + interleaved thinking support #788

Closed

24 tasks

mikasenghaas and others added 3 commits January 27, 2026 10:26

fix test regression

72506ed

Outputs builder (#792)

a822276

* serializable output type; generateoutputs builder
* serializable output type; generateoutputs builder
* bugbot fixes
* bugbot fixes
* bugbot fixes
* bugbot fixes

Merge branch 'main' into overhaul-results-saving

7db388b

cursor bot reviewed Jan 28, 2026

tests/conftest.py (resolved)

verifiers/utils/save_utils.py (resolved)

willccbb approved these changes Jan 28, 2026

willccbb merged commit ae76c0a into main Jan 28, 2026
6 checks passed

willccbb merged 36 commits into main from overhaul-results-saving
