overhaul saving outputs #774

Merged
willccbb merged 36 commits into main from overhaul-results-saving
Jan 28, 2026

@mikasenghaas (Member) commented Jan 23, 2026

Description

This PR overhauls utilities and types around saving output results. It tries to solve the following two common pain points:

  • Previously, we first built a Dataset, which often ran into (avoidable) PyArrow type issues: the type inference system looked at the first row to determine a column's type, so saving failed when later rows differed (quite often the case for e.g. info)
  • If any exception occurred during saving, the entire save failed, losing a lot of valuable data

We refactor GenerateOutputs to simply contain a list of serialized states, called RolloutOutput, and overhaul the saving and eval utilities to reflect the new format and make the logic more robust:

  • GenerateOutputs is now in row order and contains all (finished) states
  • Add a new default serializer which gracefully handles types that are not JSON-serializable by default (e.g. Pydantic models, dates, times, paths, exceptions) and add tests for it
  • We save results.jsonl line by line and catch exceptions, so that only the rollouts that raise an exception while being saved are skipped

class RolloutOutput(dict):
    """Serialized output from a rollout (mirrors RolloutInput).

    A dict subclass that allows typed access to known fields while supporting
    arbitrary additional fields from state_columns. All values must be
    JSON-serializable.

    Required fields: example_id, task, prompt, completion, reward, timing,
                     is_completed, is_truncated, metrics
    Optional fields: answer, info, error, stop_condition, trajectory, oai_tools
    Additional fields: arbitrary serializable state_columns
    """

    # Required fields
    example_id: int
    task: str
    prompt: Messages | None
    completion: Messages | None
    reward: float
    timing: RolloutTiming
    is_completed: bool
    is_truncated: bool
    metrics: dict[str, float]
    # Optional fields
    answer: str
    info: Info
    error: str | None
    stop_condition: str | None
    trajectory: list["TrajectoryStep"]
    oai_tools: list["ChatCompletionToolParam"]

class GenerateOutputs(TypedDict):
    states: list[RolloutOutput]
    metadata: GenerateMetadata

We save exactly the same keys in metadata.json and each row in results.jsonl, making this PR fully backwards compatible.
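
As a rough illustration of the two ideas above (a fallback serializer plus line-by-line saving with per-row error handling), here is a minimal sketch using only the standard library. The names `to_jsonable` and `save_jsonl` are hypothetical and are not the helpers added in this PR; the Pydantic check is duck-typed on `model_dump` so the sketch runs without Pydantic installed.

```python
import json
from datetime import date, time
from pathlib import Path


def to_jsonable(value):
    """Fallback for json.dumps (hypothetical helper, not the PR's serializer)."""
    if hasattr(value, "model_dump"):  # duck-typed Pydantic v2 model
        return value.model_dump()
    if isinstance(value, (date, time)):  # datetime is a subclass of date
        return value.isoformat()
    if isinstance(value, Path):
        return str(value)
    if isinstance(value, BaseException):
        return f"{type(value).__name__}: {value}"
    if isinstance(value, (set, frozenset)):
        return list(value)
    return repr(value)  # last resort, e.g. a client object


def save_jsonl(rows, path):
    """Write one JSON object per line; return the indices of rows that failed."""
    failed = []
    with open(path, "w", encoding="utf-8") as f:
        for i, row in enumerate(rows):
            try:
                # Serialize fully before writing so a failure never leaves a partial line.
                line = json.dumps(row, default=to_jsonable)
            except (TypeError, ValueError):
                failed.append(i)
                continue
            f.write(line + "\n")
    return failed
```

Because each row is serialized on its own, a single bad row (for instance one with a non-string dict key, which `default` cannot repair) only costs that row rather than the whole file.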

Misc Changes

  • Add fixtures make_input, make_state, make_output, make_metadata to reduce redundancy in tests
  • Deprecate ProcessedOutputs (unused)

Example

The metadata and results have the exact same format as before

uv run vf-eval gsm8k -n5 -r3 -s

We can now safely save the full trajectory (including the raw OAI responses)

uv run vf-eval gsm8k -n5 -r3 -s -C trajectory

And even save non-serializable objects like clients which previously crashed runs

uv run vf-eval gsm8k -n5 -r3 -s -C client

Another eval run that crashed before on a PyArrow issue (and motivated this PR) but works now is tau2-bench:

uv run vf-eval tau2-bench -n -1 -r 2 -s -R
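
To see why a line-oriented format sidesteps the schema problem, consider reading results.jsonl back: each line is parsed independently, so a field like info can have a different shape in every row. The loader below is a minimal sketch with a hypothetical name (`load_results`), not a utility from this PR, and it does not reproduce the PyArrow failure itself.

```python
import json


def load_results(path):
    """Read a JSONL file into a list of dicts, one per non-empty line."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                rows.append(json.loads(line))
    return rows
```

No schema is inferred from the first row, so later rows with extra keys or differently typed values load without error.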

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Introduces a new results format and robust save pipeline centered on RolloutOutput.

  • Replace GenerateOutputs fields with outputs: list[RolloutOutput]; add typed RolloutOutput and update docs
  • New save utilities (GenerateOutputsBuilder, make_serializable, save_generate_outputs) for incremental, line-by-line JSONL saving with graceful handling of non-JSON types and partial failures
  • Refactor Environment.generate() to build/sort outputs incrementally, update callbacks to use serialized outputs
  • Update adapters and utilities (GEPA adapter, RL orchestrator, eval display, dataset builders) to consume outputs instead of per-field arrays
  • Add fixtures (make_input, make_state, make_output, make_metadata) and rewrite tests to the new schema; remove deprecated ProcessedOutputs
  • Extend metadata with tools; improve dataset/save helpers to read from serialized outputs

Written by Cursor Bugbot for commit 7db388b.
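
The "build/sort outputs incrementally" idea from the summary above can be pictured roughly as follows. `OutputCollector` is a hypothetical stand-in written for illustration; it is not the GenerateOutputsBuilder from this PR, whose interface is not shown here. It assumes each output is a dict with an integer example_id.

```python
class OutputCollector:
    """Hypothetical stand-in for an incremental outputs builder."""

    def __init__(self):
        self._outputs = []

    def add(self, output):
        # Called whenever a rollout finishes, in any order.
        self._outputs.append(output)

    def build(self):
        # sorted() is stable: rollouts sharing an example_id keep arrival order.
        return sorted(self._outputs, key=lambda o: o["example_id"])
```

Collecting as rollouts complete and sorting once at the end keeps the final list in row order even when rollouts finish out of order.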

@mikasenghaas (Member, Author) commented:

@cursoragent review

@cursor

This comment was marked as outdated.

@mikasenghaas mikasenghaas force-pushed the overhaul-results-saving branch from b9f35f5 to d089d2c Compare January 25, 2026 12:16
@mikasenghaas mikasenghaas requested a review from willccbb January 26, 2026 15:23
@mikasenghaas mikasenghaas marked this pull request as ready for review January 26, 2026 15:23
@mikasenghaas mikasenghaas requested a review from rasdani January 26, 2026 15:52
mikasenghaas and others added 3 commits January 27, 2026 10:26
* serializable output type; generateoutputs builder

* serializable output type; generateoutputs builder

* bugbot fixes

* bugbot fixes

* bugbot fixes

* bugbot fixes
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@willccbb willccbb merged commit ae76c0a into main Jan 28, 2026
6 checks passed