* serializable output type; generateoutputs builder
* serializable output type; generateoutputs builder
* bugbot fixes
* bugbot fixes
* bugbot fixes
* bugbot fixes
Cursor Bugbot has reviewed your changes and found 2 potential issues.
willccbb
approved these changes
Jan 28, 2026
Description
This PR overhauls the utilities and types around saving output results. It tries to solve the following two common pain points:
* `Dataset`, which often runs into (avoidable) PyArrow type issues: type inference looks at the first row to determine a type, so saving fails when a later row differs (quite often the case for e.g. `info`)

We refactor `GenerateOutputs` to simply contain a list of serialized states, called `RolloutOutput`, and overhaul the saving and eval utilities to reflect the new format and make the logic more robust:
* `GenerateOutputs` is now in row order and contains all (finished) states
* `results.jsonl` is written line-by-line, and exceptions are caught so that only rollouts with errors are not saved

We save exactly the same keys in `metadata.json` and in each row of `results.jsonl`, making this PR fully backwards compatible.

Misc Changes
* Add `make_input`, `make_state`, `make_output`, `make_metadata` to reduce redundancy in tests
* Remove `ProcessedOutputs` (unused)

Example
The metadata and results have the exact same format as before
We can now safely save the full trajectory (including the raw OAI responses)
And even save non-serializable objects like clients which previously crashed runs
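To illustrate how line-by-line saving can tolerate non-serializable values, here is a minimal sketch using only the standard library. The names `to_jsonable`, `save_jsonl`, and `FakeClient` are hypothetical and are not this PR's `make_serializable` or `save_generate_outputs`; falling back to `repr()` is an assumption about one reasonable strategy, not necessarily what the PR does.

```python
import json
from pathlib import Path
from typing import Any


def to_jsonable(value: Any) -> Any:
    """Hypothetical fallback for json.dumps: stringify anything JSON cannot encode."""
    return repr(value)


def save_jsonl(rows: list[dict[str, Any]], path: Path) -> int:
    """Write one JSON object per line; skip rows that still fail. Returns rows written."""
    written = 0
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            try:
                line = json.dumps(row, default=to_jsonable)
            except (TypeError, ValueError) as exc:
                # Only the offending row is dropped; the others are still saved.
                print(f"skipping row: {exc}")
                continue
            f.write(line + "\n")
            written += 1
    return written


class FakeClient:
    """Stand-in for a non-serializable object such as an API client."""


# Rows may differ in the type of `info`; JSONL has no shared column schema to violate.
rows = [
    {"example_id": 0, "info": {"k": 1}, "reward": 1.0},
    {"example_id": 1, "info": "a string this time", "client": FakeClient()},
]
```

Calling `save_jsonl(rows, Path("results.jsonl"))` would write both rows, with the client stored as its `repr()` string. Because each row is encoded independently, a failure in one row does not affect the rest.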
Another example of an eval run that was crashing before on a PyArrow issue (and motivated this PR) but is working now is `tau2-bench`.
Type of Change
Testing
`uv run pytest` locally.
Checklist
Additional Notes
Note
Introduces a new results format and robust save pipeline centered on `RolloutOutput`.
* Replaces `GenerateOutputs` fields with `outputs: list[RolloutOutput]`; adds typed `RolloutOutput` and updates docs
* Adds utilities (`GenerateOutputsBuilder`, `make_serializable`, `save_generate_outputs`) for incremental, line-by-line JSONL saving with graceful handling of non-JSON types and partial failures
* Updates `Environment.generate()` to build/sort outputs incrementally; updates callbacks to use serialized outputs
* Uses `outputs` instead of per-field arrays
* Adds test factories (`make_input`, `make_state`, `make_output`, `make_metadata`) and rewrites tests to the new schema; removes deprecated `ProcessedOutputs` tools; improves dataset/save helpers to read from serialized outputs

Written by Cursor Bugbot for commit 7db388b.
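The incremental build-and-sort step described in the note could look roughly like the sketch below. `OutputsBuilder`, its `add` and `build` methods, and the `example_id` sort key are hypothetical names chosen for illustration; they are assumptions, not the actual `GenerateOutputsBuilder` API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OutputsBuilder:
    """Hypothetical collector: rollouts finish in any order, the result is in row order."""

    _outputs: list[dict[str, Any]] = field(default_factory=list)

    def add(self, output: dict[str, Any]) -> None:
        # Called once per finished rollout, in completion order.
        self._outputs.append(output)

    def build(self) -> dict[str, Any]:
        # Assumption: each output carries an integer `example_id` giving its row position.
        ordered = sorted(self._outputs, key=lambda o: o["example_id"])
        return {"outputs": ordered}


builder = OutputsBuilder()
for example_id in (2, 0, 1):  # simulate out-of-order completion
    builder.add({"example_id": example_id, "reward": float(example_id)})
result = builder.build()
```

Keeping a single `outputs` list, rather than one array per field, means each rollout stays a self-contained record that can be sorted, filtered, or written out on its own.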