[Refactor] Update the example usage for the @osmosis_rubric#11
Merged
…smosis-git-sync-example into brian/reward_rubric
… module. Updated method names for clarity, streamlined extra info construction, and enhanced dataset loading logic.
The failing test is expected, because we refactored the
BaiqingL approved these changes on Oct 31, 2025.
This pull request modernizes the rubric scoring workflow and improves usability, maintainability, and dataset support for the support conversation evaluation system. The most important changes include a refactor of the rubric scorer script to use a schema-driven config and dataset loader, updates to the workflow and documentation to support new usage patterns, and the introduction of a sample dataset for batch evaluation.
Rubric evaluation system refactor:
- Refactored `reward_rubric.py` to use schema-driven config loading, dataset records, and a simplified entrypoint (`score_support_conversation`). The script now loads YAML config and JSONL data, supports batch evaluation, and handles provider/model selection via config or environment.
- Updated `reward_rubric_config.yaml` to use a versioned schema with a `rubrics[]` array, separating rubric details and supporting multiple rubrics and default values.

Dataset and example improvements:
- Updated `reward_rubric_example.json` to use a flat structure (`solution_str`, `original_input`, `ground_truth`) instead of a message array, matching the new dataset format.
- Added `sample_data.jsonl` as a JSONL dataset for batch rubric evaluation and CLI preview, with multiple conversation records.

Workflow and script updates:
- Updated the GitHub Actions workflow (`reward_rubric.yml`) to call the new shell script (`run_reward_rubric.sh`) and to trigger on changes to the example and scorer script, not just the config.
- Updated `run_reward_rubric.sh` to invoke the scorer as a Python module and accept CLI arguments for alternate data files.

Documentation enhancements:
- Updated `README.md` to explain the new config schema, script usage, dataset format, and CLI options for previewing and evaluating rubrics.
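As a rough sketch of how a versioned config with a `rubrics[]` array and shared defaults might be consumed: the `version`, `defaults`, and per-rubric field names below are illustrative assumptions, not taken from the actual `reward_rubric_config.yaml`.

```python
# Hypothetical sketch of consuming a versioned, rubrics[]-based config.
# Only the rubrics[] array is from the PR; other keys are assumptions.
SUPPORTED_SCHEMA_VERSION = 1

def load_rubrics(config: dict) -> list[dict]:
    """Validate the schema version and merge shared defaults into each rubric."""
    if config.get("version") != SUPPORTED_SCHEMA_VERSION:
        raise ValueError(f"unsupported config version: {config.get('version')!r}")
    defaults = config.get("defaults", {})
    # Per-rubric keys override the shared defaults.
    return [{**defaults, **rubric} for rubric in config["rubrics"]]

config = {
    "version": 1,
    "defaults": {"model": "gpt-4o-mini", "max_score": 1.0},  # assumed keys
    "rubrics": [
        {"name": "helpfulness", "prompt": "Was the reply helpful?"},
        {"name": "accuracy", "prompt": "Was the reply accurate?", "max_score": 2.0},
    ],
}
rubrics = load_rubrics(config)
```

A defaults-plus-override merge like this is one common way a `rubrics[]` schema keeps per-rubric entries short while still allowing each rubric to pin its own model or score range.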
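The flat record fields (`solution_str`, `original_input`, `ground_truth`) come from the PR; the loader itself is a minimal sketch of how a JSONL file such as `sample_data.jsonl` could be read for batch evaluation.

```python
import io
import json

def load_records(fp) -> list[dict]:
    """Sketch of a JSONL loader: one flat record per line, blank lines skipped."""
    records = []
    for line in fp:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # The flat fields the PR's example file uses.
        missing = {"solution_str", "original_input", "ground_truth"} - record.keys()
        if missing:
            raise ValueError(f"record missing fields: {sorted(missing)}")
        records.append(record)
    return records

# A file-like stand-in for a real sample_data.jsonl (contents invented).
sample = io.StringIO(
    '{"solution_str": "Try resetting the router.", '
    '"original_input": "My internet is down.", '
    '"ground_truth": "Suggest a router reset."}\n'
)
records = load_records(sample)
```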
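A module-style entrypoint that accepts an alternate data file might look like the following sketch; the flag names (`--data`, `--preview`) are assumptions for illustration, not the script's real CLI.

```python
import argparse
import json
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    # Flag names here are illustrative assumptions, not the real CLI surface.
    parser = argparse.ArgumentParser(prog="python -m reward_rubric")
    parser.add_argument("--data", type=Path, default=Path("sample_data.jsonl"),
                        help="alternate JSONL dataset to evaluate")
    parser.add_argument("--preview", action="store_true",
                        help="print records instead of scoring them")
    return parser

def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    lines = args.data.read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    for record in records:
        if args.preview:
            print(record["original_input"])
        # else: score the record, e.g. via score_support_conversation(record)
    return 0
```

Running the scorer with `python -m` (as `run_reward_rubric.sh` now does) keeps package-relative imports working, and routing the data path through a CLI argument is what lets the workflow swap in files other than the default dataset.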