Feature/reward task #53
Conversation
Pull Request Overview
This PR implements new task types for reward-based and LLM-as-judge evaluation, refactors the task architecture to reduce code duplication, and introduces several utility functions to improve functionality and test coverage.
- Implements RewardTask (accepts reward functions for prediction scoring) and JudgeTask (uses LLM to score responses with optional ground truth)
- Refactors core evaluation functionality from ClassificationTask to BaseTask to enable code reuse across different task types
- Adds utility functions for tag extraction and improves CAPO to handle scenarios where accuracy cannot be evaluated
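To make the reward-based flow concrete, here is a minimal, self-contained sketch. The class and function names below are illustrative only and are not taken from the promptolution codebase:

```python
# Illustrative sketch only -- not the actual promptolution API.
# Shows the core idea of a reward task: predictions are scored by a
# user-supplied reward function, so no ground-truth labels are needed.
from typing import Callable, List


def brevity_reward(prediction: str) -> float:
    # Toy reward: non-empty answers score higher the shorter they are.
    return 1.0 / (1 + len(prediction)) if prediction.strip() else 0.0


class RewardTaskSketch:
    def __init__(self, inputs: List[str], reward_function: Callable[[str], float]):
        self.inputs = inputs
        self.reward_function = reward_function

    def evaluate(self, predictions: List[str]) -> float:
        # Average reward over all predictions produced for this prompt.
        scores = [self.reward_function(p) for p in predictions]
        return sum(scores) / len(scores)


task = RewardTaskSketch(inputs=["Summarize: ..."], reward_function=brevity_reward)
print(task.evaluate(["A short summary."]))
```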
Reviewed Changes
Copilot reviewed 38 out of 39 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| promptolution/tasks/base_task.py | Major refactor moving evaluation logic from ClassificationTask to enable inheritance by new task types |
| promptolution/tasks/reward_tasks.py | New RewardTask implementation for scoring predictions with custom reward functions |
| promptolution/tasks/judge_tasks.py | New JudgeTask implementation for LLM-based evaluation with optional ground truth |
| promptolution/utils/formatting.py | New utility module for tag extraction functionality (see the sketch after this table) |
| promptolution/optimizers/capo.py | Added check_fs_accuracy parameter to handle reward tasks without ground truth |
| tests/ | Comprehensive test coverage for new functionality and updated existing tests |
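The exact tag-extraction implementation in promptolution/utils/formatting.py is not shown in this summary; the following is a hedged sketch of what extracting content from an XML-style tag typically looks like, with the function name chosen purely for illustration:

```python
import re
from typing import Optional


def extract_tag_content(text: str, tag: str) -> Optional[str]:
    # Return the content of the first <tag>...</tag> block, or None if the tag is absent.
    match = re.search(rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


print(extract_tag_content("<final_answer>42</final_answer>", "final_answer"))  # 42
```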
Tests are red right now; the fix is in the next PR.
promptolution/helpers.py (outdated)
df: pd.DataFrame,
config: "ExperimentConfig",
task_type: Literal["classification", "reward", "judge"] = None,
judge_llm: "BaseLLM" = None,
This is a rather general remark / question: if an argument can be None (or is so by default), I (and many others) usually write

judge_llm: Optional["BaseLLM"] = None
# or
judge_llm: "BaseLLM" | None = None

since None is not a valid BaseLLM.
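A runnable illustration of the two suggested spellings; the signatures below are hypothetical stand-ins, not the actual helpers.py code. Note that the PEP 604 form with a quoted forward reference needs `from __future__ import annotations` (or the whole annotation quoted) to evaluate cleanly at runtime:

```python
from __future__ import annotations  # defers annotation evaluation, so forward refs and `|` both work

from typing import Optional


class BaseLLM:  # placeholder stand-in for the real class, for this sketch only
    ...


def run_with_optional(judge_llm: Optional[BaseLLM] = None) -> None:
    ...


def run_with_union(judge_llm: BaseLLM | None = None) -> None:
    ...
```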
Great catch! I will leave it as is here, because it will be fixed with the mypy pull request.
test_statistic (TestStatistics): Statistical test to compare prompt performance. Default is "paired_t_test".
alpha (float): Significance level for the statistical test.
length_penalty (float): Penalty factor for prompt length.
check_fs_accuracy (bool): Whether to check the accuracy of few-shot examples before appending them to the prompt.
What does "accuracy of few-shot examples" mean? Where is this check implemented?
In the original implementation of CAPO we added a check to make sure that few-shot examples containing reasoning generated by the downstream LLM also come with a correct prediction. However, there is no notion of "correctness" (= accuracy) when we talk about rewards.
Is this fixed in #54?
alpha (float): Significance level for the statistical test.
length_penalty (float): Penalty factor for prompt length.
check_fs_accuracy (bool): Whether to check the accuracy of few-shot examples before appending them to the prompt.
    In cases such as reward tasks, this can be set to False, as no ground truth is available. Default is True.
Should we really let the user decide this? Can't we just skip the check when no ground truth is available and otherwise always do it? Is there a reasonable case where I have a ground truth and would want to set this to False?
I had that thought as well. The problem is that, for example in the case of LLM-as-a-judge, a ground truth exists, but the prediction does not need to match it exactly.
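To make the discussed check more concrete, here is a minimal sketch of the idea; the function and argument names are hypothetical and not taken from capo.py. Few-shot candidates that contain LLM-generated reasoning are kept only if their final prediction matches the ground truth, and the check is skipped entirely when no ground truth is available, as in reward tasks:

```python
from typing import List, Optional, Tuple


def filter_few_shot_examples(
    candidates: List[Tuple[str, str]],   # (example_text_with_reasoning, final_prediction)
    ground_truths: Optional[List[str]],
    check_fs_accuracy: bool = True,
) -> List[str]:
    if not check_fs_accuracy or ground_truths is None:
        # Reward/judge setting: there is no notion of a "correct" prediction,
        # so every candidate example is eligible.
        return [text for text, _ in candidates]
    # Classification setting: keep only examples whose prediction matches the ground truth.
    return [
        text
        for (text, prediction), truth in zip(candidates, ground_truths)
        if prediction == truth
    ]
```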
* added mypy to pre-commit and improved typing
* go green
* reset notebook
* fix security vulnerabilities
…omptolution into feature/RewardTask
…omptolution into feature/RewardTask
…omptolution into feature/RewardTask