Conversation

@finitearth (Owner) commented Jul 18, 2025

  • Implements new tasks: RewardTask (accepts a reward function mapping a prediction to a score) and JudgeTask (uses an LLM to score responses; optionally also accepts ground-truth labels, allowing for "fuzzy matches").
  • Core functionality of the classification task has been moved to the base task to prevent code duplication across other tasks.
  • CAPO now accepts the input parameter "check_fs_accuracy" (default True) - for reward tasks the accuracy cannot be evaluated, so the prediction of the downstream_llm is used as the few-shot target.
  • CAPO also accepts "create_fs_reasoning" (default True): if set to False, only the input-output pairs from df_few_shots are used.
  • Introduces a tag-extraction function to centralize repeated code for extractions like "<final_answer>5</final_answer>" (see the sketch after this list).
  • Increased test coverage.
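
To make the new pieces concrete, here is a minimal sketch of what a tag-extraction helper and a reward function could look like. The names (`extract_tag`, `reward_fn`) and signatures are hypothetical illustrations, not necessarily the actual promptolution API:

```python
import re
from typing import Optional


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the content of the first <tag>...</tag> block, or None if the tag is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# extract_tag("The result is <final_answer>5</final_answer>", "final_answer") -> "5"


# A reward function for a RewardTask maps a prediction to a score, for example:
def reward_fn(prediction: str) -> float:
    answer = extract_tag(prediction, "final_answer")
    return 1.0 if answer is not None and answer.isdigit() else 0.0
```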

github-actions bot commented Jul 18, 2025

Coverage

| Tests | Skipped | Failures | Errors | Time |
|-------|---------|----------|--------|------|
| 84 | 0 💤 | 0 ❌ | 0 🔥 | 0.902s ⏱️ |

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements new task types for reward-based and LLM-as-judge evaluation, refactors the task architecture to reduce code duplication, and introduces several utility functions to improve functionality and test coverage.

  • Implements RewardTask (accepts reward functions for prediction scoring) and JudgeTask (uses LLM to score responses with optional ground truth)
  • Refactors core evaluation functionality from ClassificationTask to BaseTask to enable code reuse across different task types
  • Adds utility functions for tag extraction and improves CAPO to handle scenarios where accuracy cannot be evaluated

Reviewed Changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 5 comments.

Summary per file:

| File | Description |
|------|-------------|
| promptolution/tasks/base_task.py | Major refactor moving evaluation logic from ClassificationTask to enable inheritance by new task types |
| promptolution/tasks/reward_tasks.py | New RewardTask implementation for scoring predictions with custom reward functions |
| promptolution/tasks/judge_tasks.py | New JudgeTask implementation for LLM-based evaluation with optional ground truth |
| promptolution/utils/formatting.py | New utility module for tag extraction functionality |
| promptolution/optimizers/capo.py | Added check_fs_accuracy parameter to handle reward tasks without ground truth |
| tests/ | Comprehensive test coverage for new functionality and updated existing tests |

@finitearth finitearth marked this pull request as ready for review July 21, 2025 14:18
@finitearth finitearth requested a review from mo374z as a code owner July 21, 2025 14:18
@finitearth (Owner, Author) commented:
tests are red right now, fix is in next PR

@finitearth finitearth requested a review from timo282 July 22, 2025 13:58
df: pd.DataFrame,
config: "ExperimentConfig",
task_type: Literal["classification", "reward", "judge"] = None,
judge_llm: "BaseLLM" = None,
Collaborator commented:

This is a rather general remark / question:
If an argument can be None (or is so by default), I (and many others) usually do

judge_llm: Optional["BaseLLM"] = None
# or
judge_llm: "BaseLLM" | None = None

(since None is not a valid BaseLLM)

Owner (Author) replied:

Great catch! I will leave it here as is, because it will be fixed with the mypy pull request.

test_statistic (TestStatistics): Statistical test to compare prompt performance. Default is "paired_t_test".
alpha (float): Significance level for the statistical test.
length_penalty (float): Penalty factor for prompt length.
check_fs_accuracy (bool): Whether to check the accuracy of few-shot examples before appending them to the prompt.
Collaborator commented:

What does "accuracy of few-shot examples" mean? Where is this check implemented?

Owner (Author) replied:

In the original implementation of CAPO we added a check to make sure that few-shot examples containing reasoning generated by the downstream LLM come with a correct prediction. However, there is no "correctness" (= accuracy) when we talk about rewards.
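
For context, the idea behind the check as described in this thread (an illustrative sketch only; the names and structure are assumptions, not the actual CAPO code, and `extract_tag` refers to the hypothetical helper sketched in the PR description above): with ground truth available, few-shot examples whose downstream-LLM prediction does not match the label are dropped; with `check_fs_accuracy=False` (e.g. reward tasks), the downstream prediction itself is taken as the few-shot target.

```python
# Illustrative sketch, not the promptolution implementation.
def build_few_shots(examples, downstream_predict, check_fs_accuracy=True):
    """examples: list of dicts with an "input" key and (optionally) a "target" key."""
    few_shots = []
    for example in examples:
        # Reasoning plus answer generated by the downstream LLM.
        prediction = downstream_predict(example["input"])
        if check_fs_accuracy:
            # Keep the example only if the generated answer matches the ground truth.
            if extract_tag(prediction, "final_answer") == example["target"]:
                few_shots.append((example["input"], prediction))
        else:
            # No usable ground truth (e.g. reward tasks): treat the prediction itself as the target.
            few_shots.append((example["input"], prediction))
    return few_shots
```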

@timo282 (Collaborator) commented Aug 31, 2025

> tests are red right now, fix is in next PR

Is this fixed in #54?

alpha (float): Significance level for the statistical test.
length_penalty (float): Penalty factor for prompt length.
check_fs_accuracy (bool): Whether to check the accuracy of few-shot examples before appending them to the prompt.
In cases such as reward tasks, this can be set to False, as no ground truth is available. Default is True.
Collaborator commented:

Should we really let the user decide this? Can't we just skip the check if no ground truth is available and otherwise always do it? Is there a reasonable case where I have a ground truth and would want to set this to False?

Owner (Author) replied:

I also had that thought. However, the problem is that, for example in the case of LLM-as-a-judge, a ground truth exists, but there is no need for the prediction to exactly match it.
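
As a rough illustration of that point (a hypothetical judge prompt and scoring function, not the actual JudgeTask implementation; the judge is assumed to be a plain callable mapping a prompt string to a response string, and `extract_tag` is the helper sketched earlier): the ground truth is shown to the judge LLM as a reference, and the score reflects agreement with it rather than an exact string match.

```python
JUDGE_PROMPT = (
    "You are a strict grader. Compare the model answer to the reference answer and rate "
    "their agreement on a scale from 0 (completely wrong) to 10 (fully equivalent).\n"
    "Reference: {reference}\nModel answer: {prediction}\n"
    "Respond with <score>N</score>."
)


def judge_score(judge_llm, prediction: str, reference: str) -> float:
    """Score a prediction against a reference using a judge LLM (a callable: prompt -> response)."""
    response = judge_llm(JUDGE_PROMPT.format(reference=reference, prediction=prediction))
    score = extract_tag(response, "score")  # hypothetical helper sketched earlier
    try:
        return float(score) / 10.0
    except (TypeError, ValueError):
        return 0.0
```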

@timo282 timo282 self-requested a review September 3, 2025 12:52
@finitearth finitearth merged commit ab458fc into main Sep 3, 2025
5 checks passed
@finitearth finitearth deleted the feature/RewardTask branch September 3, 2025 13:08