Feature/reward task #53
Conversation
Pull Request Overview
This PR implements new task types for reward-based and LLM-as-judge evaluation, refactors the task architecture to reduce code duplication, and introduces several utility functions to improve functionality and test coverage.
- Implements RewardTask (accepts reward functions for prediction scoring) and JudgeTask (uses LLM to score responses with optional ground truth)
- Refactors core evaluation functionality from ClassificationTask to BaseTask to enable code reuse across different task types
- Adds utility functions for tag extraction and improves CAPO to handle scenarios where accuracy cannot be evaluated
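To make the reward-based flow concrete, here is a minimal, self-contained sketch. The class and function names below are illustrative only and are not taken from the promptolution codebase:

```python
# Illustrative sketch only -- not the actual promptolution API.
# Shows the core idea of a reward task: predictions are scored by a
# user-supplied reward function, so no ground-truth labels are needed.
from typing import Callable, List


def brevity_reward(prediction: str) -> float:
    # Toy reward: non-empty answers score higher the shorter they are.
    return 1.0 / (1 + len(prediction)) if prediction.strip() else 0.0


class RewardTaskSketch:
    def __init__(self, inputs: List[str], reward_function: Callable[[str], float]):
        self.inputs = inputs
        self.reward_function = reward_function

    def evaluate(self, predictions: List[str]) -> float:
        # Average reward over all predictions produced for this prompt.
        scores = [self.reward_function(p) for p in predictions]
        return sum(scores) / len(scores)


task = RewardTaskSketch(inputs=["Summarize: ..."], reward_function=brevity_reward)
print(task.evaluate(["A short summary."]))
```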
Reviewed Changes
Copilot reviewed 38 out of 39 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| promptolution/tasks/base_task.py | Major refactor moving evaluation logic from ClassificationTask to enable inheritance by new task types |
| promptolution/tasks/reward_tasks.py | New RewardTask implementation for scoring predictions with custom reward functions |
| promptolution/tasks/judge_tasks.py | New JudgeTask implementation for LLM-based evaluation with optional ground truth |
| promptolution/utils/formatting.py | New utility module for tag extraction functionality (see the sketch after this table) |
| promptolution/optimizers/capo.py | Added check_fs_accuracy parameter to handle reward tasks without ground truth |
| tests/ | Comprehensive test coverage for new functionality and updated existing tests |
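The exact tag-extraction implementation in promptolution/utils/formatting.py is not shown in this summary; the following is a hedged sketch of what extracting content from an XML-style tag typically looks like, with the function name chosen purely for illustration:

```python
import re
from typing import Optional


def extract_tag_content(text: str, tag: str) -> Optional[str]:
    # Return the content of the first <tag>...</tag> block, or None if the tag is absent.
    match = re.search(rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


print(extract_tag_content("<final_answer>42</final_answer>", "final_answer"))  # 42
```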
Tests are red right now; the fix is in the next PR.
promptolution/helpers.py (outdated)
df: pd.DataFrame,
config: "ExperimentConfig",
task_type: Literal["classification", "reward", "judge"] = None,
judge_llm: "BaseLLM" = None,
This is a rather general remark / question: if an argument can be None (or is so by default), I (and many others) usually write

judge_llm: Optional["BaseLLM"] = None
# or
judge_llm: "BaseLLM" | None = None

since None is not a valid BaseLLM.
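A runnable illustration of the two suggested spellings; the signatures below are hypothetical stand-ins, not the actual helpers.py code. Note that the PEP 604 form with a quoted forward reference needs `from __future__ import annotations` (or the whole annotation quoted) to evaluate cleanly at runtime:

```python
from __future__ import annotations  # defers annotation evaluation, so forward refs and `|` both work

from typing import Optional


class BaseLLM:  # placeholder stand-in for the real class, for this sketch only
    ...


def run_with_optional(judge_llm: Optional[BaseLLM] = None) -> None:
    ...


def run_with_union(judge_llm: BaseLLM | None = None) -> None:
    ...
```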
Great catch! I will leave it as is here, because it will be fixed with the mypy pull request.
test_statistic (TestStatistics): Statistical test to compare prompt performance. Default is "paired_t_test".
alpha (float): Significance level for the statistical test.
length_penalty (float): Penalty factor for prompt length.
check_fs_accuracy (bool): Whether to check the accuracy of few-shot examples before appending them to the prompt.
What does "accuracy of few-shot examples" mean? Where is this check implemented?
In the original implementation of CAPO we added a check to make sure that few-shot examples containing reasoning generated by the downstream LLM also come with a correct prediction. However, there is no notion of "correctness" (= accuracy) when we talk about rewards.
Is this fixed in #54?
alpha (float): Significance level for the statistical test.
length_penalty (float): Penalty factor for prompt length.
check_fs_accuracy (bool): Whether to check the accuracy of few-shot examples before appending them to the prompt.
    In cases such as reward tasks, this can be set to False, as no ground truth is available. Default is True.
Should we really let the user decide this? Can't we just skip the check when no ground truth is available and otherwise always do it? Is there a reasonable case where I have a ground truth and would want to set this to False?
I had that thought as well. The problem is that, for example in the case of LLM-as-a-judge, a ground truth exists, but the prediction does not need to match it exactly.
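To make the discussed check more concrete, here is a minimal sketch of the idea; the function and argument names are hypothetical and not taken from capo.py. Few-shot candidates that contain LLM-generated reasoning are kept only if their final prediction matches the ground truth, and the check is skipped entirely when no ground truth is available, as in reward tasks:

```python
from typing import List, Optional, Tuple


def filter_few_shot_examples(
    candidates: List[Tuple[str, str]],   # (example_text_with_reasoning, final_prediction)
    ground_truths: Optional[List[str]],
    check_fs_accuracy: bool = True,
) -> List[str]:
    if not check_fs_accuracy or ground_truths is None:
        # Reward/judge setting: there is no notion of a "correct" prediction,
        # so every candidate example is eligible.
        return [text for text, _ in candidates]
    # Classification setting: keep only examples whose prediction matches the ground truth.
    return [
        text
        for (text, prediction), truth in zip(candidates, ground_truths)
        if prediction == truth
    ]
```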
* added mypy to pre-commit and improved typing
* go green
* reset notebook
* fix security vulnerabilities
…omptolution into feature/RewardTask
…omptolution into feature/RewardTask
…omptolution into feature/RewardTask