feat: make reward functions to support parallel computation #398
Conversation
```python
def _single_thread_call(self, completions: List[Dict[str, str]], **kwargs) -> List[float]:
    results = []
    for idx, completion in enumerate(completions):
        # Prepare per-completion kwargs: list-valued kwargs hold one entry per
        # completion, so slice out this completion's entry; scalar kwargs are
        # shared across all completions.
        per_completion_kwargs = {}
        for key, value in kwargs.items():
            if isinstance(value, list):
                per_completion_kwargs[key] = value[idx]
            else:
                per_completion_kwargs[key] = value
        results.append(self.reward_on_single_completion(completion, **per_completion_kwargs))
    return results
```
The reason for implementing `_single_thread_call` is that some reward evaluations are not thread-safe. When `max_workers == 1`, we use `_single_thread_call` to sequentially compute the reward for each example. Currently, `math_verify.parse` is not thread-safe: huggingface/Math-Verify#22. This sequential execution mode ensures correct reward computation for non-thread-safe evaluators, even though it doesn't take advantage of parallel processing capabilities.
I'll update to multiprocessing instead of multithreading if you find this PR acceptable. (huggingface/Math-Verify#22 (comment))
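For context, a minimal sketch of how the base class might dispatch between the two modes; the `max_workers` attribute and the `ThreadPoolExecutor` fallback are assumptions about the design, not verbatim PR code:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

def __call__(self, completions: List[Dict[str, str]], **kwargs) -> List[float]:
    # Fall back to sequential execution for non-thread-safe evaluators.
    if self.max_workers == 1:
        return self._single_thread_call(completions, **kwargs)
    # Otherwise, score completions concurrently in a thread pool.
    with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
        futures = [
            pool.submit(
                self.reward_on_single_completion,
                completion,
                **{k: (v[idx] if isinstance(v, list) else v) for k, v in kwargs.items()},
            )
            for idx, completion in enumerate(completions)
        ]
        # Collect results in submission order so rewards stay aligned with inputs.
        return [f.result() for f in futures]
```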
Thanks for adding this! Can you add a test that compares single- vs multi-thread/process execution to ensure they both return the same results?
Sure.
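A sketch of such a test, assuming the PR's `BaseRewardFunction` is importable and accepts a `max_workers` argument (both assumptions; the `DummyReward` subclass is hypothetical, test-only code):

```python
from typing import Dict

# Assumed import path; adjust to wherever the PR defines the base class.
# from open_r1.rewards import BaseRewardFunction

class DummyReward(BaseRewardFunction):  # hypothetical test-only subclass
    def reward_on_single_completion(self, completion: Dict[str, str], **kwargs) -> float:
        # Deterministic reward: the length of the completion text.
        return float(len(completion["content"]))

def test_single_and_parallel_agree():
    completions = [{"content": "a" * n} for n in range(8)]
    sequential = DummyReward(max_workers=1)(completions)
    parallel = DummyReward(max_workers=4)(completions)
    # Both execution modes must return identical, order-preserving results.
    assert sequential == parallel
```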
Hey @0x404, I am looking at GRPO training with code rewards today, so I will try to get this PR cleaned up and merged.
Sure, thanks. I've been busy with other things the past couple of days and haven't had time to update this PR.
Hi @0x404, thanks for the PR! Aside from code execution with E2B, do you know which (if any) of the reward functions are slow to execute? If it's just the E2B sandbox, I'm wondering if it's better to use their native async sandbox instead of enforcing multi-threading on all rewards.
Hi @lewtun, I think currently only the E2B rewards are relatively slow.
Thanks, let me take a stab at making it async first since I am quite partial to the simplicity of having simple functions per reward. |
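For reference, the async variant being described might look roughly like this; `run_code_in_sandbox` is a hypothetical stand-in for an awaitable sandbox call such as E2B's async API, not a real function from this repo:

```python
import asyncio
from typing import Dict, List

async def run_code_in_sandbox(code: str) -> bool:
    # Hypothetical placeholder for an awaitable sandbox evaluation.
    await asyncio.sleep(0)  # stands in for real network/sandbox I/O
    return True

async def score_one(completion: Dict[str, str]) -> float:
    passed = await run_code_in_sandbox(completion["content"])
    return 1.0 if passed else 0.0

async def code_reward(completions: List[Dict[str, str]]) -> List[float]:
    # All sandbox evaluations run concurrently, while each reward
    # remains a plain function rather than a class.
    return list(await asyncio.gather(*(score_one(c) for c in completions)))
```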
Resolved with async by #484?
Yes.
Motivation:
Current reward functions take a list of completions and compute the reward for each example in that list serially. In some cases, computing the reward for a single example can take a long time, for example when running code evaluation in E2B. In such situations, computing the reward scores for the examples in parallel would improve training speed.
Approach:
This PR abstracts out a `BaseRewardFunction` class; each reward function inherits from this class and implements its own `reward_on_single_completion`, which receives a single example's completion along with its corresponding kwargs. This refactor should make the code clearer and help us support more reward functions in the future.
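A minimal sketch of that abstraction, assuming an `abc`-based design (the subclass and its reward logic below are illustrative, not taken from the PR):

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class BaseRewardFunction(ABC):
    @abstractmethod
    def reward_on_single_completion(self, completion: Dict[str, str], **kwargs) -> float:
        """Score one completion; subclasses supply the actual reward logic."""

class FormatReward(BaseRewardFunction):  # hypothetical example subclass
    def reward_on_single_completion(self, completion: Dict[str, str], **kwargs) -> float:
        # Illustrative rule: reward completions that contain a <think> block.
        return 1.0 if "<think>" in completion["content"] else 0.0
```

Keeping the per-completion logic in one abstract method lets the base class own the sequential/parallel dispatch, so subclasses never have to deal with concurrency themselves.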