feat: make reward functions to support parallel computation #398
Conversation
```python
def _single_thread_call(self, completions: List[Dict[str, str]], **kwargs) -> List[float]:
    results = []
    for idx, completion in enumerate(completions):
        # Prepare per-completion kwargs: list-valued kwargs hold one entry per
        # completion, so slice out this completion's entry; scalar kwargs are
        # shared across all completions.
        per_completion_kwargs = {}
        for key, value in kwargs.items():
            if isinstance(value, list):
                per_completion_kwargs[key] = value[idx]
            else:
                per_completion_kwargs[key] = value
        results.append(self.reward_on_single_completion(completion, **per_completion_kwargs))
    return results
```
The reason for implementing `_single_thread_call` is that some reward evaluations are not thread-safe. When `max_workers == 1`, we use `_single_thread_call` to sequentially compute the reward for each example. Currently, `math_verify.parse` is not thread-safe: huggingface/Math-Verify#22. This sequential execution mode ensures correct reward computation for non-thread-safe evaluators, even though it doesn't take advantage of parallel processing capabilities.
I'll update to multiprocessing instead of multithreading if you find this PR acceptable. (huggingface/Math-Verify#22 (comment))
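For context, a minimal sketch of how the base class might dispatch between the two modes; the `max_workers` attribute and the `ThreadPoolExecutor` fallback are assumptions about the design, not verbatim PR code:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

def __call__(self, completions: List[Dict[str, str]], **kwargs) -> List[float]:
    # Fall back to sequential execution for non-thread-safe evaluators.
    if self.max_workers == 1:
        return self._single_thread_call(completions, **kwargs)
    # Otherwise, score completions concurrently in a thread pool.
    with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
        futures = [
            pool.submit(
                self.reward_on_single_completion,
                completion,
                **{k: (v[idx] if isinstance(v, list) else v) for k, v in kwargs.items()},
            )
            for idx, completion in enumerate(completions)
        ]
        # Collect results in submission order so rewards stay aligned with inputs.
        return [f.result() for f in futures]
```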
Thanks for adding this! Can you add a test that compares single- vs multi-thread/process execution to ensure they both return the same results?
Sure.
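A sketch of such a test, assuming the PR's `BaseRewardFunction` is importable and accepts a `max_workers` argument (both assumptions; the `DummyReward` subclass is hypothetical, test-only code):

```python
from typing import Dict

# Assumed import path; adjust to wherever the PR defines the base class.
# from open_r1.rewards import BaseRewardFunction

class DummyReward(BaseRewardFunction):  # hypothetical test-only subclass
    def reward_on_single_completion(self, completion: Dict[str, str], **kwargs) -> float:
        # Deterministic reward: the length of the completion text.
        return float(len(completion["content"]))

def test_single_and_parallel_agree():
    completions = [{"content": "a" * n} for n in range(8)]
    sequential = DummyReward(max_workers=1)(completions)
    parallel = DummyReward(max_workers=4)(completions)
    # Both execution modes must return identical, order-preserving results.
    assert sequential == parallel
```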
Hey @0x404, I am looking at GRPO training with code rewards today, so I will try to get this PR cleaned up and merged.
Sure, thanks. I've been busy with other things the past couple of days and haven't had time to update this PR.
Hi @0x404, thanks for the PR! Aside from code execution with E2B, do you know which (if any) of the reward functions are slow to execute? If it's just the E2B sandbox, I'm wondering if it's better to use their native async sandbox instead of enforcing multi-threading on all rewards.
Hi @lewtun, I think currently only the E2B rewards are relatively slow.
Thanks, let me take a stab at making it async first since I am quite partial to the simplicity of having simple functions per reward. |
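For reference, the async variant being described might look roughly like this; `run_code_in_sandbox` is a hypothetical stand-in for an awaitable sandbox call such as E2B's async API, not a real function from this repo:

```python
import asyncio
from typing import Dict, List

async def run_code_in_sandbox(code: str) -> bool:
    # Hypothetical placeholder for an awaitable sandbox evaluation.
    await asyncio.sleep(0)  # stands in for real network/sandbox I/O
    return True

async def score_one(completion: Dict[str, str]) -> float:
    passed = await run_code_in_sandbox(completion["content"])
    return 1.0 if passed else 0.0

async def code_reward(completions: List[Dict[str, str]]) -> List[float]:
    # All sandbox evaluations run concurrently, while each reward
    # remains a plain function rather than a class.
    return list(await asyncio.gather(*(score_one(c) for c in completions)))
```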
Resolved with async by #484?
Yes.
Motivation:
Current reward functions take a list of completions and compute the reward for each example in that list serially. In some cases, computing the reward for a single example can take a long time, for example when running code evaluation in E2B. In such situations, computing the reward scores for the examples in parallel would improve training speed.
Approach:
This PR abstracts out a `BaseRewardFunction` class; each reward function inherits from this class and implements its own `reward_on_single_completion`, which receives a single example's completion along with its corresponding kwargs. This refactor should make the code clearer and help us support more reward functions in the future.
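A minimal sketch of that abstraction, assuming an `abc`-based design (the subclass and its reward logic below are illustrative, not taken from the PR):

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class BaseRewardFunction(ABC):
    @abstractmethod
    def reward_on_single_completion(self, completion: Dict[str, str], **kwargs) -> float:
        """Score one completion; subclasses supply the actual reward logic."""

class FormatReward(BaseRewardFunction):  # hypothetical example subclass
    def reward_on_single_completion(self, completion: Dict[str, str], **kwargs) -> float:
        # Illustrative rule: reward completions that contain a <think> block.
        return 1.0 if "<think>" in completion["content"] else 0.0
```

Keeping the per-completion logic in one abstract method lets the base class own the sequential/parallel dispatch, so subclasses never have to deal with concurrency themselves.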