feat: make reward functions support parallel computation #398


Open: wants to merge 4 commits into main

Conversation

@0x404 commented Feb 23, 2025

Motivation:
Current reward functions generally take a list of completions and compute the reward for each example in that list serially. In some cases, computing the reward for a single example can take a long time, for example when running code evaluation in E2B. In such situations, computing the reward scores for the examples in parallel would improve training speed.

Approach:
This PR abstracts out a BaseRewardFunction class; each reward function inherits from this class and implements its own reward_on_single_completion. This method receives the completion of a single example to be evaluated, along with its corresponding kwargs.

This refactor should make the code clearer and help us support more reward functions in the future.
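For illustration, a minimal sketch of the approach described above. Only BaseRewardFunction and reward_on_single_completion come from the PR; the AccuracyReward subclass and its solution kwarg are assumptions made up for this example.

from typing import Dict, List

class BaseRewardFunction:
    """Owns batching over completions; subclasses score one completion at a time."""

    def reward_on_single_completion(self, completion: Dict[str, str], **kwargs) -> float:
        raise NotImplementedError

    def __call__(self, completions: List[Dict[str, str]], **kwargs) -> List[float]:
        rewards = []
        for idx, completion in enumerate(completions):
            # List-valued kwargs carry one entry per example; scalar kwargs are broadcast.
            per_completion_kwargs = {
                key: value[idx] if isinstance(value, list) else value
                for key, value in kwargs.items()
            }
            rewards.append(self.reward_on_single_completion(completion, **per_completion_kwargs))
        return rewards

class AccuracyReward(BaseRewardFunction):
    # Hypothetical subclass for illustration: checks whether the reference
    # solution string appears in the completion text.
    def reward_on_single_completion(self, completion: Dict[str, str], solution: str = "", **kwargs) -> float:
        return 1.0 if solution and solution in completion.get("content", "") else 0.0

# Example usage: AccuracyReward()([{"content": "the answer is 4"}], solution=["4"]) -> [1.0]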

Comment on lines +42 to +53
def _single_thread_call(self, completions: List[Dict[str, str]], **kwargs) -> List[float]:
    # Sequential fallback: score each completion one at a time in the calling thread.
    results = []
    for idx, completion in enumerate(completions):
        # Prepare per-completion kwargs: list-valued kwargs are indexed per example,
        # scalar kwargs are passed through unchanged.
        per_completion_kwargs = {}
        for key, value in kwargs.items():
            if isinstance(value, list):
                per_completion_kwargs[key] = value[idx]
            else:
                per_completion_kwargs[key] = value
        results.append(self.reward_on_single_completion(completion, **per_completion_kwargs))
    return results
@0x404 (Author) commented:

The reason for implementing _single_thread_call is that some reward evaluations are not thread-safe. When max_workers==1, we use _single_thread_call to sequentially compute the reward for each example.

Currently, math_verify.parse is not thread-safe: huggingface/Math-Verify#22

This sequential execution mode ensures correct reward computation for non-thread-safe evaluators, even though it doesn't take advantage of parallel processing capabilities.
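For context, the parallel path for max_workers > 1 could look roughly like the sketch below. The parallel_rewards helper is an illustrative stand-alone version of the idea, not the PR's exact method; it mirrors the per-completion kwargs handling of _single_thread_call but fans the calls out over a thread pool.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

def parallel_rewards(
    reward_fn: Callable[..., float],
    completions: List[Dict[str, str]],
    max_workers: int,
    **kwargs,
) -> List[float]:
    # Score each completion in its own worker thread; pool.map preserves input order,
    # so the returned rewards line up with the input completions.
    def score(idx: int) -> float:
        per_completion_kwargs = {
            key: value[idx] if isinstance(value, list) else value
            for key, value in kwargs.items()
        }
        return reward_fn(completions[idx], **per_completion_kwargs)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score, range(len(completions))))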

@0x404 (Author) commented Feb 24, 2025

I'll update to multiprocessing instead of multithreading if you find this PR acceptable. (huggingface/Math-Verify#22 (comment))
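Switching to processes would mostly mean swapping the executor. A minimal sketch of that direction, with the caveat that process pools require picklable, module-level callables (the _score_one worker here is a hypothetical placeholder for the real per-example scoring):

from concurrent.futures import ProcessPoolExecutor
from typing import Dict, List, Tuple

def _score_one(task: Tuple[Dict[str, str], Dict]) -> float:
    # Hypothetical top-level worker; replace the body with the real scoring logic.
    completion, per_completion_kwargs = task
    return float(bool(completion.get("content")))

def multiprocess_rewards(completions: List[Dict[str, str]], max_workers: int, **kwargs) -> List[float]:
    tasks = [
        (completion, {k: v[i] if isinstance(v, list) else v for k, v in kwargs.items()})
        for i, completion in enumerate(completions)
    ]
    # On spawn-based platforms this should be called under an `if __name__ == "__main__":` guard.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_score_one, tasks))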

@edbeeching (Collaborator) left a comment


Thanks for adding this. Can you add a test that compares single- vs multi-thread/process execution to ensure they both return the same results?
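A test along those lines could look roughly like the following. The toy_reward function is an assumption used to keep the example deterministic; the real test would exercise the PR's BaseRewardFunction subclasses with max_workers set to 1 and to a larger value.

from concurrent.futures import ThreadPoolExecutor
from typing import Dict

def toy_reward(completion: Dict[str, str], solution: str = "") -> float:
    # Deterministic toy reward so sequential and threaded runs must agree.
    return 1.0 if solution and solution in completion.get("content", "") else 0.0

def test_sequential_and_threaded_rewards_match():
    completions = [{"content": "the answer is 4"}, {"content": "the answer is 5"}]
    solutions = ["4", "7"]

    sequential = [toy_reward(c, solution=s) for c, s in zip(completions, solutions)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        threaded = list(pool.map(lambda pair: toy_reward(pair[0], solution=pair[1]),
                                 zip(completions, solutions)))

    assert sequential == threaded == [1.0, 0.0]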

@0x404 (Author) commented Feb 26, 2025

Sure

@edbeeching (Collaborator) commented Feb 27, 2025

Hey @0x404, I am looking at GRPO training with code rewards today, so I will try and get this PR cleaned up and merged.

@0x404 (Author) commented Feb 27, 2025

Sure, thanks. I've been busy with other things these past couple of days and haven't had time to update this PR.

@lewtun (Member) commented Mar 3, 2025

Hi @0x404 thanks for the PR! Aside from code execution with E2B, do you know which (if any) of the reward functions are slow to execute?

If it's just the E2B sandbox, I'm wondering if it's better to use their native async sandbox instead of enforcing multi-threading on all rewards.
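For reference, the async shape described here could look roughly like the sketch below, keeping one plain function per reward and only going async internally. The run_one_in_sandbox coroutine is a hypothetical placeholder for an E2B code-execution call, since the exact sandbox API is not shown in this thread.

import asyncio
from typing import Dict, List

async def run_one_in_sandbox(completion: Dict[str, str]) -> float:
    # Hypothetical placeholder for an async E2B sandbox round trip.
    await asyncio.sleep(0)
    return 1.0

def code_reward(completions: List[Dict[str, str]], **kwargs) -> List[float]:
    # The reward keeps its simple synchronous signature; all sandbox calls
    # are issued concurrently with asyncio.gather and awaited together.
    async def run_all() -> List[float]:
        return await asyncio.gather(*(run_one_in_sandbox(c) for c in completions))

    return asyncio.run(run_all())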

@0x404 (Author) commented Mar 4, 2025

Hi @lewtun, I think currently only the E2B rewards are relatively slow.

@lewtun (Member) commented Mar 4, 2025

> Hi @lewtun, I think currently only the E2B rewards are relatively slow.

Thanks, let me take a stab at making it async first since I am quite partial to the simplicity of having simple functions per reward.

@edbeeching (Collaborator) commented
Resolved with async by #484?

@0x404 (Author) commented Mar 14, 2025

yes
