
[GRPO][Eval] Add letter counting eval #1574

Merged 18 commits into main on Mar 31, 2025

Conversation

@wizeng23 (Contributor) commented Mar 27, 2025

Description

  • Add eval config and GCP job to count letters
  • Add functions to enable GRPO datasets to output conversations
  • Delete unneeded args for GRPO dataset init
  • Add a custom evaluation function and register it
    • Create the src/oumi/evaluation/registry directory to hold registered eval fns
  • Add system prompt to improve eval performance
  • Add version restrictions to vllm installs; the latest version (0.8.2) requires torch 2.6.0, while we're on 2.5.

Tested on GCP.

  • Accuracy for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B: 74.9%
    • Without system prompt: 32.7%
  • Accuracy for deepseek-ai/DeepSeek-R1-Distill-Llama-8B: 93.4%
    • Without system prompt: 42.4%

Future work

  • Test remote inference
  • Support the BerryBench dataset
  • Run both datasets against the most popular models

Related issues

Towards OPE-1122

Before submitting

  • This PR only changes documentation. (You can ignore the following checks in that case)
  • Did you read the contributor guideline Pull Request guidelines?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

):
    count += 1

return EvaluationResult(
Reviewer (Contributor):

after the recent change, just return {"count_letters": {"accuracy": count / total}}

@wizeng23 (author):

Should just be {"accuracy": count / total}, right?

@penfever (Collaborator) commented Mar 27, 2025:

@wizeng23 @kaisopos , I think it would be nice to have a default behavior for this / future custom evals to (1) utilize standard metrics (most of the common ones are available in sklearn) and (2) return both balanced and unbalanced metrics by default (balanced accuracy, certain kinds of F1 scores are reasonable choices in most cases). You could group by letter count or by word difficulty (this is already done for BerryBench) for the balancing. Also for this particular eval, it makes sense to also report distance-sensitive scores; cosine similarity and MAE seem like reasonable choices.

@wizeng23 (author):

Thanks for the feedback Ben! I plan to do a follow-up PR to support BerryBench, but started with this dataset since we already had a working dataset class for it. Wanted to get this skeleton checked in first then check with you on the specifics of the eval for a future PR.

Utility functions to generate common metrics could be useful for folks writing custom evals, but IMO we may not want to be too opinionated on custom evals, so users have flexibility in how they run them. We could also wait for a few more custom eval examples to be checked in to see if there are common patterns among them. I'll mostly leave it up to Kostas for custom evals. For the letter counting eval though, happy to do whatever you think is most useful!
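For reference, the balanced and distance-sensitive metrics suggested above could be computed with sklearn roughly as follows. This is a sketch only: the label arrays are toy data, none of this code is part of the PR, and the suggested grouping by letter count or word difficulty is omitted.

```python
from sklearn.metrics import balanced_accuracy_score, f1_score, mean_absolute_error

# Toy data: gold letter counts vs. the model's extracted answers.
y_true = [3, 2, 1, 3, 2, 4]
y_pred = [3, 2, 1, 2, 3, 4]

metrics = {
    # Plain accuracy, as in the current eval.
    "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
    # Balanced accuracy averages recall per distinct letter count, so rare
    # counts aren't drowned out by common ones.
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    # Macro F1 weights each letter-count class equally.
    "f1_macro": f1_score(y_true, y_pred, average="macro"),
    # MAE is distance-sensitive: answering 4 when the truth is 3 is penalized
    # less than answering 10.
    "mae": mean_absolute_error(y_true, y_pred),
}
print(metrics)
```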

# Config to eval an LLM's ability to count letters in words.
#
# Requirements:
# - Log into WandB (`wandb login`) or disable `enable_wandb`
Reviewer (Contributor):

It is disabled by default, right? So maybe you should say something like:
"If you want to use WandB, log in (`wandb login`) and enable it with `enable_wandb`."

@wizeng23 (author):

I blindly copied that; it shouldn't be there, good catch! I did log a bug for wandb support for custom evals: https://linear.app/oumi/issue/OPE-1173

-    return self.transform_grpo_example(sample)
+    return self._transform_grpo_example(sample)

def conversation(self, idx: int) -> Conversation:
Reviewer (Contributor):

conversation and conversations seem very generic.
How come they are not defined in the BaseMapDataset?

@wizeng23 (author):

My take is that BaseMapDataset is a much more generic base class, which is why those functions were defined in BaseSftDataset. I'm copying these functions over almost line-for-line as the GRPO dataset class needs similar behavior. I'll leave it up to Oussama if it makes sense to move it into BaseMapDataset or not.
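For context, the accessors being discussed have roughly this shape. The `Conversation` dataclass and toy dataset below are simplified stand-ins for illustration, not Oumi's actual BaseSftDataset/BaseMapDataset code.

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    """Simplified stand-in for a chat conversation: just a list of messages."""
    messages: list

class ToyGrpoDataset:
    """Toy map-style dataset exposing per-example conversation accessors."""

    def __init__(self):
        self._data = [{"prompt": "How many r's are in 'strawberry'?"}]

    def __len__(self):
        return len(self._data)

    def _transform_grpo_example(self, sample):
        # Convert a raw GRPO sample into a chat-style conversation.
        return Conversation(messages=[{"role": "user", "content": sample["prompt"]}])

    def conversation(self, idx):
        """Return the conversation for a single example."""
        return self._transform_grpo_example(self._data[idx])

    def conversations(self):
        """Return conversations for every example in the dataset."""
        return [self.conversation(i) for i in range(len(self))]

print(ToyGrpoDataset().conversation(0))
```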

@@ -235,6 +235,8 @@ def _get_custom_evaluation_fn(task_name: Optional[str]) -> Callable:
"task name, which should be corresponding to a registered evaluation "
"function, using the decorator `@register_evaluation_function`."
)
# Import to ensure custom evaluation functions are added to REGISTRY.
import oumi.evaluation.registry as evaluation_registry # noqa: F401
Reviewer (Contributor):

How come this is not at the top of the file together with all the other imports?
Does this take too much time and we want a delayed initialization or something like that?
Is there any other reason? (it's weird to have an import inside a private function)

@wizeng23 (author):

This is a pattern we follow for other registry imports, e.g. `import oumi.datasets.grpo.rewards as grpo_rewards  # noqa: F401`. I believe this is to reduce the import cost of the registry, which was slowing down CLI runtime before; this lazy-loading pattern gets around that. I think Matthew would know best here.
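A minimal single-file sketch of that registry pattern: the decorator below is a simplified stand-in for `@register_evaluation_function`, and the comment in the lookup marks where the deferred import sits in the real code.

```python
from typing import Callable

REGISTRY: dict = {}

def register_evaluation_function(name: str) -> Callable:
    """Simplified stand-in decorator: records a function under `name`."""
    def _wrap(fn: Callable) -> Callable:
        REGISTRY[name] = fn
        return fn
    return _wrap

@register_evaluation_function("count_letters")
def count_letters_eval() -> dict:
    # Placeholder body; the real eval runs inference and scores the answers.
    return {"accuracy": 0.0}

def _get_custom_evaluation_fn(task_name: str) -> Callable:
    # In the real code, an `import oumi.evaluation.registry  # noqa: F401` sits
    # here: importing that package runs the registration decorators only when a
    # custom eval is requested, instead of on every CLI invocation.
    if task_name not in REGISTRY:
        raise KeyError(f"No evaluation function registered under {task_name!r}")
    return REGISTRY[task_name]

print(_get_custom_evaluation_fn("count_letters")())  # -> {'accuracy': 0.0}
```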



def _extract_prediction(response: str) -> Optional[int]:
r"""Returns the numeric answer extracted from `\boxed{...}`, or returns None."""
Reviewer (Contributor):

Do you need this `r` at the beginning of the docstring?

Reviewer (Contributor):

super super nit: "Returns A or B" (instead of "Returns A, or returns B")

@wizeng23 (author):

Yes, it's needed whenever there's a backslash \ in the text. Addressed the second point!
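To spell out why the `r` prefix matters here (a standalone demonstration, not the PR's code):

```python
# In a normal string literal, "\b" is the backspace escape, so the docstring
# text would be silently corrupted; a raw string keeps the backslash literal.
normal = "extracted from `\boxed{...}`"
raw = r"extracted from `\boxed{...}`"
print(repr(normal))  # '\x08' appears where the backslash should be
print(repr(raw))     # backslash preserved: 'extracted from `\\boxed{...}`'
assert "\x08" in normal and "\\boxed" in raw
```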

number_str = regex_result[0]
# Except clause shouldn't trigger because the regex should only find ints.
try:
    return int(number_str)
Reviewer (Contributor):

Does this work if number_str is a bool?
Does int(False) throw or is it 0?

Reviewer (Contributor):

I would double-check this works for corner cases.

@wizeng23 (author):

AFAIK line 32 should always work, but I put the except clause there as a precaution. This is because the regex is explicitly only grabbing a sequence of numeric digits from the string, which should be able to convert into an int.
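A self-contained sketch of the extraction path under discussion: the exact regex in the PR isn't visible in this thread, so the pattern below is an assumption, but it shows why `int()` never sees a bool here and why the conversion should always succeed.

```python
import re
from typing import Optional

def extract_prediction(response: str) -> Optional[int]:
    r"""Return the integer inside `\boxed{...}` or None (assumed regex, for illustration)."""
    regex_result = re.findall(r"\\boxed\{(\d+)\}", response)
    if not regex_result:
        return None
    number_str = regex_result[0]
    # findall returns digit-only strings here, so int() never receives a bool
    # and shouldn't fail; the try/except in the PR is just a precaution.
    try:
        return int(number_str)
    except ValueError:
        return None

print(extract_prediction(r"The count is \boxed{3}."))  # 3
print(extract_prediction("No boxed answer here."))     # None
print(int(False))  # 0: int() on a bool doesn't throw, it returns 0 or 1
```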

):
    count += 1

return {"accuracy": count / total}
Reviewer (Contributor):

Very debatable but just FYI:
if your letter counter fails to produce a prediction (prediction is None), then you count this as a model failure, while it could be a prompt failure (your prompt may work better for some models vs others).
An alternative (not sure which solution is fairer; maybe check with Jeremy too): if the prediction is None, ignore the example (count does not increase, but total is also decremented).

@wizeng23 (author):

I think there's value in both. While part of the responsibility is on the user to prompt the model well, some models may not be good enough at instruction following to properly format the example, which is a different type of failure than the failure to count letters.

I now log both metrics in addition to some other useful numbers. Feedback on the names welcome!
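The metric split described above might look roughly like this; the metric names are illustrative, since the actual keys logged by the PR aren't shown in this thread.

```python
from typing import Optional

def summarize(predictions: list[Optional[int]], targets: list[int]) -> dict:
    """Report accuracy over all examples and over parseable answers only."""
    total = len(targets)
    parsed = [(p, t) for p, t in zip(predictions, targets) if p is not None]
    correct = sum(p == t for p, t in parsed)
    return {
        # Unparseable answers count as misses (formatting failures penalized).
        "accuracy_all": correct / total if total else 0.0,
        # Unparseable answers ignored (measures only the counting skill).
        "accuracy_parsed_only": correct / len(parsed) if parsed else 0.0,
        # How often an answer could be extracted at all.
        "parse_rate": len(parsed) / total if total else 0.0,
    }

print(summarize([3, None, 2, 4], [3, 1, 3, 4]))
# -> accuracy_all 0.5, accuracy_parsed_only ~0.67, parse_rate 0.75
```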

@wizeng23 merged commit eae46dd into main on Mar 31, 2025 (1 of 2 checks passed).
@wizeng23 deleted the wizeng/o1122-letter-count branch on March 31, 2025 at 21:37.