[GRPO][Eval] Add letter counting eval #1574
Conversation
```python
        ):
            count += 1

    return EvaluationResult(
```
after the recent change, just return `{"count_letters": {"accuracy": count / total}}`
Should just be `{"accuracy": count / total}`, right?
@wizeng23 @kaisopos , I think it would be nice to have a default behavior for this / future custom evals to (1) utilize standard metrics (most of the common ones are available in sklearn) and (2) return both balanced and unbalanced metrics by default (balanced accuracy, certain kinds of F1 scores are reasonable choices in most cases). You could group by letter count or by word difficulty (this is already done for BerryBench) for the balancing. Also for this particular eval, it makes sense to also report distance-sensitive scores; cosine similarity and MAE seem like reasonable choices.
Thanks for the feedback Ben! I plan to do a follow-up PR to support BerryBench, but started with this dataset since we already had a working dataset class for it. I wanted to get this skeleton checked in first, then check with you on the specifics of the eval for a future PR.
Utility functions to generate common metrics could be useful for folks writing custom evals, but IMO we may not want to be too opinionated on custom evals, so users have flexibility in how they run them. We could also wait for a few more custom eval examples to be checked in to see if there are common patterns among them. I'll mostly leave it up to Kostas though for custom evals. For the letter counting eval though, happy to do whatever you think is most useful!
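For reference, a minimal sketch of what the suggested default metrics could look like using sklearn, as mentioned above; the helper name and metric keys are illustrative and not part of this PR:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score, mean_absolute_error


def default_count_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Computes unbalanced, balanced, and distance-sensitive metrics for letter counts."""
    return {
        # Plain (unbalanced) accuracy.
        "accuracy": sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true),
        # Balanced variants: each true count value contributes equally.
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        # Distance-sensitive score: being off by 1 is penalized less than being off by 5.
        "mae": mean_absolute_error(y_true, y_pred),
    }
```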
```yaml
# Config to eval an LLM's ability to count letters in words.
#
# Requirements:
#   - Log into WandB (`wandb login`) or disable `enable_wandb`
```
It is disabled by default, right? So maybe you should say something like: "If you want to use WandB, log in (`wandb login`) and enable it with `enable_wandb`."
I blindly copied that; it shouldn't be there, good catch! I did log a bug for wandb support for custom evals: https://linear.app/oumi/issue/OPE-1173
```diff
-        return self.transform_grpo_example(sample)
+        return self._transform_grpo_example(sample)

     def conversation(self, idx: int) -> Conversation:
```
`conversation` and `conversations` seem very generic. How come they are not defined in the `BaseMapDataset`?
My take is that BaseMapDataset is a much more generic base class, which is why those functions were defined in BaseSftDataset. I'm copying these functions over almost line-for-line as the GRPO dataset class needs similar behavior. I'll leave it up to Oussama if it makes sense to move it into BaseMapDataset or not.
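For context, a generic sketch of the accessor pattern being discussed; these are simplified stand-ins, not Oumi's actual classes:

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Simplified stand-in for a conversation of chat messages."""
    messages: list[dict] = field(default_factory=list)


class LetterCountingDataset:
    """Map-style dataset exposing per-index and whole-dataset conversation accessors."""

    def __init__(self, data: list[dict]):
        self._data = data

    def conversation(self, idx: int) -> Conversation:
        # Builds a single Conversation from the raw example at `idx`.
        example = self._data[idx]
        return Conversation(messages=[{"role": "user", "content": example["prompt"]}])

    def conversations(self) -> list[Conversation]:
        # Materializes every example as a Conversation.
        return [self.conversation(i) for i in range(len(self._data))]
```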
```diff
@@ -235,6 +235,8 @@ def _get_custom_evaluation_fn(task_name: Optional[str]) -> Callable:
             "task name, which should be corresponding to a registered evaluation "
             "function, using the decorator `@register_evaluation_function`."
         )
+    # Import to ensure custom evaluation functions are added to REGISTRY.
+    import oumi.evaluation.registry as evaluation_registry  # noqa: F401
```
How come this is not on the top of the file together with all the other imports?
Does this take too much time and we want a delayed initialization or something like that?
Is there any other reason? (it's weird to have an import inside a private function)
This is a pattern we follow for other registry imports, e.g. `oumi/src/oumi/builders/rewards.py`, line 26 in `b66ff54`:

```python
import oumi.datasets.grpo.rewards as grpo_rewards  # noqa: F401
```
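For context, a stripped-down, self-contained sketch of why a deferred registry import helps: the registry only gets populated when the module defining the decorated functions is actually executed. The names below are illustrative, not Oumi's real implementation (here the decorated function lives in the same file purely to keep the example runnable):

```python
from typing import Callable

_EVAL_REGISTRY: dict[str, Callable] = {}


def register_evaluation_function(name: str) -> Callable:
    """Decorator that adds a function to the registry under `name`."""
    def decorator(fn: Callable) -> Callable:
        _EVAL_REGISTRY[name] = fn
        return fn
    return decorator


@register_evaluation_function("count_letters")
def count_letters_eval() -> dict:
    # Dummy body for illustration only.
    return {"accuracy": 1.0}


def get_custom_evaluation_fn(task_name: str) -> Callable:
    # In a real package the decorated functions live in a separate module, so a
    # deferred `import ...registry  # noqa: F401` here (rather than a top-level
    # import) ensures the decorators have run without risking a circular import.
    return _EVAL_REGISTRY[task_name]


assert get_custom_evaluation_fn("count_letters")() == {"accuracy": 1.0}
```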
```python
def _extract_prediction(response: str) -> Optional[int]:
    r"""Returns the numeric answer extracted from `\boxed{...}`, or returns None."""
```
do you need this `r` at the beginning of the docstring?
super super nit: "Returns A or B" (instead of "Returns A, or returns B")
yes, it's needed whenever there's a backslash `\` in the text. Addressed the second point!
```python
    number_str = regex_result[0]
    # Except clause shouldn't trigger because the regex should only find ints.
    try:
        return int(number_str)
```
Does this work if `number_str` is a bool? Does `int(False)` throw, or is it 0?
I would double-check this works for corner cases.
AFAIK line 32 should always work, but I put the except clause there as a precaution: the regex explicitly grabs only a sequence of numeric digits from the string, which should always convert cleanly to an int.
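For reference, an illustrative sketch of the extraction being discussed; the regex and names are assumptions rather than the PR's exact code:

```python
import re
from typing import Optional

# Only digits are captured, so int() on the match should always succeed.
_BOXED_ANSWER_REGEX = re.compile(r"\\boxed\{(\d+)\}")


def extract_prediction(response: str) -> Optional[int]:
    r"""Returns the integer inside `\boxed{...}`, or None if there is no match."""
    match = _BOXED_ANSWER_REGEX.search(response)
    if match is None:
        return None
    try:
        return int(match.group(1))
    except ValueError:
        # Defensive only: the capture group is \d+, so this should not trigger.
        return None


assert extract_prediction(r"The answer is \boxed{3}.") == 3
assert extract_prediction("no boxed answer here") is None
```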
```python
        ):
            count += 1

    return {"accuracy": count / total}
```
Very debatable, but just FYI: if your letter counter fails to produce a prediction (prediction is None), you count this as a model failure, while it could be a prompt failure (your prompt may work better for some models than others).
An alternative (not sure which solution is more fair; maybe check with Jeremy too): if the prediction is None, ignore the example (`count` does not increase, but `total` is also decremented).
I think there's value in both. While part of the responsibility is on the user to prompt the model well, some models may not be good enough at instruction following to properly format the example, which is a different type of failure than the failure to count letters.
I now log both metrics in addition to some other useful numbers. Feedback on the names welcome!
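For reference, a minimal sketch of reporting both flavors of accuracy discussed above; the metric names here are placeholders, not necessarily the ones added in the PR:

```python
from typing import Optional


def count_metrics(predictions: list[Optional[int]], targets: list[int]) -> dict[str, float]:
    total = len(targets)
    parsed = [(p, t) for p, t in zip(predictions, targets) if p is not None]
    correct = sum(1 for p, t in parsed if p == t)
    return {
        # Unparseable responses count as misses (instruction-following failure).
        "accuracy": correct / total if total else 0.0,
        # Unparseable responses are excluded (pure counting ability).
        "parsed_accuracy": correct / len(parsed) if parsed else 0.0,
        # How often the model produced an extractable answer at all.
        "parse_rate": len(parsed) / total if total else 0.0,
    }
```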
Description
- Adds `src/oumi/evaluation/registry` directory for holding registered eval fns.

Tested on GCP:
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`: 74.9%
- `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`: 93.4%

Future work
Related issues
Towards OPE-1122
Before submitting