Adaptive evaluation #3397
Conversation
Thanks! Overall this looks good and I think we can merge it with some changes. Questions:
High level feedback:
In terms of next steps, we should resolve the questions in the section above, and then you can let me know when this is ready for a full code review, and I'll send you requested changes. Alternatively, you can set up some time to do some synchronous pair programming if you would prefer.
Hi @yifanmai! Thank you so much for your comments. We addressed most of them in the latest commit and have included our replies below.

Questions:
- We aim to make the tool production-ready and to support most benchmarks in HELM.
- We will upload the difficulties to HuggingFace.
- We plan to set a default initial value for the model ability (e.g., 0). Users can also set it themselves (e.g., when they have good prior knowledge of the model's ability).
- We will provide a separate Python package that computes the calibration and determines the difficulties of the questions of new datasets.

High-level feedback:
We have subclassed a
We have changed all the
Besides AIR-Bench, we also implemented our method on MMLU, which might not have the problems above.
We now use the `run_spec` to find the main metric name and the list of metrics to search.
I would appreciate any suggestions you might have for further improvement.
Yes, I will find a time on your calendar so we can meet for pair programming. In the meantime, please let me know if you have any comments or feedback.
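As a rough illustration of what the calibration package mentioned in the replies above might do, here is a hypothetical sketch that fits Rasch item difficulties from a response matrix, assuming the model abilities are already known (all names and the gradient-ascent details are assumptions, not the authors' actual package):

```python
import math

def calibrate_difficulties(abilities, responses, lr=0.1, steps=500):
    """Estimate Rasch item difficulties from a response matrix.

    abilities[m] is model m's (known) ability; responses[m][i] is 1 if
    model m answered item i correctly, else 0. Each difficulty is fit
    independently by gradient ascent on that item's log-likelihood.
    """
    n_items = len(responses[0])
    difficulties = [0.0] * n_items
    for _ in range(steps):
        for i in range(n_items):
            grad = 0.0
            for theta, row in zip(abilities, responses):
                # P(correct) under the Rasch model
                p = 1.0 / (1.0 + math.exp(-(theta - difficulties[i])))
                grad += p - row[i]  # d(log-likelihood)/d(difficulty)
            difficulties[i] += lr * grad
    return difficulties
```

For example, an item answered correctly only by the strongest of three models ends up with a difficulty above the average ability.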
Thanks for the responses. I think your plan makes sense. Left you more detailed comments - we can merge after these are addressed.
Great, this approach looks good. I noticed there are still some missing pieces in the current implementation; please let me know when those are ready for review. |
Thanks! Looks good but you have to fix the type errors before we can merge.
docs/reeval.md (Outdated)

@@ -0,0 +1,32 @@

# Reliable and Efficient Amortized Model-based Evaluation

Reliable and Efficient Amortized Model-based Evaluation (Reeval) is an extension of the HELM framework for using Computerized Adaptive Testing (CAT) within the framework of Item Response Theory (IRT) to adaptively evaluate Large Language Models (LLMs). This approach selects the next question whose difficulty is closest to the estimated model ability, thereby reliably and efficiently eliciting the model's ability. The difficulties of the questions are provided on HuggingFace: [`stair-lab/reeval-difficulty-for-helm`](https://huggingface.co/datasets/stair-lab/reeval-difficulty-for-helm), which currently supports 22 scenarios in HELM. The paper's authors will supply a Python package for calculating these difficulties and will support more scenarios in the future.
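The selection rule described in the doc above fits in a few lines; this is an illustrative sketch only (the function and variable names are mine, not the PR's code):

```python
def select_next_question(ability: float, difficulties: dict) -> str:
    """Pick the unanswered question whose difficulty is closest to ability."""
    return min(difficulties, key=lambda qid: abs(difficulties[qid] - ability))

# With an estimated ability of 0.4, the closest-difficulty item is chosen.
pool = {"q1": -1.2, "q2": 0.3, "q3": 1.5}
print(select_next_question(0.4, pool))  # -> q2
```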
nit: "(Reeval)" -> "(REEval)"
src/helm/benchmark/reeval_runner.py (Outdated)

```python
if iteration > 0:
    prev_ability = ability.clone()
    prev_loss = loss.clone()
```
Error:

```
src/helm/benchmark/reeval_runner.py:127: error: "float" has no attribute "clone"  [attr-defined]
```

`loss` is a float rather than a tensor (doc), so:

```python
prev_loss = loss
```
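For context, a fix like this sits inside an ability-estimation loop. The following self-contained sketch of maximum-likelihood ability estimation under a Rasch model (names and the gradient-ascent details are assumptions, not the PR's actual code) shows why a plain `prev_loss = loss` assignment is enough when the loss is a Python float:

```python
import math

def estimate_ability(responses, difficulties, lr=0.1, tol=1e-6, max_iter=200):
    """Maximum-likelihood ability estimate under a Rasch model.

    responses[i] is 1 if the model answered item i correctly, else 0;
    difficulties[i] is that item's calibrated difficulty. The negative
    log-likelihood `loss` is a plain float, so the convergence check
    saves it with `prev_loss = loss` (no `.clone()` needed).
    """
    ability = 0.0  # default initial ability
    prev_loss = math.inf
    for iteration in range(max_iter):
        loss, grad = 0.0, 0.0
        for y, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(ability - b)))  # P(correct)
            loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            grad += y - p  # d(log-likelihood)/d(ability)
        if iteration > 0 and abs(prev_loss - loss) < tol:
            break  # converged
        prev_loss = loss  # a float: assignment copies the value
        ability += lr * grad  # gradient ascent on the log-likelihood
    return ability
```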
Merged. Thanks for your hard work!
Improving evaluation efficiency by selecting informative questions via a priority-queue dataloader. We demonstrate adaptive evaluation on AIR-Bench. To run:
Please let us know if you have any comments or feedback.
Addresses #3323
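The priority-queue dataloader mentioned in the description could, in spirit, rank questions by their Fisher information at the current ability estimate; this is a hypothetical sketch (the PR's actual criterion and API may differ):

```python
import heapq
import math

def build_queue(ability, difficulties):
    """Order questions by Fisher information at the current ability estimate.

    Under a Rasch model, the information of an item with difficulty b at
    ability theta is p * (1 - p), where p = sigmoid(theta - b). heapq is
    a min-heap, so information is negated to pop the most informative
    item first.
    """
    heap = []
    for qid, b in difficulties.items():
        p = 1.0 / (1.0 + math.exp(-(ability - b)))
        heapq.heappush(heap, (-(p * (1.0 - p)), qid))
    return heap

# The item whose difficulty best matches the ability is the most informative.
heap = build_queue(0.0, {"easy": -2.0, "matched": 0.1, "hard": 2.5})
print(heapq.heappop(heap)[1])  # -> matched
```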