
Conversation

@sangttruong sangttruong commented Mar 2, 2025

This PR improves evaluation efficiency by selecting informative questions via a priority-queue dataloader. We demonstrate adaptive evaluation on AIR-Bench. To run:

export SUITE_NAME=adaptive
export MODELS_TO_RUN=meta-llama/Llama-3.2-3B-Instruct
export HF_MODEL=meta-llama/Llama-3.2-3B-Instruct
export RUN_ENTRIES_CONF_PATH=src/helm/benchmark/presentation/run_entries_adaptive_air_bench.conf
export SCHEMA_PATH=schema_air_bench.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=10000
export PRIORITY=2
helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN --enable-huggingface-models $HF_MODEL --dry-run
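As a rough illustration of the idea (not the HELM API; the names and the single-step ability update below are simplifications), a priority-queue selection loop under a Rasch-style item response model might look like:

```python
import math

def adaptive_eval(difficulties, answer_fn, n_items=3, ability=0.0):
    """Each round, pick the most informative remaining item -- the one whose
    difficulty is closest to the current ability estimate, where Rasch-model
    Fisher information peaks -- observe the response, and nudge the ability
    estimate with a single gradient step (real code iterates to convergence)."""
    asked = set()
    for _ in range(n_items):
        # Highest-priority item: unasked, with difficulty nearest current ability.
        i = min((j for j in range(len(difficulties)) if j not in asked),
                key=lambda j: abs(difficulties[j] - ability))
        asked.add(i)
        y = answer_fn(i)                                  # 1 = correct, 0 = wrong
        p = 1.0 / (1.0 + math.exp(-(ability - difficulties[i])))
        ability += 0.5 * (y - p)                          # gradient of log-likelihood
    return ability, asked
```

Items far above or below the current ability estimate contribute little information, so they sink in the queue and can be skipped when the evaluation budget runs out.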

Please let us know if you have any comments or feedback.

Addresses #3323

yifanmai commented Mar 4, 2025

Thanks! Overall this looks good and I think we can merge it with some changes.

Questions:

  • What is the positioning of this tool? Is it meant to only be used to reproduce the paper's results, or to run new models on the supported benchmarks, or to support all benchmarks in HELM eventually? What level of experimental vs production-ready is it?
  • Currently the model ability and instance ability parameters are unspecified. Are you planning to provide the parameters from the paper as a dataset?
  • For the initial model ability parameter for new models, are you planning to provide parameters for new models, or should the users should provide the parameters themselves, or should the initial model ability be set to some constant?
  • Is there going to be a separate tool to run the EM for the model ability and instance difficulty parameters for new models and datasets, or will the parameters be only available for the models and datasets from the original paper?

High level feedback:

  • Changes to core framework code should be minimized, to avoid increasing software complexity and maintenance costs, and to minimize the risk of breakages for existing users. In particular, you should subclass Runner and make your changes there, rather than changing Runner itself. Users can use your runner subclass using the --runner-class parameter.
  • If this is an experimental tool, it should be made clear to users that it may be unsupported or may not work for all use cases. I'm not a fan of naming this tool "adaptive" for this reason - it is a very generic name, and I prefer names that are derived from the paper title or method name.
  • Unfortunately AIR-Bench is not a good application for this, because the objective of that dataset is to give a score for each safety category, whereas the adaptive method only aims to estimate the top level score. You could get unreliable or missing category score estimates if the adaptive sampler mostly or entirely skips over certain categories. I suppose the dataset selection cannot be changed easily at this point.
  • Your code assumes that the first metric in the list of metrics is the main accuracy metric. I'm not sure if this is true for all your selected datasets, but it is not true generally for all datasets.

In terms of next steps, we should resolve the questions in the section above, and then you can let me know when this is ready for a full code review, and I'll send you requested changes. Alternatively, you can set up some time to do some synchronous pair programming if you would prefer.

@sangttruong

Hi @yifanmai! Thank you so much for your comment. We addressed most of them in the latest commit and included the reply to your query below.

Questions

What is the positioning of this tool? Is it meant to be used only to reproduce the paper's results, to run new models on the supported benchmarks, or to support all benchmarks in HELM eventually? What level of experimental vs production-ready is it?

We aim to make the tool production-ready and support most benchmarks in HELM.

Currently, the model ability and instance ability parameters are unspecified. Are you planning to provide the parameters from the paper as a dataset?

We will upload the difficulty parameters to HuggingFace.

For the initial model ability parameter for new models, are you planning to provide parameters for new models, or should the users provide the parameters themselves, or should the initial model ability be set to some constant?

We plan to set a default initial value of model ability (e.g., 0). The users can also set it (e.g., when they have good prior knowledge about the model's ability).
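For illustration, a minimal sketch of what such an ability update could look like under a Rasch (1PL) model, starting from the default ability of 0 (the function and parameter names are hypothetical, not part of this PR):

```python
import math

def update_ability(difficulties, responses, ability=0.0, lr=0.5, steps=100):
    """Maximum-likelihood ability estimate under a Rasch (1PL) model via
    gradient ascent on the log-likelihood, starting from a default of 0.0."""
    for _ in range(steps):
        grad = 0.0
        for b, y in zip(difficulties, responses):
            p = 1.0 / (1.0 + math.exp(-(ability - b)))  # P(correct | ability, b)
            grad += y - p                               # d log-likelihood / d ability
        ability += lr * grad / len(difficulties)
    return ability
```

A user with good prior knowledge would simply pass a different starting `ability` instead of 0.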

Is there going to be a separate tool to run the EM for the model ability and instance difficulty parameters for new models and datasets, or will the parameters be only available for the models and datasets from the original paper?

We will provide a separate Python package to compute the calibration and determine the difficulties of the questions of new datasets.

High-level feedback

Changes to core framework code should be minimized to avoid increasing software complexity and maintenance costs and to minimize the risk of breakages for existing users. In particular, you should subclass Runner and make your changes there rather than changing Runner itself. Users can use your runner subclass using the --runner-class parameter.

We have subclassed Runner as ReevalRunner in a separate file, reeval_runner.py, and run the code using --runner-class-name helm.benchmark.reeval_runner.ReevalRunner. The original Runner code now stays unchanged.
We also added two parameters, reeval_mode: bool = False and reeval_max_samples: int = 50, to src/helm/benchmark/run_spec.py. We then create some files for each dataset. Using AIR-Bench as an example: we create run_entries_adaptive_air_bench.conf and add the function get_adaptive_air_bench_2024_spec to air_bench_run_specs.py.

If this is an experimental tool, it should be made clear to users that the tool may be unsupported or may not work for all use cases generally. I'm not a fan of naming this tool "adaptive" for this reason - it is a very generic name, and I prefer names that are derived from the paper title or method name.

We have changed all the "adaptive" names in the code to "reeval" (reliable and efficient evaluation), which is an abbreviation of our method name. We can be very flexible with the naming. We will note in the documentation which benchmarks are supported by the tool.

Unfortunately, AIR-Bench is not a good application for this because the objective of that dataset is to give a score for each safety category, whereas the adaptive method only aims to estimate the top-level score. You could get unreliable or missing category score estimates if the adaptive sampler mostly or entirely skips over certain categories. I suppose the dataset selection cannot be changed easily at this point.

Besides AIR-Bench, we implement our method on MMLU, which should not have the above problem.
To run MMLU using ReevalRunner:

cd src
export SUITE_NAME=subclass_reeval_mmlu_simple_model1
export MODELS_TO_RUN=simple/model1
export RUN_ENTRIES_CONF_PATH=helm/benchmark/presentation/run_entries_reeval_mmlu.conf
export SCHEMA_PATH=schema_mmlu.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=10000
export PRIORITY=4
helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN --runner-class-name helm.benchmark.reeval_runner.ReevalRunner

Your code assumes that the first metric in the list of metrics is the main accuracy metric. I'm not sure if this is true for all your selected datasets, but it is not generally true for all datasets.

We now use the run_spec to find the main metric name and search the list of metrics for it:

    scenario_metric_name = run_spec.metric_specs[0].args["names"][0]
    scenario_metric_value = [s for s in per_instance_stat[0].stats if s.name.name == scenario_metric_name][0].mean
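To make the lookup concrete, here is a self-contained sketch with minimal stand-in classes (the real HELM RunSpec, MetricSpec, and Stat types carry more fields; the shapes here are assumptions for illustration):

```python
from dataclasses import dataclass

# Minimal stand-ins for the HELM structures involved (shapes assumed).
@dataclass
class MetricName:
    name: str

@dataclass
class Stat:
    name: MetricName
    mean: float

@dataclass
class MetricSpec:
    args: dict

@dataclass
class RunSpec:
    metric_specs: list

@dataclass
class PerInstanceStats:
    stats: list

run_spec = RunSpec(metric_specs=[MetricSpec(args={"names": ["exact_match"]})])
per_instance_stat = [PerInstanceStats(stats=[Stat(MetricName("exact_match"), 1.0)])]

# The lookup: main metric name from the run spec, then the matching per-instance stat.
scenario_metric_name = run_spec.metric_specs[0].args["names"][0]
scenario_metric_value = [
    s for s in per_instance_stat[0].stats if s.name.name == scenario_metric_name
][0].mean
```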

I would appreciate any suggestions you might have for further improvement.

In terms of the next steps, we should resolve the questions in the section above. Then you can let me know when this is ready for a full code review, and I'll send you the requested changes. Alternatively, you can set up some time to do some synchronous pair programming if you would prefer.

Yes, I will find a time in your calendar so we can meet for pair programming. In the meantime, please let me know if you have any comments or feedback.

@yifanmai yifanmai left a comment

Thanks for the responses. I think your plan makes sense. Left you more detailed comments - we can merge after these are addressed.

@yifanmai

Great, this approach looks good. I noticed there are still some missing pieces in the current implementation; please let me know when those are ready for review.

@sangttruong sangttruong marked this pull request as ready for review March 21, 2025 14:27
@sangttruong sangttruong requested a review from yifanmai March 22, 2025 07:21
@yifanmai yifanmai changed the title 3323 adaptive evaluation Adaptive evaluation Mar 25, 2025
@sangttruong sangttruong requested a review from yifanmai April 1, 2025 19:06
@yifanmai yifanmai left a comment


Thanks! Looks good but you have to fix the type errors before we can merge.

docs/reeval.md Outdated
@@ -0,0 +1,32 @@
# Reliable and Efficient Amortized Model-based Evaluation

Reliable and Efficient Amortized Model-based Evaluation (Reeval) is an extension of the HELM framework for using Computerized Adaptive Testing (CAT) within the framework of Item Response Theory (IRT) to adaptively evaluate Large Language Models (LLMs). This approach selects the next question whose difficulty is closest to the estimated model ability, thereby reliably and efficiently eliciting the model's ability. The difficulties of the questions are provided on HuggingFace: [`stair-lab/reeval-difficulty-for-helm`](https://huggingface.co/datasets/stair-lab/reeval-difficulty-for-helm), which currently supports 22 scenarios in HELM. The paper's authors will supply a Python package for calculating these difficulties and will support more scenarios in the future.

nit: "(Reeval)" -> "(REEval)"


if iteration > 0:
    prev_ability = ability.clone()
    prev_loss = loss.clone()

Error:

src/helm/benchmark/reeval_runner.py:127: error: "float" has no attribute "clone"  [attr-defined]

loss is a float rather than a tensor (doc), so:

prev_loss = loss
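The reasoning behind this fix, illustrated without torch: a Python float (e.g. the result of tensor.item()) is immutable, so plain assignment already snapshots the value, whereas a tensor is mutable and needs .clone():

```python
# Floats are immutable: rebinding `loss` later cannot change `prev_loss`,
# so no .clone() is needed for the snapshot.
loss = 0.75          # e.g. the float returned by tensor.item()
prev_loss = loss     # plain assignment is a safe copy
loss = 0.50          # a later iteration rebinds the name
assert prev_loss == 0.75
```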

@sangttruong sangttruong requested a review from yifanmai April 2, 2025 15:01
@yifanmai yifanmai merged commit ef9772f into stanford-crfm:main Apr 2, 2025
8 checks passed
yifanmai commented Apr 2, 2025

Merged. Thanks for your hard work!
