
Conversation

@sangttruong sangttruong commented Mar 2, 2025

This PR improves evaluation efficiency by selecting informative questions via a priority-queue dataloader. We demonstrate adaptive evaluation on AIR-Bench. To run:

export SUITE_NAME=adaptive
export MODELS_TO_RUN=meta-llama/Llama-3.2-3B-Instruct
export HF_MODEL=meta-llama/Llama-3.2-3B-Instruct
export RUN_ENTRIES_CONF_PATH=src/helm/benchmark/presentation/run_entries_adaptive_air_bench.conf
export SCHEMA_PATH=schema_air_bench.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=10000
export PRIORITY=2
helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN --enable-huggingface-models $HF_MODEL --dry-run
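As a rough illustration of the idea (not the HELM API; the names and the single-step ability update below are simplifications), a priority-queue selection loop under a Rasch-style item response model might look like:

```python
import math

def adaptive_eval(difficulties, answer_fn, n_items=3, ability=0.0):
    """Each round, pick the most informative remaining item -- the one whose
    difficulty is closest to the current ability estimate, where Rasch-model
    Fisher information peaks -- observe the response, and nudge the ability
    estimate with a single gradient step (real code iterates to convergence)."""
    asked = set()
    for _ in range(n_items):
        # Highest-priority item: unasked, with difficulty nearest current ability.
        i = min((j for j in range(len(difficulties)) if j not in asked),
                key=lambda j: abs(difficulties[j] - ability))
        asked.add(i)
        y = answer_fn(i)                                  # 1 = correct, 0 = wrong
        p = 1.0 / (1.0 + math.exp(-(ability - difficulties[i])))
        ability += 0.5 * (y - p)                          # gradient of log-likelihood
    return ability, asked
```

Items far above or below the current ability estimate contribute little information, so they sink in the queue and can be skipped when the evaluation budget runs out.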

Please let us know if you have any comments or feedback.

Addresses #3323

yifanmai commented Mar 4, 2025

Thanks! Overall this looks good and I think we can merge it with some changes.

Questions:

  • What is the positioning of this tool? Is it meant to only be used to reproduce the paper's results, or to run new models on the supported benchmarks, or to support all benchmarks in HELM eventually? What level of experimental vs production-ready is it?
  • Currently the model ability and instance ability parameters are unspecified. Are you planning to provide the parameters from the paper as a dataset?
  • For the initial model ability parameter for new models, are you planning to provide parameters for new models, or should the users should provide the parameters themselves, or should the initial model ability be set to some constant?
  • Is there going to be a separate tool to run the EM for the model ability and instance difficulty parameters for new models and datasets, or will the parameters be only available for the models and datasets from the original paper?

High level feedback:

  • Changes to core framework code should be minimized, to avoid increasing software complexity and maintenance costs, and to minimize the risk of breakages for existing users. In particular, you should subclass Runner and make your changes there, rather than changing Runner itself. Users can use your runner subclass using the --runner-class parameter.
  • If this is an experimental tool, it should be made clear to users that it may be unsupported or may not work for all use cases. I'm not a fan of naming this tool "adaptive" for this reason - it is a very generic name, and I prefer names that are derived from the paper title or method name.
  • Unfortunately AIR-Bench is not a good application for this, because the objective of that dataset is to give a score for each safety category, whereas the adaptive method only aims to estimate the top level score. You could get unreliable or missing category score estimates if the adaptive sampler mostly or entirely skips over certain categories. I suppose the dataset selection cannot be changed easily at this point.
  • Your code assumes that the first metric in the list of metrics is the main accuracy metric. I'm not sure if this is true for all your selected datasets, but it is not true generally for all datasets.

In terms of next steps, we should resolve the questions in the section above, and then you can let me know when this is ready for a full code review, and I'll send you requested changes. Alternatively, you can set up some time to do some synchronous pair programming if you would prefer.

@sangttruong

Hi @yifanmai! Thank you so much for your comment. We addressed most of them in the latest commit and included the reply to your query below.

Questions

What is the positioning of this tool? Is it meant to be used only to reproduce the paper's results, to run new models on the supported benchmarks, or to support all benchmarks in HELM eventually? What level of experimental vs production-ready is it?

We aim to make the tool production-ready and support most benchmarks in HELM.

Currently, the model ability and instance ability parameters are unspecified. Are you planning to provide the parameters from the paper as a dataset?

We will upload the difficulty parameters to HuggingFace.

For the initial model ability parameter for new models, are you planning to provide parameters for new models, or should the users provide the parameters themselves, or should the initial model ability be set to some constant?

We plan to set a default initial value of model ability (e.g., 0). The users can also set it (e.g., when they have good prior knowledge about the model's ability).
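For illustration, a minimal sketch of what such an ability update could look like under a Rasch (1PL) model, starting from the default ability of 0 (the function and parameter names are hypothetical, not part of this PR):

```python
import math

def update_ability(difficulties, responses, ability=0.0, lr=0.5, steps=100):
    """Maximum-likelihood ability estimate under a Rasch (1PL) model via
    gradient ascent on the log-likelihood, starting from a default of 0.0."""
    for _ in range(steps):
        grad = 0.0
        for b, y in zip(difficulties, responses):
            p = 1.0 / (1.0 + math.exp(-(ability - b)))  # P(correct | ability, b)
            grad += y - p                               # d log-likelihood / d ability
        ability += lr * grad / len(difficulties)
    return ability
```

A user with good prior knowledge would simply pass a different starting `ability` instead of 0.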

Is there going to be a separate tool to run the EM for the model ability and instance difficulty parameters for new models and datasets, or will the parameters be only available for the models and datasets from the original paper?

We will provide a separate Python package to compute the calibration and determine the difficulties of the questions of new datasets.

High-level feedback

Changes to core framework code should be minimized to avoid increasing software complexity and maintenance costs and to minimize the risk of breakages for existing users. In particular, you should subclass Runner and make your changes there rather than changing Runner itself. Users can use your runner subclass using the --runner-class parameter.

We have subclassed Runner as ReevalRunner in a separate file, reeval_runner.py, and run the code using --runner-class-name helm.benchmark.reeval_runner.ReevalRunner. The original Runner code now stays unchanged.
We also added two parameters, reeval_mode: bool = False and reeval_max_samples: int = 50, to src/helm/benchmark/run_spec.py. We then create some files for each dataset. Using AIR-Bench as an example: we create run_entries_adaptive_air_bench.conf and add the function get_adaptive_air_bench_2024_spec to air_bench_run_specs.py.

If this is an experimental tool, it should be made clear to users that the tool may be unsupported or may not work for all use cases generally. I'm not a fan of naming this tool "adaptive" for this reason - it is a very generic name, and I prefer names that are derived from the paper title or method name.

We have changed all the "adaptive" names in the code to "reeval" (reliable and efficient evaluation), which is an abbreviation of our method name. We can be very flexible with the naming. We will note in the documentation which benchmarks are supported by the tool.

Unfortunately, AIR-Bench is not a good application for this because the objective of that dataset is to give a score for each safety category, whereas the adaptive method only aims to estimate the top-level score. You could get unreliable or missing category score estimates if the adaptive sampler mostly or entirely skips over certain categories. I suppose the dataset selection cannot be changed easily at this point.

Besides AIR-Bench, we implement our method on MMLU, which should not have the above problem.
To run MMLU using ReevalRunner:

cd src
export SUITE_NAME=subclass_reeval_mmlu_simple_model1
export MODELS_TO_RUN=simple/model1
export RUN_ENTRIES_CONF_PATH=helm/benchmark/presentation/run_entries_reeval_mmlu.conf
export SCHEMA_PATH=schema_mmlu.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=10000
export PRIORITY=4
helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN --runner-class-name helm.benchmark.reeval_runner.ReevalRunner

Your code assumes that the first metric in the list of metrics is the main accuracy metric. I'm not sure if this is true for all your selected datasets, but it is not generally true for all datasets.

We now use the run_spec to find the main metric name and search the list of metrics for it:

    scenario_metric_name = run_spec.metric_specs[0].args["names"][0]
    scenario_metric_value = [s for s in per_instance_stat[0].stats if s.name.name == scenario_metric_name][0].mean
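To make the lookup concrete, here is a self-contained sketch with minimal stand-in classes (the real HELM RunSpec, MetricSpec, and Stat types carry more fields; the shapes here are assumptions for illustration):

```python
from dataclasses import dataclass

# Minimal stand-ins for the HELM structures involved (shapes assumed).
@dataclass
class MetricName:
    name: str

@dataclass
class Stat:
    name: MetricName
    mean: float

@dataclass
class MetricSpec:
    args: dict

@dataclass
class RunSpec:
    metric_specs: list

@dataclass
class PerInstanceStats:
    stats: list

run_spec = RunSpec(metric_specs=[MetricSpec(args={"names": ["exact_match"]})])
per_instance_stat = [PerInstanceStats(stats=[Stat(MetricName("exact_match"), 1.0)])]

# The lookup: main metric name from the run spec, then the matching per-instance stat.
scenario_metric_name = run_spec.metric_specs[0].args["names"][0]
scenario_metric_value = [
    s for s in per_instance_stat[0].stats if s.name.name == scenario_metric_name
][0].mean
```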

I would appreciate any suggestions you might have for further improvement.

In terms of the next steps, we should resolve the questions in the section above. Then you can let me know when this is ready for a full code review, and I'll send you the requested changes. Alternatively, you can set up some time to do some synchronous pair programming if you would prefer.

Yes, I will find a time in your calendar so we can meet for pair programming. In the meantime, please let me know if you have any comments or feedback.

@yifanmai yifanmai left a comment

Thanks for the responses. I think your plan makes sense. Left you more detailed comments - we can merge after these are addressed.

@yifanmai

Great, this approach looks good. I noticed there are still some missing pieces in the current implementation; please let me know when those are ready for review.

@sangttruong sangttruong marked this pull request as ready for review March 21, 2025 14:27
@sangttruong sangttruong requested a review from yifanmai March 22, 2025 07:21
@yifanmai yifanmai changed the title 3323 adaptive evaluation Adaptive evaluation Mar 25, 2025
@sangttruong sangttruong requested a review from yifanmai April 1, 2025 19:06
@yifanmai yifanmai left a comment


Thanks! Looks good but you have to fix the type errors before we can merge.

docs/reeval.md Outdated
@@ -0,0 +1,32 @@
# Reliable and Efficient Amortized Model-based Evaluation

Reliable and Efficient Amortized Model-based Evaluation (Reeval) is an extension of the HELM framework for using Computerized Adaptive Testing (CAT) within the framework of Item Response Theory (IRT) to adaptively evaluate Large Language Models (LLMs). This approach selects the next question whose difficulty is closest to the estimated model ability, thereby reliably and efficiently eliciting the model's ability. The difficulties of the questions are provided on HuggingFace: [`stair-lab/reeval-difficulty-for-helm`](https://huggingface.co/datasets/stair-lab/reeval-difficulty-for-helm), which currently supports 22 scenarios in HELM. The paper's authors will supply a Python package for calculating these difficulties and will support more scenarios in the future.

nit: "(Reeval)" -> "(REEval)"


if iteration > 0:
    prev_ability = ability.clone()
    prev_loss = loss.clone()

Error:

src/helm/benchmark/reeval_runner.py:127: error: "float" has no attribute "clone"  [attr-defined]

loss is a float rather than a tensor (doc), so:

prev_loss = loss
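The reasoning behind this fix, illustrated without torch: a Python float (e.g. the result of tensor.item()) is immutable, so plain assignment already snapshots the value, whereas a tensor is mutable and needs .clone():

```python
# Floats are immutable: rebinding `loss` later cannot change `prev_loss`,
# so no .clone() is needed for the snapshot.
loss = 0.75          # e.g. the float returned by tensor.item()
prev_loss = loss     # plain assignment is a safe copy
loss = 0.50          # a later iteration rebinds the name
assert prev_loss == 0.75
```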

@sangttruong sangttruong requested a review from yifanmai April 2, 2025 15:01
@yifanmai yifanmai merged commit ef9772f into stanford-crfm:main Apr 2, 2025
8 checks passed
yifanmai commented Apr 2, 2025

Merged. Thanks for your hard work!
