@MiguelAFH MiguelAFH commented Mar 27, 2025

MedHELM contains several open-ended benchmarks whose current main metric is BERTScore. While this metric is widely accepted in the literature, we want to make our evaluation metrics more robust.

To do so, we propose using LLM-as-a-judge for the open-ended benchmarks, with standardized evaluation criteria for each scenario and judges from multiple vendors to reduce bias.

Default judges:

  • GPT-4o
    • model_name="openai/gpt-4o-2024-05-13"
    • model_deployment="stanfordhealthcare/gpt-4o-2024-05-13"
  • Llama 3.3 70B Instruct
    • model_name="meta/llama-3.3-70b-instruct"
    • model_deployment="stanfordhealthcare/llama-3.3-70b-instruct"
  • Claude 3.7 Sonnet
    • model_name="anthropic/claude-3-7-sonnet-20250219"
    • model_deployment="stanfordhealthcare/claude-3-7-sonnet-20250219"
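The three default judges above can live in a single registry keyed by a short vendor alias, which is what the `ANNOTATOR_MODELS` dict later in the diff appears to do. A minimal sketch; the `AnnotatorModelInfo` dataclass here is a stand-in for HELM's actual class, with its two fields assumed from the name/deployment pairs listed above:

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical stand-in for HELM's AnnotatorModelInfo; the two fields
# are assumed from the model_name/model_deployment pairs in the PR.
@dataclass(frozen=True)
class AnnotatorModelInfo:
    model_name: str
    model_deployment: str

# Judge registry keyed by a short vendor alias.
ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {
    "gpt": AnnotatorModelInfo(
        model_name="openai/gpt-4o-2024-05-13",
        model_deployment="stanfordhealthcare/gpt-4o-2024-05-13",
    ),
    "llama": AnnotatorModelInfo(
        model_name="meta/llama-3.3-70b-instruct",
        model_deployment="stanfordhealthcare/llama-3.3-70b-instruct",
    ),
    "claude": AnnotatorModelInfo(
        model_name="anthropic/claude-3-7-sonnet-20250219",
        model_deployment="stanfordhealthcare/claude-3-7-sonnet-20250219",
    ),
}
```

Keeping the judges in one dict makes it straightforward to iterate over all vendors when annotating, and to swap deployments per environment later.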

Adding @suhana13 @aunell @HennyJie for FYI.

@MiguelAFH MiguelAFH requested a review from yifanmai March 27, 2025 18:41
@MiguelAFH MiguelAFH self-assigned this Mar 27, 2025

@yifanmai yifanmai left a comment


Looks good at a high level. Left some optional suggestions; feel free to address in this pull request, or merge and open a new request.

hlog("WARNING: Annotator skipped sending requests because the model response was empty")
return {
    "prompt_text": None,
    "empty_output_equivalence_judgement": False,

Probably want to rename this key.

    prompt_template: str,
    annotation_criteria: Dict[str, Set[str]],
    annotator_models: Dict[str, AnnotatorModelInfo],
    preprocessor: Optional[Callable[[str], str]] = None,

nit: output_preprocessor

# Attempt to fix incomplete JSON by adding a closing brace
annotator_output = annotator_output + "}"
try:
    annotator_criteria = json.loads(annotator_output)

annotation_criteria -> annotation?
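The repair step in the excerpt above (appending a closing brace when a judge returns truncated JSON) could also be wrapped in a small helper. A sketch with a hypothetical name, `parse_judge_output`; unlike the excerpt, it tries a plain parse first and only appends the brace on failure:

```python
import json

def parse_judge_output(annotator_output: str) -> dict:
    """Parse a judge's JSON verdict, tolerating a missing closing brace.

    Tries a plain parse first; on failure, attempts to fix incomplete
    JSON by adding a closing brace, as the PR's annotator does.
    """
    try:
        return json.loads(annotator_output)
    except json.JSONDecodeError:
        return json.loads(annotator_output + "}")
```

Trying the plain parse first avoids corrupting output that was already valid JSON.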

"""
self._auto_client = auto_client
self._prompt_template = prompt_template
self._annotation_criteria = annotation_criteria

nit: annotation_schema or annotation_format

metric_args = {
    "task": "mtsamples_replicate",
    "device": get_torch_device_name(),
    "bertscore_model": "distilbert-base-uncased",
    "rescale_with_baseline": False,
}

metric_specs = get_summarization_metric_specs(metric_args) + [
    MetricSpec(class_name="helm.benchmark.metrics.mtsamples_replicate_metrics.MTSamplesReplicateMetric", args={})

optional nit: you can omit args={} in most specs.

    metric_service: MetricService,
    eval_cache_path: str,
) -> List[Stat]:
    assert request_state.annotations

Is there some way that the commonality between these metrics could be refactored out? See AnnotationLikertScaleMetric here for an example that you could follow.

"clarity": {"score", "explanation"},
}

ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {

If you need to switch between different predefined deployments in different environments, you could make an environment enum (e.g. "shc", "medical") and then do one of the below:

  1. Add a parameter to the run spec function that specifies an environment enum value
  2. Read in a shell environment variable to get the environment enum value
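Option 2 above could look roughly like this; `DeploymentEnvironment` and the `HELM_DEPLOYMENT_ENV` variable name are hypothetical, as is the exact set of enum values:

```python
import os
from enum import Enum

class DeploymentEnvironment(Enum):
    """Predefined deployment environments (hypothetical values)."""
    SHC = "shc"
    MEDICAL = "medical"

def get_environment(
    default: DeploymentEnvironment = DeploymentEnvironment.SHC,
) -> DeploymentEnvironment:
    """Resolve the deployment environment from a shell variable.

    Falls back to `default` when HELM_DEPLOYMENT_ENV is unset.
    """
    value = os.environ.get("HELM_DEPLOYMENT_ENV")
    return DeploymentEnvironment(value) if value else default
```

The run spec function could then pick the matching `ANNOTATOR_MODELS` table for the resolved environment instead of hard-coding the `stanfordhealthcare/` deployments.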

@yifanmai commented Apr 1, 2025

Ping @MiguelAFH - should we merge this as is?

@MiguelAFH (Author)

@yifanmai Let's merge as is for now and I will update in a subsequent PR. We have been running things since last week, so I haven't had a chance to address the comments yet.

@MiguelAFH MiguelAFH merged commit 77268ed into main Apr 1, 2025
8 checks passed
@MiguelAFH MiguelAFH deleted the medhelm-judges branch April 1, 2025 22:13