@MiguelAFH MiguelAFH commented Mar 27, 2025

MedHELM contains several open-ended benchmarks whose current main metric is BERTScore. While this metric is widely accepted in the literature, we want to make our evaluation metrics more robust.

To do so, we propose using LLM-as-a-judge for the open-ended benchmarks, with standardized evaluation criteria for each scenario and judges from multiple vendors to reduce bias.

Default judges:

  • GPT-4o
    • model_name="openai/gpt-4o-2024-05-13"
    • model_deployment="stanfordhealthcare/gpt-4o-2024-05-13"
  • Llama 3.3 70B Instruct
    • model_name="meta/llama-3.3-70b-instruct"
    • model_deployment="stanfordhealthcare/llama-3.3-70b-instruct"
  • Claude 3.7 Sonnet
    • model_name="anthropic/claude-3-7-sonnet-20250219"
    • model_deployment="stanfordhealthcare/claude-3-7-sonnet-20250219"
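The three default judges above can live in a single registry keyed by a short vendor alias, which is what the `ANNOTATOR_MODELS` dict later in the diff appears to do. A minimal sketch; the `AnnotatorModelInfo` dataclass here is a stand-in for HELM's actual class, with its two fields assumed from the name/deployment pairs listed above:

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical stand-in for HELM's AnnotatorModelInfo; the two fields
# are assumed from the model_name/model_deployment pairs in the PR.
@dataclass(frozen=True)
class AnnotatorModelInfo:
    model_name: str
    model_deployment: str

# Judge registry keyed by a short vendor alias.
ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {
    "gpt": AnnotatorModelInfo(
        model_name="openai/gpt-4o-2024-05-13",
        model_deployment="stanfordhealthcare/gpt-4o-2024-05-13",
    ),
    "llama": AnnotatorModelInfo(
        model_name="meta/llama-3.3-70b-instruct",
        model_deployment="stanfordhealthcare/llama-3.3-70b-instruct",
    ),
    "claude": AnnotatorModelInfo(
        model_name="anthropic/claude-3-7-sonnet-20250219",
        model_deployment="stanfordhealthcare/claude-3-7-sonnet-20250219",
    ),
}
```

Keeping the judges in one dict makes it straightforward to iterate over all vendors when annotating, and to swap deployments per environment later.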

Adding @suhana13 @aunell @HennyJie for FYI.

@MiguelAFH MiguelAFH requested a review from yifanmai March 27, 2025 18:41
@MiguelAFH MiguelAFH self-assigned this Mar 27, 2025

@yifanmai yifanmai left a comment


Looks good at a high level. Left some optional suggestions; feel free to address in this pull request, or merge and open a new request.

hlog("WARNING: Annotator skipped sending requests because the model response was empty")
return {
    "prompt_text": None,
    "empty_output_equivalence_judgement": False,

Probably want to rename this key.

    prompt_template: str,
    annotation_criteria: Dict[str, Set[str]],
    annotator_models: Dict[str, AnnotatorModelInfo],
    preprocessor: Optional[Callable[[str], str]] = None,

nit: output_preprocessor

# Attempt to fix incomplete JSON by adding a closing brace
annotator_output = annotator_output + "}"
try:
    annotator_criteria = json.loads(annotator_output)

annotation_criteria -> annotation?
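The repair step in the excerpt above (appending a closing brace when a judge returns truncated JSON) could also be wrapped in a small helper. A sketch with a hypothetical name, `parse_judge_output`; unlike the excerpt, it tries a plain parse first and only appends the brace on failure:

```python
import json

def parse_judge_output(annotator_output: str) -> dict:
    """Parse a judge's JSON verdict, tolerating a missing closing brace.

    Tries a plain parse first; on failure, attempts to fix incomplete
    JSON by adding a closing brace, as the PR's annotator does.
    """
    try:
        return json.loads(annotator_output)
    except json.JSONDecodeError:
        return json.loads(annotator_output + "}")
```

Trying the plain parse first avoids corrupting output that was already valid JSON.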

"""
self._auto_client = auto_client
self._prompt_template = prompt_template
self._annotation_criteria = annotation_criteria

nit: annotation_schema or annotation_format

metric_args = {
    "task": "mtsamples_replicate",
    "device": get_torch_device_name(),
    "bertscore_model": "distilbert-base-uncased",
    "rescale_with_baseline": False,
}

metric_specs = get_summarization_metric_specs(metric_args) + [
    MetricSpec(class_name="helm.benchmark.metrics.mtsamples_replicate_metrics.MTSamplesReplicateMetric", args={})

optional nit: you can omit args={} in most specs.

    metric_service: MetricService,
    eval_cache_path: str,
) -> List[Stat]:
    assert request_state.annotations

Is there some way that the commonality between these metrics could be refactored out? See AnnotationLikertScaleMetric here for an example that you could follow.

"clarity": {"score", "explanation"},
}

ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {

If you need to switch between different predefined deployments in different environments, you could make an environment enum (e.g. "shc", "medical") and then do one of the below:

  1. Add a parameter to the run spec function that specifies an environment enum value
  2. Read in a shell environment variable to get the environment enum value
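Option 2 above could look roughly like this; `DeploymentEnvironment` and the `HELM_DEPLOYMENT_ENV` variable name are hypothetical, as is the exact set of enum values:

```python
import os
from enum import Enum

class DeploymentEnvironment(Enum):
    """Predefined deployment environments (hypothetical values)."""
    SHC = "shc"
    MEDICAL = "medical"

def get_environment(
    default: DeploymentEnvironment = DeploymentEnvironment.SHC,
) -> DeploymentEnvironment:
    """Resolve the deployment environment from a shell variable.

    Falls back to `default` when HELM_DEPLOYMENT_ENV is unset.
    """
    value = os.environ.get("HELM_DEPLOYMENT_ENV")
    return DeploymentEnvironment(value) if value else default
```

The run spec function could then pick the matching `ANNOTATOR_MODELS` table for the resolved environment instead of hard-coding the `stanfordhealthcare/` deployments.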

@yifanmai commented Apr 1, 2025

Ping @MiguelAFH - should we merge this as is?

@MiguelAFH (Author)

@yifanmai Let's merge as is for now and I will update in a subsequent PR. We have been running things since last week, so I haven't had a chance to address the comments yet.

@MiguelAFH MiguelAFH merged commit 77268ed into main Apr 1, 2025
8 checks passed
@MiguelAFH MiguelAFH deleted the medhelm-judges branch April 1, 2025 22:13