Feat: Add accuracy evaluation for LLMs (GPQA, AIME, HLE etc.) #4

@nvzhihanj

Description

Add datasets, post-processing scripts, and an environment (likely Docker) to evaluate accuracy based on the outputs collected from the inference endpoints.
List of accuracy evals to add:

  • GPQA
  • MMLU, MMLU-Pro
  • AIME
  • MATH500
  • HLE
  • Health Bench
  • TBD
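The post-processing step described above could start as something like this minimal sketch for multiple-choice benchmarks such as GPQA or MMLU (the function names and the answer-format regex are assumptions for illustration, not part of this repo):

```python
import re
from typing import Optional

def extract_choice(output: str) -> Optional[str]:
    """Pull a final A-D answer letter from a model response.

    Assumes responses end with something like "Answer: C"; the
    pattern is a guess at common output formats, not a spec.
    """
    match = re.search(r"[Aa]nswer\s*[:\-]?\s*\(?([A-D])\)?", output)
    if match:
        return match.group(1).upper()
    # Fall back to the last standalone A-D letter in the text.
    letters = re.findall(r"\b([A-D])\b", output)
    return letters[-1] if letters else None

def accuracy(responses: list[str], references: list[str]) -> float:
    """Fraction of responses whose extracted choice matches the reference."""
    correct = sum(
        1 for resp, ref in zip(responses, references)
        if extract_choice(resp) == ref.upper()
    )
    return correct / len(references) if references else 0.0
```

Free-form benchmarks (AIME, MATH500, HLE) would need their own extractors and equivalence checks, so the extractor would likely be selected per dataset rather than shared.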

Metadata

Assignees

Labels

  • area: evaluation (Accuracy evaluation, scoring, extractors)
  • priority: ShowStopper (Drop everything: critical blocker, all hands on deck)
  • type: feature (New feature or capability)

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests