- SimpleQA: a benchmark that measures the ability of language models to answer short, fact-seeking questions.
  · (cdn.openai) · (simple-evals - openai)
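As an illustration of what "short, fact-seeking" grading involves, below is a minimal sketch of a SimpleQA-style grader based on string normalization and exact match. This is a hypothetical helper, not the official simple-evals implementation (which grades answers with a model-based judge); the function name and three-way labels are assumptions for the sketch.

```python
def grade_answer(predicted: str, gold: str) -> str:
    """Toy SimpleQA-style grader (hypothetical, not the official simple-evals grader).

    Returns one of three labels: 'correct', 'incorrect', or 'not_attempted'.
    """
    def normalize(text: str) -> str:
        # Lowercase, strip surrounding whitespace and a trailing period,
        # and collapse internal whitespace.
        return " ".join(text.lower().strip().rstrip(".").split())

    if not predicted.strip():
        return "not_attempted"
    return "correct" if normalize(predicted) == normalize(gold) else "incorrect"
```

An exact-match grader like this is brittle on paraphrases, which is why the benchmark's actual grading relies on an LLM judge rather than string comparison.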
- Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance, arXiv, 2410.18889, arxiv, pdf, cication: -1
  Omer Nahum, Nitay Calderon, Orgad Keller, ..., Idan Szpektor, Roi Reichart
- 🎬 How to Construct Domain Specific LLM Evaluation Systems: Hamel Husain and Emil Sedgh
- LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory, arXiv, 2410.10813, arxiv, pdf, cication: -1
  Di Wu, Hongwei Wang, Wenhao Yu, ..., Kai-Wei Chang, Dong Yu
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution, arXiv, 2410.16256, arxiv, pdf, cication: -1
  Maosong Cao, Alexander Lam, Haodong Duan, ..., Songyang Zhang, Kai Chen · (CompassJudger - open-compass)
- UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models, arXiv, 2410.14059, arxiv, pdf, cication: -1
  Yuzhe Yang, Yifei Zhang, Yan Hu, ..., Honghai Yu, Benyou Wang
- MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures, arXiv, 2410.13754, arxiv, pdf, cication: -1
  Jinjie Ni, Yifan Song, Deepanway Ghosal, ..., Yang You, Michael Shieh
- JudgeBench: A Benchmark for Evaluating LLM-based Judges, arXiv, 2410.12784, arxiv, pdf, cication: -1
  Sijun Tan, Siyuan Zhuang, Kyle Montgomery, ..., Raluca Ada Popa, Ion Stoica · (JudgeBench - ScalerLab)
- Large Language Model Evaluation via Matrix Nuclear-Norm, arXiv, 2410.10672, arxiv, pdf, cication: -1
  Yahan Li, Tingyu Xia, Yi Chang, ..., Yuan Wu · (MatrixNuclearNorm - MLGroupJLU)
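The quantity in the entry above is standard linear algebra: the nuclear norm of a matrix is the sum of its singular values. A minimal NumPy sketch (illustrative only; not the paper's evaluation pipeline, which applies such norms to model representations):

```python
import numpy as np

def nuclear_norm(matrix: np.ndarray) -> float:
    """Nuclear norm of a matrix: the sum of its singular values."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return float(singular_values.sum())
```

Equivalently, `np.linalg.norm(matrix, ord="nuc")` computes the same value directly.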
- 🌟 simple-evals - openai