- SimpleQA: a benchmark that measures the ability of language models to answer short, fact-seeking questions.
  · (cdn.openai) · (simple-evals - openai)
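As an illustration of what "short, fact-seeking" grading involves, below is a minimal sketch of a SimpleQA-style grader based on string normalization and exact match. This is a hypothetical helper, not the official simple-evals implementation (which grades answers with a model-based judge); the function name and three-way labels are assumptions for the sketch.

```python
def grade_answer(predicted: str, gold: str) -> str:
    """Toy SimpleQA-style grader (hypothetical, not the official simple-evals grader).

    Returns one of three labels: 'correct', 'incorrect', or 'not_attempted'.
    """
    def normalize(text: str) -> str:
        # Lowercase, strip surrounding whitespace and a trailing period,
        # and collapse internal whitespace.
        return " ".join(text.lower().strip().rstrip(".").split())

    if not predicted.strip():
        return "not_attempted"
    return "correct" if normalize(predicted) == normalize(gold) else "incorrect"
```

An exact-match grader like this is brittle on paraphrases, which is why the benchmark's actual grading relies on an LLM judge rather than string comparison.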
- Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance, arXiv, 2410.18889, arxiv, pdf, cication: -1
  Omer Nahum, Nitay Calderon, Orgad Keller, ..., Idan Szpektor, Roi Reichart
- 🎬 How to Construct Domain Specific LLM Evaluation Systems: Hamel Husain and Emil Sedgh
- LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory, arXiv, 2410.10813, arxiv, pdf, cication: -1
  Di Wu, Hongwei Wang, Wenhao Yu, ..., Kai-Wei Chang, Dong Yu
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution, arXiv, 2410.16256, arxiv, pdf, cication: -1
  Maosong Cao, Alexander Lam, Haodong Duan, ..., Songyang Zhang, Kai Chen · (CompassJudger - open-compass)
- UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models, arXiv, 2410.14059, arxiv, pdf, cication: -1
  Yuzhe Yang, Yifei Zhang, Yan Hu, ..., Honghai Yu, Benyou Wang
- MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures, arXiv, 2410.13754, arxiv, pdf, cication: -1
  Jinjie Ni, Yifan Song, Deepanway Ghosal, ..., Yang You, Michael Shieh
- JudgeBench: A Benchmark for Evaluating LLM-based Judges, arXiv, 2410.12784, arxiv, pdf, cication: -1
  Sijun Tan, Siyuan Zhuang, Kyle Montgomery, ..., Raluca Ada Popa, Ion Stoica · (JudgeBench - ScalerLab)
- Large Language Model Evaluation via Matrix Nuclear-Norm, arXiv, 2410.10672, arxiv, pdf, cication: -1
  Yahan Li, Tingyu Xia, Yi Chang, ..., Yuan Wu · (MatrixNuclearNorm - MLGroupJLU)
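The quantity in the entry above is standard linear algebra: the nuclear norm of a matrix is the sum of its singular values. A minimal NumPy sketch (illustrative only; not the paper's evaluation pipeline, which applies such norms to model representations):

```python
import numpy as np

def nuclear_norm(matrix: np.ndarray) -> float:
    """Nuclear norm of a matrix: the sum of its singular values."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return float(singular_values.sum())
```

Equivalently, `np.linalg.norm(matrix, ord="nuc")` computes the same value directly.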
- 🌟 simple-evals - openai