A repository that aggregates commonly used benchmarks for testing LLMs and multimodal models across different modalities and functionalities.
- Chatbot Arena Leaderboard - an open source platform for evaluating AI through human preference, developed by researchers at UC Berkeley
- Open LLM Leaderboard - hosted by HuggingFace, a leaderboard testing open source models on various benchmarks.
- ProLLM Leaderboard - hosts leaderboards evaluating coding assistance, summarization, judging capabilities, entity extraction, and many more
- Judge Arena - a community-voted leaderboard of LLM-as-a-judge capabilities, made by Atla
- Aider LLM Leaderboards - hosted by Aider, a pair programming solution. Has leaderboards for Code Editing and Code Refactoring using LLMs
- Can AI Code Leaderboard - hosted on HuggingFace, a leaderboard summarizing the results of the CanAICode test suite
- BigCodeBench leaderboard - hosted on HuggingFace, leaderboard summarizing the results of the BigCodeBench benchmark.
- GPQA - Graduate-Level Google-Proof Q&A Benchmark, designed to be more challenging than MMLU. Consists of 448 multiple-choice questions in biology, physics, and chemistry. Human PhD experts achieve 65% accuracy, while non-experts with unrestricted access to the internet reach only 34%.
- MMLU - Massive Multitask Language Understanding, testing undergraduate-level knowledge. Published in 2021, it is one of the most commonly used benchmarks. GPT-4o-level models hover around 88%. While not without flaws, its results are commonly referenced when discussing the performance of AI models.
- BIG-Bench-Hard - a benchmark of mixed evaluations, an extension of BIG-Bench. Out of the original 204 tasks, the 23 most challenging were chosen for this benchmark; many of them require multi-step reasoning.
- DROP - Discrete Reasoning Over Paragraphs, a crowdsourced benchmark of 96k questions. Requires the system to perform discrete operations such as addition, counting, and sorting. Questions are posed over passages extracted from Wikipedia articles.
- HellaSwag - short for "Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations". Introduced in 2019, it is a commonsense-reasoning benchmark designed to be challenging for LLMs but trivial for humans (>95% accuracy).
- WinoGrande - a large-scale dataset of 44k pronoun-resolution problems inspired by the Winograd Schema Challenge, introduced in 2019 to test commonsense reasoning.
- ARC, the AI2 Reasoning Challenge - not to be confused with ARC-AGI. A multiple-choice question-answering dataset containing questions from science exams for grades 3 to 9. Split into two parts, Easy and Challenge, the latter requiring more complex reasoning.
- TruthfulQA - a benchmark measuring whether a language model is truthful in answering questions. Contains 817 interdisciplinary questions spanning areas including health, law, finance, and politics.
- MT-Bench - contains 3.3k expert-level human preferences for model responses generated by 6 models in response to 80 MT-Bench questions. Introduced in the LLM-as-a-judge paper in 2023.
- LiveBench - a benchmarking initiative created in June 2024. It started with 17 diverse tasks, with new, harder tasks added over time.
- AIME - a benchmark built from problems of the American Invitational Mathematics Examination, a challenging high school mathematics competition
- Simple Bench - a multiple-choice text benchmark for LLMs whose questions require only high-school-level knowledge, yet a non-specialized human baseline outperforms SOTA models. Over 200 questions covering trick questions, social intelligence, and spatio-temporal reasoning.
- GLUE - General Language Understanding Evaluation, a collection of nine sentence- and sentence-pair language understanding tasks built on established datasets, covering a diverse range of genres and degrees of difficulty
- GAIA - a benchmark published by researchers from Meta (including Yann LeCun), which measures reasoning, multi-modality handling, web browsing, and tool use. Like HellaSwag, it is designed to be easy for humans but challenging for general AI assistants.
- ARC-AGI - created by François Chollet (creator of the Keras library) to measure the ability of AI systems to solve novel reasoning problems. The benchmark also comes with a prize of 1 million US dollars for beating it.
- MMMU - "Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI", a benchmark that covers disciplines such as engineering, art and design, science in general, humanities and social studies, business cases, and medicine. It includes images with diagrams, tables, plots and charts, photographs, chemical structures, and many more, covering cases where text and images are mixed.
- MathVista (testmini) - a benchmark combining challenges from diverse mathematical and visual tasks. It consists of 6,141 examples drawn from 28 existing multimodal datasets involving mathematics.
- Chart Q&A - contains a variety of complex questions about charts, requiring both visual and logical reasoning
- AI2D - contains over 5,000 grade school science diagrams with over 150,000 rich annotations, their ground truth syntactic parses, and more than 15,000 multiple-choice questions
- ANLS and ANLS* - Average Normalized Levenshtein Similarity and its generalization, a metric used to score a wide variety of document tasks, including information extraction and classification from documents (a minimal sketch of the metric follows this list)
- HumanEval - used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. Created in 2021, it was one of the most commonly run benchmarks in the 2023-2024 period. Currently GPT-4 level or better models score above 90%, necessitating the creation and use of better benchmarks for the frontier models (a sketch of the pass@k estimator used for scoring follows this list).
- MathQA-Python - a Python version of the MathQA benchmark, introduced by Google Research; contains 23,914 problems that evaluate the ability of models to synthesize code from more complex textual descriptions
- MBPP - Mostly Basic Python Problems, a benchmark consisting of around a thousand crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem consists of a task description, a code solution, and 3 automated test cases.
- BigCodeBench - a dataset of 1.14k problems and solutions related to coding in Python. Published on 7th of October 2024.
- BFCL - Berkeley Function Calling Leaderboard, which evaluates the ability of different LLMs to call functions/tools. Covers basic use cases as well as agentic workflows (an illustrative call-matching check follows this list).
- MetaTool - a benchmark designed to assess whether LLMs use the correct tools for a given situation. Includes the ToolE dataset, which contains prompts triggering single- and multi-tool use.
- API-Bank - a runnable evaluation system for tool-augmented LLMs consisting of 73 APIs, published in 2023
- WorkArena++ - a suite of browser-based tasks measuring web agent performance; also part of BrowserGym, which bundles many smaller agentic benchmarks
- MATH - a dataset of 12,500 challenging competition mathematics problems. Each problem has a step-by-step solution that can be used to teach models to generate answer derivations and explanations.
- MGSM - Multilingual Grade School Math Benchmark is a collection of grade-school math problems. Consists of 250 problems from GSM8K translated across 10 languages.
- GSM8K - introduced in 2021, a dataset of 8.5k high-quality grade-school math word problems. Each problem takes between 2 and 8 steps to solve, requiring a sequence of basic arithmetic operations to reach the final answer (a short loading/parsing sketch follows this list).
- Putnam Bench - a multi-language benchmark (formalizations in proof languages such as Lean 4, Isabelle, and Coq) for evaluating the ability to solve competition mathematics problems. Includes 1692 formalizations of 640 college-level problems from the William Lowell Putnam Mathematical Competition.
- LegalBench - a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. Built by subject-matter experts from many highly esteemed universities, it measures legal reasoning capabilities that are useful to lawyers as well as reasoning skills that lawyers find interesting.
- LAB-Bench - Language Agent Biology Benchmark, evaluates LLM capabilities in literature search, protocol planning, and data analysis
- KLEJ - stands for Kompleksowa Lista Ewaluacji Językowych ("Comprehensive List of Language Evaluations", with "klej" also meaning "glue" in Polish), the Polish variant of the aforementioned GLUE benchmark, made by Allegro
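
As a reference for the ANLS entry above, here is a minimal sketch of the Average Normalized Levenshtein Similarity computation as it is commonly defined for document VQA and extraction tasks; the 0.5 threshold and the per-question maximum over ground-truth answers follow the usual formulation and are not tied to any particular library's implementation.

```python
# Minimal sketch of ANLS (Average Normalized Levenshtein Similarity).
# Threshold and normalization follow the common convention, not a specific library.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if not a:
        return len(b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions, ground_truths, threshold=0.5):
    """predictions: list[str]; ground_truths: list[list[str]] (acceptable answers per question)."""
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, a) / max(len(p), len(a), 1)
            best = max(best, 1.0 - nl)
        # Scores below the threshold are zeroed out (treated as wrong).
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

print(anls(["42 usd"], [["42 USD", "$42"]]))  # -> 1.0
```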
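
For the HumanEval entry, the headline numbers are pass@k scores computed from how many generated samples pass the unit tests. Below is a short sketch of the unbiased pass@k estimator from the HumanEval paper; the sample counts in the usage example are made up for illustration, not real model results.

```python
# Unbiased pass@k estimator (Chen et al., 2021): for one problem, given n
# generated samples of which c pass the unit tests,
#   pass@k = 1 - C(n - c, k) / C(n, k)
# and the benchmark score is the mean over all 164 problems.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative numbers only: 200 samples per task, 37 of which pass.
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))  # noticeably higher, sampling more helps
```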
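
For the BFCL entry, a function-calling evaluation ultimately compares the tool call a model emits against an expected call. The sketch below is only illustrative: the JSON shapes, function name, and arguments are assumptions for the example, not BFCL's actual data format (BFCL itself uses more elaborate matching of the parsed call).

```python
# Illustrative check of a model-emitted tool call against an expected call.
# The schema below is hypothetical, not BFCL's real format.
import json

expected = {"name": "get_weather", "arguments": {"city": "Warsaw", "unit": "celsius"}}

# Pretend this string came back from the model under evaluation.
model_output = '{"name": "get_weather", "arguments": {"city": "Warsaw", "unit": "celsius"}}'

def call_matches(raw: str, want: dict) -> bool:
    """True if the model produced valid JSON naming the right function with the expected arguments."""
    try:
        got = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return got.get("name") == want["name"] and got.get("arguments") == want["arguments"]

print(call_matches(model_output, expected))  # True
```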
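
For the GSM8K entry, a quick way to inspect the 2-8 step problems is to pull the dataset from the Hugging Face Hub. The dataset id ("openai/gsm8k", config "main"), the "question"/"answer" fields, and the trailing "#### <number>" line in each reference solution are assumptions based on the publicly hosted version of the dataset.

```python
# Hedged sketch: load GSM8K from the Hugging Face Hub and extract the reference answer.
from datasets import load_dataset

ds = load_dataset("openai/gsm8k", "main", split="test")

def final_answer(solution: str) -> str:
    # Reference solutions end with a line like "#### 72".
    return solution.split("####")[-1].strip()

sample = ds[0]
print(sample["question"])
print("reference answer:", final_answer(sample["answer"]))
```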
Provided by the team from SpeakLeash:
- Open PL LLM Leaderboard - uses polemo2, KLEJ, polqa, and many more datasets
- EQ-Bench - a Polish emotional intelligence benchmark, using the arena format to gauge the emotional capabilities of LLMs
- MT-Bench - a Polish version of the aforementioned MT-Bench
- Polish Medical Leaderboard - uses the PES 2018-2022 dataset (Polish medical specialization exams) to gauge LLMs' ability to answer medical questions
- CPTUB - Complex Polish Text Understanding Benchmark, evaluates the capability to correctly interpret complex texts, sarcasm, implicatures, and phrases
- Claude 3.5 Sonnet announcement - benchmarks used there: GPQA, MMLU, BIG-Bench-Hard, DROP, MATH, MGSM, GSM8K, HumanEval; multimodal ones: MathVista (testmini), AI2D, Chart Q&A, ANLS
- GPT-4 technical report - benchmarks used there and not mentioned above: HellaSwag, ARC, WinoGrande
- HuggingFace - hosts many of the leaderboards and benchmarks listed above