A repository that aggregates commonly used benchmarks for testing LLMs and multimodal models across different modalities and functionalities.
- Chatbot Arena Leaderboard - an open source platform for evaluating AI through human preference, developed by researchers at UC Berkeley
- Open LLM Leaderboard - hosted by HuggingFace, a leaderboard testing open source models on various benchmarks.
- ProLLM Leaderboard - hosts leaderboards evaluating coding assistance, summarization, judging capabilities, entity extraction, and many more
- Judge Arena - a community-voted leaderboard of LLM-as-a-judge capabilities, made by Atla
- Aider LLM Leaderboards - hosted by Aider, a pair programming solution. Has leaderboards for Code Editing and Code Refactoring using LLMs
- Can AI Code Leaderboard - hosted on HuggingFace, a leaderboard summarizing the results of the CanAICode test suite
- BigCodeBench leaderboard - hosted on HuggingFace, leaderboard summarizing the results of the BigCodeBench benchmark.
- GPQA - Graduate-Level Google-Proof Q&A Benchmark, designed to be more challenging than MMLU. Consists of 448 multiple-choice questions in biology, physics, and chemistry. Human PhD experts achieve 65% accuracy, while non-experts with unrestricted access to the internet reach only 34%.
- MMLU - Massive Multitask Language Understanding, testing undergraduate-level knowledge. Published in 2021, it is one of the most commonly used benchmarks. GPT-4o-level models hover around 88%. While not without flaws, its results are commonly referenced when discussing the performance of AI models.
- BIG-Bench-Hard - a benchmark of mixed evaluations, an extension of BIG-Bench. Out of the original 204 tasks, the 23 most challenging were chosen for this benchmark; many of them require multi-step reasoning.
- DROP - Discrete Reasoning Over Paragraphs, a crowdsourced benchmark of 96k questions. Requires the system to perform discrete operations such as addition, counting, and sorting. Questions are posed over passages extracted from Wikipedia articles.
- HellaSwag - short for "Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations". Introduced in 2019, it is a commonsense-reasoning benchmark designed to be challenging for LLMs but trivial for humans (>95% accuracy).
- WinoGrande - a large-scale dataset of 44k pronoun-resolution problems inspired by the Winograd Schema Challenge, introduced in 2019 to test commonsense reasoning.
- ARC, the AI2 Reasoning Challenge - not to be confused with ARC-AGI. A multiple-choice question-answering dataset containing questions from science exams for grades 3 to 9. Split into two parts, Easy and Challenge, the latter requiring more complex reasoning.
- TruthfulQA - a benchmark measuring whether a language model is truthful in answering questions. Contains 817 interdisciplinary questions spanning areas including health, law, finance, and politics.
- MT-Bench - contains 3.3k expert-level human preferences for model responses generated by 6 models in response to 80 MT-Bench questions. Introduced in the LLM-as-a-judge paper in 2023.
- LiveBench - a benchmarking initiative created in June 2024. It started with 17 diverse tasks, with new, harder tasks added over time.
- AIME - a benchmark built from problems of the American Invitational Mathematics Examination, a challenging high school mathematics competition
- Simple Bench - a multiple-choice text benchmark for LLMs whose questions require only high-school-level knowledge, yet a non-specialized human baseline outperforms SOTA models. Over 200 questions covering trick questions, social intelligence, and spatio-temporal reasoning.
- GLUE - General Language Understanding Evaluation, a collection of nine sentence- and sentence-pair language understanding tasks built on established datasets, covering a diverse range of genres and degrees of difficulty
- GAIA - a benchmark published by researchers from Meta (including Yann LeCun), which measures reasoning, multi-modality handling, web browsing, and tool use. Like HellaSwag, it is designed to be easy for humans but challenging for general AI assistants.
- ARC-AGI - created by François Chollet (creator of the Keras library) to measure the ability of AI systems to solve novel reasoning problems. The benchmark also comes with a prize of 1 million US dollars for beating it.
- MMMU - "Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI", a benchmark that covers disciplines such as engineering, art and design, science in general, humanities and social studies, business cases, and medicine. It includes images with diagrams, tables, plots and charts, photographs, chemical structures, and many more, covering cases where text and images are mixed.
- MathVista (testmini) - a benchmark combining challenges from diverse mathematical and visual tasks. It consists of 6,141 examples drawn from 28 existing multimodal datasets involving mathematics.
- Chart Q&A - contains a variety of complex questions about charts, requiring both visual and logical reasoning
- AI2D - contains over 5,000 grade school science diagrams with over 150,000 rich annotations, their ground truth syntactic parses, and more than 15,000 multiple-choice questions
- ANLS and ANLS* - Average Normalized Levenshtein Similarity and its generalization, a metric used to score a wide variety of document tasks, including information extraction and classification from documents (a minimal sketch of the metric follows this list)
- HumanEval - used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. Created in 2021, it was one of the most commonly run benchmarks in the 2023-2024 period. Currently GPT-4 level or better models score above 90%, necessitating the creation and use of better benchmarks for the frontier models (a sketch of the pass@k estimator used for scoring follows this list).
- MathQA-Python - a Python version of the MathQA benchmark, introduced by Google Research; contains 23,914 problems that evaluate the ability of models to synthesize code from more complex textual descriptions
- MBPP - Mostly Basic Python Problems, a benchmark consisting of around a thousand crowd-sourced Python programming problems designed to be solvable by entry-level programmers. Each problem consists of a task description, a code solution, and 3 automated test cases.
- BigCodeBench - a dataset of 1.14k problems and solutions related to coding in Python. Published on 7th of October 2024.
- BFCL - Berkeley Function Calling Leaderboard, which evaluates the ability of different LLMs to call functions/tools. Covers basic use cases as well as agentic workflows (an illustrative call-matching check follows this list).
- MetaTool - a benchmark designed to assess whether LLMs use the correct tools for a given situation. Includes the ToolE dataset, which contains prompts triggering single- and multi-tool use.
- API-Bank - a runnable evaluation system for tool-augmented LLMs consisting of 73 APIs, published in 2023
- WorkArena++ - a suite of browser-based tasks measuring web agent performance; also part of BrowserGym, which bundles many smaller agentic benchmarks
- MATH - a dataset of 12,500 challenging competition mathematics problems. Each problem has a step-by-step solution that can be used to teach models to generate answer derivations and explanations.
- MGSM - Multilingual Grade School Math Benchmark is a collection of grade-school math problems. Consists of 250 problems from GSM8K translated across 10 languages.
- GSM8K - introduced in 2021, a dataset of 8.5k high-quality grade-school math word problems. Each problem takes between 2 and 8 steps to solve, requiring a sequence of basic arithmetic operations to reach the final answer (a short loading/parsing sketch follows this list).
- Putnam Bench - a multi-language benchmark (formalizations in proof languages such as Lean 4, Isabelle, and Coq) for evaluating the ability to solve competition mathematics problems. Includes 1692 formalizations of 640 college-level problems from the William Lowell Putnam Mathematical Competition.
- LegalBench - a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. Built by subject-matter experts from many highly esteemed universities, it measures legal reasoning capabilities that are useful to lawyers as well as reasoning skills that lawyers find interesting.
- LAB-Bench - Language Agent Biology Benchmark, evaluates LLM capabilities in literature search, protocol planning, and data analysis
- KLEJ - stands for Kompleksowa Lista Ewaluacji Językowych ("Comprehensive List of Language Evaluations", with "klej" also meaning "glue" in Polish), the Polish variant of the aforementioned GLUE benchmark, made by Allegro
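
As a reference for the ANLS entry above, here is a minimal sketch of the Average Normalized Levenshtein Similarity computation as it is commonly defined for document VQA and extraction tasks; the 0.5 threshold and the per-question maximum over ground-truth answers follow the usual formulation and are not tied to any particular library's implementation.

```python
# Minimal sketch of ANLS (Average Normalized Levenshtein Similarity).
# Threshold and normalization follow the common convention, not a specific library.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if not a:
        return len(b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions, ground_truths, threshold=0.5):
    """predictions: list[str]; ground_truths: list[list[str]] (acceptable answers per question)."""
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, a) / max(len(p), len(a), 1)
            best = max(best, 1.0 - nl)
        # Scores below the threshold are zeroed out (treated as wrong).
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

print(anls(["42 usd"], [["42 USD", "$42"]]))  # -> 1.0
```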
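
For the HumanEval entry, the headline numbers are pass@k scores computed from how many generated samples pass the unit tests. Below is a short sketch of the unbiased pass@k estimator from the HumanEval paper; the sample counts in the usage example are made up for illustration, not real model results.

```python
# Unbiased pass@k estimator (Chen et al., 2021): for one problem, given n
# generated samples of which c pass the unit tests,
#   pass@k = 1 - C(n - c, k) / C(n, k)
# and the benchmark score is the mean over all 164 problems.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative numbers only: 200 samples per task, 37 of which pass.
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))  # noticeably higher, sampling more helps
```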
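
For the BFCL entry, a function-calling evaluation ultimately compares the tool call a model emits against an expected call. The sketch below is only illustrative: the JSON shapes, function name, and arguments are assumptions for the example, not BFCL's actual data format (BFCL itself uses more elaborate matching of the parsed call).

```python
# Illustrative check of a model-emitted tool call against an expected call.
# The schema below is hypothetical, not BFCL's real format.
import json

expected = {"name": "get_weather", "arguments": {"city": "Warsaw", "unit": "celsius"}}

# Pretend this string came back from the model under evaluation.
model_output = '{"name": "get_weather", "arguments": {"city": "Warsaw", "unit": "celsius"}}'

def call_matches(raw: str, want: dict) -> bool:
    """True if the model produced valid JSON naming the right function with the expected arguments."""
    try:
        got = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return got.get("name") == want["name"] and got.get("arguments") == want["arguments"]

print(call_matches(model_output, expected))  # True
```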
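
For the GSM8K entry, a quick way to inspect the 2-8 step problems is to pull the dataset from the Hugging Face Hub. The dataset id ("openai/gsm8k", config "main"), the "question"/"answer" fields, and the trailing "#### <number>" line in each reference solution are assumptions based on the publicly hosted version of the dataset.

```python
# Hedged sketch: load GSM8K from the Hugging Face Hub and extract the reference answer.
from datasets import load_dataset

ds = load_dataset("openai/gsm8k", "main", split="test")

def final_answer(solution: str) -> str:
    # Reference solutions end with a line like "#### 72".
    return solution.split("####")[-1].strip()

sample = ds[0]
print(sample["question"])
print("reference answer:", final_answer(sample["answer"]))
```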
Provided by the team from SpeakLeash:
- Open PL LLM Leaderboard - uses polemo2, KLEJ, polqa, and many more datasets
- EQ-Bench - a Polish emotional intelligence benchmark, using the arena format to gauge the emotional capabilities of LLMs
- MT-Bench - a Polish version of the aforementioned MT-Bench
- Polish Medical Leaderboard - uses the PES 2018-2022 dataset (Polish medical specialization exams) to gauge LLMs' ability to answer medical questions
- CPTUB - Complex Polish Text Understanding Benchmark, evaluates the capability to correctly interpret complex texts, sarcasm, implicatures, and phrases
- Claude 3.5 Sonnet announcement - benchmarks used there: GPQA, MMLU, BIG-Bench-Hard, DROP, MATH, MGSM, GSM8K, HumanEval; multimodal ones: MathVista (testmini), AI2D, Chart Q&A, ANLS
- GPT-4 technical report - benchmarks used there and not mentioned above: HellaSwag, ARC, WinoGrande
- HuggingFace - hosts many of the leaderboards and benchmarks listed above